How Did I Fill Blanks?

Of the data I was interested in, 42% of the values were blank.

Some were blank because the user hadn’t reached that stage. If a user stopped after filling in information about their house (stage 3), then I won’t have any information about their employment (stage 4).

Others were blank because the user skipped that field. If a user didn’t want to enter their birthdate, then I won’t have any information about their age.
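The two kinds of blank can be told apart if you know which stage each field belongs to and how far the user got. Here is a minimal sketch of that idea. The field names, the `FIELD_STAGE` mapping, and the stage assigned to `birthdate` are assumptions made up for illustration; only the house (stage 3) and employment (stage 4) stages come from the examples above.

```python
# Hypothetical mapping from field name to the stage where it is collected.
# Stages 3 and 4 follow the examples in the text; the rest is assumed.
FIELD_STAGE = {"birthdate": 1, "house_type": 3, "employer": 4}


def classify_blank(field, last_stage_reached, value):
    """Return why a field is blank, or None if it has a value."""
    if value is not None:
        return None
    if FIELD_STAGE[field] > last_stage_reached:
        return "not_reached"  # the user never saw this field
    return "skipped"          # the user saw the field and left it empty


# A hypothetical user who stopped after stage 3 and skipped their birthdate.
user = {"last_stage": 3, "birthdate": None, "house_type": "detached", "employer": None}
reasons = {f: classify_blank(f, user["last_stage"], user[f]) for f in FIELD_STAGE}
print(reasons)
# {'birthdate': 'skipped', 'house_type': None, 'employer': 'not_reached'}
```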

I considered three ways of handling these values:

I tested each method with each of four models (logistic regression, random forest, gradient boosting, and AdaBoost), under two scenarios (all users tested together, and users tested separately by stage).
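To make the size of that experiment concrete, here is a rough sketch of the grid: three fill methods, four models, and two scenarios give 24 runs. The fill-method labels are placeholders, and `split_users` with its `last_stage` key is a hypothetical helper showing one way the "separately" scenario could be set up.

```python
from collections import defaultdict
from itertools import product

FILL_METHODS = ["method_1", "method_2", "method_3"]  # placeholder labels
MODELS = ["logistic regression", "random forest", "gradient boosting", "AdaBoost"]
SCENARIOS = ["together", "separately"]

# Every (fill method, model, scenario) combination to evaluate.
grid = list(product(FILL_METHODS, MODELS, SCENARIOS))
print(len(grid))  # 24


def split_users(users, scenario):
    """Group users for one scenario: a single group, or one group per stage."""
    if scenario == "together":
        return {"all": list(users)}
    groups = defaultdict(list)
    for user in users:
        groups[user["last_stage"]].append(user)  # assumed key name
    return dict(groups)


# Hypothetical users, identified only by the last stage they reached.
users = [{"last_stage": 3}, {"last_stage": 4}, {"last_stage": 3}]
print({k: len(v) for k, v in split_users(users, "separately").items()})
# {3: 2, 4: 1}
```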

[Figure: results for each blank-filling method. Panels: "blanks, together" and "blanks, separately".]

Observations:

  1. It’s much better to predict on each stage separately.
  2. The method of filling in blanks doesn’t really matter when we predict on each stage separately, but it’s hugely important when we group all stages together. This is probably because there are far fewer NaN values to fill when each stage is handled on its own.
  3. Filling in values randomly performs the worst. This was a disappointment to me, because conceptually it was the method I liked most.
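For readers unfamiliar with random filling, here is one common way to do it: replace each blank with a value drawn at random from the non-blank values in the same column. This is only a sketch under that assumption, since the details of the random method aren't described here, and `fill_randomly` and the sample `ages` column are hypothetical.

```python
import random


def fill_randomly(column, seed=0):
    """Replace each None with a random draw from the column's observed values."""
    rng = random.Random(seed)  # seeded so the fill is reproducible
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to sample from, leave blanks as they are
    return [v if v is not None else rng.choice(observed) for v in column]


ages = [34, None, 51, None, 27]  # hypothetical column with two blanks
filled = fill_randomly(ages)
print(filled)
```

Sampling from observed values keeps the column’s overall distribution roughly intact, but the filled value has no relationship to the rest of that user’s row.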

