How Did I Fill Blanks?

Of the data I was interested in, 42% of the values were blank.

Some were blank because that users hadn't reached a stage. If a user stopped after filling in information about their house (stage 3) then I won’t have any information about their employment (stage 4).

Others were blank becuase the users skipped that field. If a user didn’t want to enter their birthdate then I won’t have any information about their age.

I considered three ways of handling these values:

Randomly: Fill in blank values with a value drawn at random from the possible values in the training data, weighted by frequency.
Average: Fill in blank values with the average of the values in the training data.
Default: Fill in blank values with a default value determined based on human (my) judgement.

I tested each method for each of four models (logistic regression, random forest, gradient boosting, and adaboost), under two scenarios (all users tested together, users tested separately by stages).

Observations:

It’s much better to predict on each stage separately.
The method of filling in blanks doesn’t really matter when we predict on each stage separately, but it’s hugely important when group all stages together. This is probably a result of having so many fewer nan values.
Filling in values randomly does the worst. This was a disappointment to me. Conceptually, this was the method I liked most.

How Did I Fill Blanks?

Acme Co

How Did I

Choose a Metric?

How Did I

Fill Blanks?

How Did I

Calculate Probabilities?

How Did I

Display Predictions?