I used two main tables: one table held information about the users and the other had all the messages they sent.
User Data
I received a JSON file that contained about 50 columns of information for over 35,000 users. My first pass through the data identified or created useful information by...
- Removing Columns
- Turning Columns into Binary Values
- Extracting Useful Information
- Simple Feature Engineering
- Overall User Engagement
Some columns had useless or duplicate information, such as a repeated user id or read/write permissions (they were all set to true for everyone). Other columns had personal information like someone's name.
For example, I had the Facebook id for about 60% of users. I'm not interested in using the Facebook id itself, but it might be useful to know who entered their Facebook id, so I turned it into a 1 (information included) or 0 (information not included).
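In pandas, that kind of presence/absence flag is a one-liner. A minimal sketch, assuming a column named facebook_id (the real column name may differ):

```python
import pandas as pd

# Toy stand-in for the user table; "facebook_id" is a hypothetical column name.
users = pd.DataFrame({"facebook_id": ["1234567890", None, "9876543210", None]})

# 1 if the user supplied the value, 0 otherwise; the id itself is dropped.
users["has_facebook"] = users["facebook_id"].notnull().astype(int)
users = users.drop(columns=["facebook_id"])
```

The same pattern works for any sensitive column where only the fact of entry matters.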
From the 'profile created' time stamp I could extract information about the time of day or day of week when the user created their profile. From their email, I could see what email service they used (gmail was the most popular).
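Both extractions are straightforward with pandas datetime and string accessors. A sketch with made-up column names and values:

```python
import pandas as pd

# Hypothetical sample rows; real column names may differ.
users = pd.DataFrame({
    "profile_created": ["2017-03-04 08:15:00", "2017-03-05 22:40:00"],
    "email": ["alice@gmail.com", "bob@yahoo.com"],
})

created = pd.to_datetime(users["profile_created"])
users["created_hour"] = created.dt.hour          # time of day
users["created_dayofweek"] = created.dt.dayofweek  # 0 = Monday
# Everything after the "@" is the email service.
users["email_service"] = users["email"].str.split("@").str[1]
```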
Some features were more difficult to extract. For example, information about neighborhoods was kept in a separate table. Users could select more than one neighborhood, so I couldn't simply join the tables on the neighborhood id.
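One way to handle a many-to-one relationship like this is to pivot the neighborhood table into one 0/1 column per neighborhood before merging. A sketch, with hypothetical table and column names:

```python
import pandas as pd

# Stand-in for the separate neighborhood table: one row per
# (user, neighborhood) pair, so a user can appear more than once.
prefs = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "neighborhood": ["Mission", "SoMa", "Mission", "Sunset"],
})

# One row per user, one indicator column per neighborhood,
# ready to merge onto the main user table on user_id.
hoods = pd.crosstab(prefs["user_id"], prefs["neighborhood"])
```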
As a simple example of feature engineering, we can compute a rent range from the min and max rent columns.
By counting up how many optional values the user included, we can get a rough idea of how engaged they will be with the app.
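Both of those engineered features can be sketched in a few lines. The column names here are assumptions, not the real schema:

```python
import pandas as pd

# Hypothetical user rows with rent limits and two optional fields.
users = pd.DataFrame({
    "min_rent": [800, 1200, 1000],
    "max_rent": [1200, 2000, 1500],
    "bio": ["hi", None, "hello"],
    "facebook_id": [None, "123", "456"],
})

# Rent range from the min and max rent information.
users["rent_range"] = users["max_rent"] - users["min_rent"]

# Rough engagement score: how many optional fields were filled in.
optional_cols = ["bio", "facebook_id"]  # stand-ins for the real optional columns
users["engagement"] = users[optional_cols].notnull().sum(axis=1)
```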
At the end, I had about 65 columns of information on each of 25,000 users.
Message Data
The message data was a bit harder to tease apart. It appeared to be a mix of two tables, which I called the conversation table and the last_message table. Even with the pandas read_json function, there was a lot to do to get the data grouped appropriately by conversation.
- Get Conversations
- Get Information about the First Message
- Get User IDs
- Simple Feature Engineering
- Restricting Dates
The original JSON had two parts, which I called the "conversations" part and the "users" part (not user information, alas). I just needed the first part.
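Pulling out just the one part is simple once the file is parsed. A toy sketch (the key names here mirror my labels, not necessarily the real ones):

```python
import json
import pandas as pd

# Toy stand-in for the original file: two top-level parts,
# only the first of which is needed.
raw = json.loads("""
{"conversations": [{"conv_id": "a", "messages": []},
                   {"conv_id": "b", "messages": []}],
 "users": [{"id": "u1"}]}
""")

conversations = pd.DataFrame(raw["conversations"])
```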
Not every message had every feature, so I first had to build the full set of features across all messages, and then pull out whatever information the first message in each conversation could provide.
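That two-step process — union the keys, then read the first message with missing keys left empty — can be sketched like this (the message fields are hypothetical):

```python
# Hypothetical conversations, each a list of message dicts;
# not every message carries every key.
conversations = [
    [{"sent": "2017-04-01", "body": "hi"}],
    [{"sent": "2017-04-02", "body": "hey", "attachment": "photo.jpg"}],
]

# Union of all keys seen anywhere in the data.
all_keys = set()
for conv in conversations:
    for msg in conv:
        all_keys.update(msg.keys())

# First message of each conversation, with None for missing keys.
first_messages = [
    {key: conv[0].get(key) for key in sorted(all_keys)}
    for conv in conversations
]
```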
Obviously it is important to know who the two users are in each conversation, but this information wasn't provided directly. Instead, the two ten-digit user ids were smushed together into a twenty-digit conversation identifier, and I had to work out which ten digits were the sender and which ten were the receiver.
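Splitting the identifier is just string slicing; deciding which half is the sender needs some other signal, such as a sender field on the first message. A sketch with made-up ids:

```python
# A twenty-digit conversation id is the two ten-digit user ids
# concatenated (hypothetical values).
conv_id = "12345678909876543210"
user_a, user_b = conv_id[:10], conv_id[10:]

# Hypothetical: the first message records who sent it, which tells us
# which half is the sender and which is the receiver.
first_sender = "9876543210"
sender = user_a if user_a == first_sender else user_b
receiver = user_b if sender == user_a else user_a
```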
Simple feature engineering here meant things like computing the length of the first message and converting dates into a useful format.
Anything before March 1st didn't reliably have messages sent each day, and anything after September 30th was saved for a final test case.
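The last two steps — message length, date conversion, and the date cutoffs — fit together in a few pandas lines. Column names and dates below are illustrative:

```python
import pandas as pd

# Hypothetical conversation rows.
convs = pd.DataFrame({
    "first_message": ["hi there", "hello", "want to meet up?"],
    "sent": ["2017-02-15", "2017-05-01", "2017-10-05"],
})

convs["msg_length"] = convs["first_message"].str.len()
convs["sent"] = pd.to_datetime(convs["sent"])

# Keep only March 1 through September 30; later dates are held out
# as a final test case.
mask = (convs["sent"] >= "2017-03-01") & (convs["sent"] <= "2017-09-30")
train = convs[mask]
```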
I ended up with over 10,000 conversations, with 18 pieces of information about each.