I used two main tables: one held information about the users, and the other held all the messages they sent.

User Data

I received a JSON file that contained about 50 columns of information for over 35,000 users. My first pass through the data identified or created useful information by...

  1. Removing Columns
     Some columns had useless or duplicate information, such as a repeated user ID or read/write permissions (they were set to true for everyone). Other columns had personal information, like someone's name.

  2. Turning Columns into Binary Values
     For example, I had the Facebook ID for about 60% of users. I wasn't interested in using the Facebook ID itself, but it might be useful to know who entered one, so I turned the column into a 1 (information included) or 0 (information not included). (See the sketch after this list.)

  3. Extracting Useful Information
     From the 'profile created' timestamp I could extract the time of day and day of week when the user created their profile. From their email address, I could see which email service they used (Gmail was the most popular).

     Some features were more difficult to get out. For example, information about neighborhoods was kept in a separate table, and users could select more than one neighborhood, so I couldn't simply join the tables on the neighborhood ID. (A sketch of both follows the list.)

  4. Simple Feature Engineering
     For example, we can get a rent range from the min and max rent columns (sketch below).

  5. Overall User Engagement
     By counting up how many optional values the user included, we can get a rough idea of how engaged they will be with the app (sketch below).
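
For step 2, the conversion is nearly a one-liner in pandas. This is a minimal sketch; the column name facebook_id is my stand-in for whatever the raw export actually used:

```python
import pandas as pd

# Toy stand-in for the real user table; 'facebook_id' is illustrative.
users_df = pd.DataFrame({'user_id': [1, 2, 3],
                         'facebook_id': ['1234567890', None, '5550001111']})

# 1 if the user entered a Facebook ID, 0 otherwise, then drop the raw ID.
users_df['has_facebook_id'] = users_df['facebook_id'].notna().astype(int)
users_df = users_df.drop(columns=['facebook_id'])
```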
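
For step 3, here's a sketch of the timestamp and email extraction, plus one way around the multi-valued neighborhood problem: aggregate the neighborhood table down to one row per user before merging. All column names are my guesses at the schema:

```python
import pandas as pd

# Illustrative columns; the real export used its own names.
users_df = pd.DataFrame({
    'user_id': [1, 2],
    'profile_created': ['2016-04-12 08:30:00', '2016-06-03 22:15:00'],
    'email': ['alice@gmail.com', 'bob@yahoo.com'],
})

# Time-of-day and day-of-week features from the creation timestamp.
created = pd.to_datetime(users_df['profile_created'])
users_df['created_hour'] = created.dt.hour
users_df['created_dayofweek'] = created.dt.dayofweek

# Email provider from the domain.
users_df['email_provider'] = users_df['email'].str.split('@').str[1]

# Neighborhoods live in a separate one-row-per-selection table, so a
# plain join would duplicate users; aggregate to one row per user first.
neighborhoods = pd.DataFrame({'user_id': [1, 1, 2],
                              'neighborhood_id': [10, 11, 10]})
counts = (neighborhoods.groupby('user_id').size()
          .rename('n_neighborhoods').reset_index())
users_df = users_df.merge(counts, on='user_id', how='left')
```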
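
Step 4's rent range is just a column difference (min_rent and max_rent are illustrative names):

```python
import pandas as pd

# min_rent / max_rent are illustrative column names.
users_df = pd.DataFrame({'min_rent': [800, 1200], 'max_rent': [1100, 1800]})

# The width of the acceptable range becomes a feature of its own.
users_df['rent_range'] = users_df['max_rent'] - users_df['min_rent']
```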
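
And for step 5, counting filled-in optional fields is a row-wise sum over a notna() mask. The optional columns here are invented for illustration; the real table had many more:

```python
import pandas as pd

# Invented optional fields.
users_df = pd.DataFrame({
    'bio': ['I like hiking', None, None],
    'facebook_id': ['1234567890', None, '5550001111'],
    'photo_url': ['me.jpg', None, None],
})
optional_cols = ['bio', 'facebook_id', 'photo_url']

# Count of filled-in optional fields as a rough engagement score.
users_df['engagement'] = users_df[optional_cols].notna().sum(axis=1)
```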

At the end, I had about 65 columns of information on each of 25,000 users.



Message Data

The message data was a bit harder to tease apart. It appeared to be a mix of two tables, which I called the conversation table and the last_message table. Even with the pandas read_json function, there was a lot to do to get the data grouped appropriately by conversation.

  1. Get Conversations
     The original JSON had two parts, which I called the "conversations" part and the "users" part (not user information, alas). I just needed the first part (sketch after the list).

  2. Get Information about the First Message
     Not every message had every feature, so I had to build the full set of features across all messages, and then pull out whatever information the first message in each conversation could provide (sketch below).

  3. Get User IDs
     Obviously it is important to know who the two users in each conversation are, but this information wasn't provided directly. Instead, the two ten-digit user IDs were smushed together into a twenty-digit conversation identifier, and I had to work out which ten digits were the sender's and which were the receiver's (sketch below).

  4. Simple Feature Engineering
     Things like the length of the first message and converting dates into a useful format (sketch below).

  5. Restricting Dates
     Anything before March 1st didn't have a reliable message sent each day, and anything after September 30th was saved for a final test case (sketch below).
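
For step 1, pulling out the conversations chunk is straightforward once you load the raw JSON yourself. The file name and key names here are illustrative:

```python
import json

# File and key names are illustrative; the real export used its own.
with open('messages.json') as f:
    raw = json.load(f)

# The export had two top-level parts; I only needed the first.
conversations = raw['conversations']
```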
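
For step 2, one approach (a sketch, with invented field names) is to take the union of every field seen in any message, then fill in what the first message of each conversation provides:

```python
import pandas as pd

# Toy stand-in for the parsed conversations; note the ragged fields.
conversations = [
    {'id': 'abc', 'messages': [{'text': 'hi', 'sent_at': '2016-05-01'}]},
    {'id': 'def', 'messages': [{'text': 'hello', 'attachment': 'x.jpg'}]},
]

# Union of every field seen in any message...
all_fields = set()
for conv in conversations:
    for msg in conv['messages']:
        all_fields.update(msg.keys())

# ...then pull whatever the first message in each conversation provides;
# missing fields become None so every row has the same columns.
rows = []
for conv in conversations:
    first = conv['messages'][0] if conv['messages'] else {}
    rows.append({field: first.get(field) for field in sorted(all_fields)})

first_messages = pd.DataFrame(rows)
```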
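
Step 3 boils down to string slicing once you know the layout. Deciding which half belonged to the sender took cross-checking against other fields, which I've skipped here:

```python
# A twenty-digit conversation ID is two ten-digit user IDs concatenated.
conv_id = '12345678909876543210'  # illustrative value
user_a, user_b = conv_id[:10], conv_id[10:]
# Which of user_a / user_b is the sender still has to be worked out
# from other fields in the data.
```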
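
Step 4 in pandas, again with invented column names:

```python
import pandas as pd

# Illustrative first-message table.
first_messages = pd.DataFrame({
    'text': ['hi there', 'hello, I saw your listing'],
    'sent_at': ['2016-05-01 09:12:00', '2016-07-15 20:45:00'],
})

# First-message length, plus a usable datetime and day-of-week.
first_messages['msg_length'] = first_messages['text'].str.len()
first_messages['sent_at'] = pd.to_datetime(first_messages['sent_at'])
first_messages['sent_dayofweek'] = first_messages['sent_at'].dt.dayofweek
```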
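
And step 5 is a boolean mask on the sent date. The year below is a placeholder, since I haven't given one:

```python
import pandas as pd

first_messages = pd.DataFrame({
    'sent_at': pd.to_datetime(['2016-02-10', '2016-05-01', '2016-10-12']),
})

# Keep March 1 through September 30 for modeling; anything later
# is held out for the final test.
in_range = ((first_messages['sent_at'] >= '2016-03-01') &
            (first_messages['sent_at'] <= '2016-09-30'))
model_df = first_messages[in_range]
test_df = first_messages[first_messages['sent_at'] > '2016-09-30']
```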

I ended up with over 10,000 conversations, with 18 pieces of information about each.