What is the goal of this project?

Using data from an online retailer, identify which cases may be fraudulent. Then build a web app to display the predictions.

Context

This was the last case study I worked on at Galvanize. It was amazing to see how far we had come from our first case study back in November. My primary role was at the beginning of the project: cleaning data, doing some exploratory data analysis, and creating the model we used to predict fraud.

What tools did you use?

I analyzed the data with Python, using pandas, numpy, and sklearn.


THE DATA

The training dataset consists of 14,337 events, of which 1,288 are identified by the company as fraud (8.98%). There are 44 columns of information about each event.



DATA CLEANING

Pandas has a great tool for reading JSON files, so most of the data was unpacked in a single line:


    import pandas as pd

    df = pd.read_json('/data/data.json')

There were a couple of columns that contained nested JSON, such as "ticket_types". A typical event might look like:


    [{'availability': 1,
      'cost': 25.0,
      'event_id': 527017,
      'quantity_sold': 0,
      'quantity_total': 800},
     {'availability': 1,
      'cost': 50.0,
      'event_id': 527017,
      'quantity_sold': 0,
      'quantity_total': 100},
     {'availability': 1,
      'cost': 550.0,
      'event_id': 527017,
      'quantity_sold': 0,
      'quantity_total': 20}]

To turn this into useful information, I wrote helper functions to calculate the total possible profit and the number of tickets for each event. I could then use simple division to find the average ticket price, and the len function gave me the number of ticket types.


    def get_max_profit(row):
        # Maximum possible revenue: price * tickets available, summed over ticket levels
        return sum(level['cost'] * level['quantity_total'] for level in row)

    def get_num_tickets(row):
        # Total number of tickets available across all ticket levels
        return sum(level['quantity_total'] for level in row)

    df['ticket_types_num_types'] = df.ticket_types.apply(len)
    df['ticket_types_max_profit'] = df.ticket_types.apply(get_max_profit)
    df['ticket_types_num_tickets'] = df.ticket_types.apply(get_num_tickets)
    df['ticket_types_avg_price'] = df.ticket_types_max_profit / df.ticket_types_num_tickets

I used a similar method to get information about previous payouts, as well as some simple datetime work to get the hour, day, and date each event was created.
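
Roughly, the datetime features looked something like the sketch below. The raw column name ('event_created') and the assumption that it is a Unix timestamp are mine, not necessarily what the dataset actually uses.

    # Sketch only: 'event_created' and unit='s' are assumptions about the raw data
    created = pd.to_datetime(df['event_created'], unit='s')
    df['created_hour'] = created.dt.hour
    df['created_day'] = created.dt.dayofweek
    df['created_date'] = created.dt.date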

The last step in data cleaning was to fill in NaN values with the mean of that column. This was only an issue for "has header", "average ticket price", and "average payout".
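
A minimal sketch of that imputation; 'has_header' and 'avg_payout' are stand-in column names, while 'ticket_types_avg_price' comes from the feature engineering above.

    # Fill missing values with the column mean (some column names are placeholders)
    for col in ['has_header', 'ticket_types_avg_price', 'avg_payout']:
        df[col] = df[col].fillna(df[col].mean())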



BUILDING A SIMPLE MODEL

Fitting a model was fairly straightforward. I split the data into train and test sets using "train_test_split" from sklearn's model_selection module.
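
Something along these lines, assuming the features are in X and the fraud label in y (the variable names and random_state are my own, not from the original code):

    from sklearn.model_selection import train_test_split

    # Hold out a test set before any resampling
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)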

I also oversampled the fraud cases because they made up such a small proportion of the total (less than 10%). By oversampling, I could train on a more balanced dataset.
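
One simple way to do that, and roughly what I mean here, is to resample the fraud rows with replacement in the training set only. This sketch assumes X_train and y_train are NumPy arrays and that fraud is labeled 1; the exact target ratio we used may have differed.

    import numpy as np

    # Indices of fraud and non-fraud rows in the training set
    fraud = np.flatnonzero(y_train == 1)
    legit = np.flatnonzero(y_train == 0)

    # Draw fraud rows with replacement until the classes are roughly balanced
    boot = np.random.choice(fraud, size=len(legit), replace=True)
    X_bal = np.vstack([X_train[legit], X_train[boot]])
    y_bal = np.concatenate([y_train[legit], y_train[boot]])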

I ran the data through sklearn's GradientBoostingClassifier (my favorite). Even without fine-tuning the parameters, I was able to get a very impressive 87% recall, 93% precision, and 98% accuracy.
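
The fit-and-score step was essentially the following sketch, reusing the variable names from the snippets above (the metric functions are from sklearn.metrics):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # Fit on the oversampled training data, evaluate on the untouched test set
    model = GradientBoostingClassifier()
    model.fit(X_bal, y_bal)
    y_pred = model.predict(X_test)

    print('recall:   ', recall_score(y_test, y_pred))
    print('precision:', precision_score(y_test, y_pred))
    print('accuracy: ', accuracy_score(y_test, y_pred))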

At this point, I handed the cleaned data and model to another team member to gridsearch, and I took a look at the most significant features.
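
Pulling the importances out of a fitted GradientBoostingClassifier is a one-liner; feature_names below is a placeholder for whatever list of column names the model was trained on.

    # Rank features by importance from the fitted model
    importances = pd.Series(model.feature_importances_, index=feature_names)
    print(importances.sort_values(ascending=False).head(10))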



INVESTIGATING RESULTS

From my simple model, the most important features were: