Data Science Toolkit: Part 3 - Data Enrichment

Author: Philip Godfrey

In my years as a Data Scientist, and working with data in general, I’ve picked up a lot of tips and tricks along the way. My latest blog series is looking to share some of those I’ve found particularly helpful, with the hope that you could apply some of these to make your day-to-day work life a little easier.

My second blog in the series looked at all things Data Preparation. If you missed it, you can read it here.

My third and latest blog in the series focuses on all things Data Enrichment.

Data Enrichment is often included in the Data Preparation stage, where we look to enrich our dataset and transform it to suit our needs.

The preparation element looks to drop certain fields and clean the data depending on our requirements, whereas the enrichment stage can include creating new features (adding new fields) or transforming the existing data into something else (e.g., one-hot encoding) to create new features.

I wanted to share some helpful functions and tips that I’ve found to help with this.


Load in our data

As before, I will use the popular Titanic dataset, which you can download from Kaggle here. For the purposes of this example, we’ll just use the test dataset.
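As a minimal sketch of the loading step, assuming pandas and that the Kaggle file has been saved locally as test.csv (the file name and the helper load_titanic are my assumptions): so the snippet runs on its own, it falls back to a few made-up rows in the same column layout when the file isn't there.

```python
import io
import os

import pandas as pd

# Assumed file name for the Kaggle download; adjust to wherever you saved it.
CSV_PATH = "test.csv"

# A few made-up rows in the Titanic column layout (not real passenger records),
# used only when the Kaggle file is not available locally.
SAMPLE_CSV = """PassengerId,Pclass,Name,Sex,Age
1,3,"Doe, Mr. John",male,34.0
2,1,"Roe, Mrs. Jane",female,
3,2,"Poe, Miss. Ann",female,22.0
"""


def load_titanic(path=CSV_PATH):
    """Read the CSV at `path`, falling back to the small sample above."""
    if os.path.exists(path):
        return pd.read_csv(path)
    return pd.read_csv(io.StringIO(SAMPLE_CSV))


df = load_titanic()
print(df.shape)  # (rows, columns) of whatever was loaded
```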


Imputing missing values

In the previous blog, we reviewed imputing values, where you make an “educated estimate” of what the missing value should be, based on information that you do know. The example we walked through was imputing Age.

Within the Titanic dataset, there are three class types (1, 2 and 3). We looked to impute missing ages by calculating the mean age of each class and applying that to passengers in the corresponding class.

This utilises a group by: we group by Pclass and then calculate the mean Age for each group.
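The group by can be sketched roughly as follows. The rows here are made up to stand in for the Titanic data, so the resulting means are illustrative only and not the real figures.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Titanic data (only the two columns we need).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 30.0, np.nan, 20.0, 24.0, np.nan],
})

# Group by class, then take the mean of Age within each group.
# mean() skips missing values by default.
mean_age_by_class = df.groupby("Pclass")["Age"].mean()
print(mean_age_by_class)
```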


This shows there is variation in ages by class – those in class 1 tend to be older than those in classes 2 and 3, so there’s additional context to our data, which we would miss when just applying the mean age of the entire dataset.

We could create this as an entirely new column, so the original Age remains as it was, while the new column holds the recorded age where it exists and the class mean age where it is missing. This new field, “PClass Age”, becomes an enrichment.
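One way to build that new column, again on made-up rows, is to combine groupby with transform and fillna. This is a sketch of the idea rather than the only way to do it.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Titanic data.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 30.0, np.nan, 20.0, 24.0, np.nan],
})

# Mean age of each passenger's class, aligned row by row with the original frame.
class_mean_age = df.groupby("Pclass")["Age"].transform("mean")

# New column: keep the recorded age where present, otherwise use the class mean.
# The original Age column is left untouched.
df["PClass Age"] = df["Age"].fillna(class_mean_age)
print(df)
```

Using transform rather than a plain mean keeps the result the same length as the DataFrame, so it can be passed straight to fillna.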

 

Creating new columns and features

As well as the above example, we can extract information from existing columns, such as extracting Title from the Name field, which might give us context on whether a passenger is male or female.


We can see the name includes “Mr.” or “Mrs.” which could give us context to whether each passenger was male or female.

This utilises a regular expression to match a run of letters followed by a “.”, e.g., we’re hoping to find “Mr.”, “Mrs.” etc.
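A minimal sketch of the extraction using the pandas str.extract method. The names below are made up in the same "Surname, Title. First name" layout as the Titanic data, and the exact pattern is my assumption of one common way to write it.

```python
import pandas as pd

# Made-up names in the Titanic "Surname, Title. First name" layout.
df = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Mrs. Jane (Mary Smith)",
        "Poe, Miss. Ann",
        "Moe, Master. Tom",
        "Coe, Dr. Sam",
    ]
})

# A space, then one or more letters (captured), then a literal full stop.
# expand=False returns a Series holding the captured group of the first match.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(df)
```

The leading space in the pattern means the match starts at a word boundary after the surname, and the capture group leaves the full stop out of the extracted title.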


This should create a new column named Title, containing the information extracted from the Name field.

We can check that this has been created successfully using the head function, which it has.


We can then take a further look into these values, to understand what the dataset contains, using the unique function.
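Both checks, the preview with head and the list of distinct values with unique, can be sketched as follows. The names are made up, so the titles printed here are illustrative and not the full set found in the real dataset.

```python
import pandas as pd

# Made-up names in the Titanic "Surname, Title. First name" layout.
df = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Mrs. Jane (Mary Smith)",
        "Poe, Miss. Ann",
        "Moe, Master. Tom",
        "Low, Mr. Adam",
        "Coe, Dr. Sam",
    ]
})
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# head() returns the first five rows by default, enough to eyeball the new column.
preview = df[["Name", "Title"]].head()
print(preview)

# unique() returns each distinct title once, in order of first appearance.
titles = df["Title"].unique()
print(titles)
```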

We can see we have a variety of titles, which may be useful for identifying types of passengers, and which ultimately provide a richer dataset for us to work with.

