Author: Philip Godfrey
In my years as a Data Scientist, and working with data in general, I’ve picked up a lot of tips and tricks along the way. My latest blog series is looking to share some of those I’ve found particularly helpful, with the hope that you could apply some of these to make your day-to-day work life a little easier.
My second blog in the series looked at all things Data
Preparation. If you missed it, you can read it here.
My third and latest blog in the series focuses on all
things Data Enrichment.
Data Enrichment is often included in the Data Preparation stage, where we look to enrich our dataset and transform it to suit our needs.
The preparation element looks to drop certain
fields and clean the data depending on our requirements, whereas the enrichment
stage can include creating new features (adding in new fields) or transforming
the existing data into something else (e.g., one-hot encoding) to create new
features.
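As a quick illustration of the one-hot encoding mentioned above, here is a minimal sketch using pandas. The rows are made up for the example; only the Sex column name mirrors the Titanic data.

```python
import pandas as pd

# Made-up rows for illustration; only the column name mirrors the Titanic data.
passengers = pd.DataFrame({"Sex": ["male", "female", "female"]})

# get_dummies replaces Sex with one indicator column per category
# (Sex_female and Sex_male), holding 1 where the category applies and 0 otherwise.
encoded = pd.get_dummies(passengers, columns=["Sex"], dtype=int)
print(encoded)
```

Each category becomes its own numeric column, which is the form many models expect.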
I wanted to share some helpful functions and tips that I’ve
found to help with this.
Load in our data
As in the previous blogs, I will use the popular Titanic dataset, which you can download from Kaggle here. For the purposes of testing, we’ll just use the test dataset in this example.
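A minimal sketch of the loading step with pandas is below. So that it runs anywhere, it reads a few made-up rows from an in-memory string rather than the real file. With the Kaggle download in place, you would pass the file path to read_csv instead (the name "test.csv" is an assumption about where you saved it).

```python
import io
import pandas as pd

# Made-up rows standing in for the Kaggle file, so the snippet is self-contained.
# With the real download, something like pd.read_csv("test.csv") would be used
# (the file name and location are assumptions).
sample_csv = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age\n"
    '1,3,"Doe, Mr. John",male,30\n'
    '2,1,"Roe, Mrs. Jane",female,\n'
    '3,1,"Poe, Miss. Ann",female,40\n'
)

df = pd.read_csv(sample_csv)

# Quick look at the size of the data and how many values are missing per column.
print(df.shape)
print(df.isna().sum())
```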
Imputing missing values
In the previous blog, we reviewed imputing values, where you make an “educated
estimate” of what the missing value should be, based on information that you do
know. The example we walked through was imputing Age.
Within the Titanic dataset, there are three class types (1, 2 and 3). We looked to impute missing ages by calculating the mean age of each class and applying that to the passengers in the corresponding class.
This utilises a group by: we can group by Pclass, select
Age, and then calculate the mean value for each group.
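The group by described above could be sketched as follows. The DataFrame here is made up for illustration, so the numbers are not the real Titanic means; only the Pclass and Age column names come from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing age.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, None, 30.0, 28.0, 22.0, None],
})

# Group the rows by class, pick the Age column, and take the mean of each group.
# Missing ages are skipped when the mean is calculated.
mean_age_by_class = df.groupby("Pclass")["Age"].mean()
print(mean_age_by_class)
```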
This shows there is variation in ages by class – those in
class 1 tend to be older than those in classes 2 and 3, so there’s additional
context to our data, which we would miss when just applying the mean age of the
entire dataset.
We could create this as an entirely new column, so the original
Age remains as it was, while the new column fills each missing value with the
mean age for that passenger’s class. This new field, “PClass Age”, becomes an enrichment.
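One way to build such a column is sketched below, using groupby with transform so that every row receives the mean age of its own class. This is an assumption about the approach rather than the exact code from the previous blog, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing age.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, None, 30.0, 28.0, 22.0, None],
})

# transform("mean") returns one value per row: the mean age of that row's class.
class_mean_age = df.groupby("Pclass")["Age"].transform("mean")

# Keep known ages, fill only the gaps, and store the result in a new column
# so the original Age column is left untouched.
df["PClass Age"] = df["Age"].fillna(class_mean_age)
print(df)
```

Keeping both columns makes it easy to compare the raw and imputed values later on.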
Creating new columns and features
As well as the above example, we can extract information
from existing columns, such as extracting Title from the Name
field.
We can see the name includes “Mr.” or “Mrs.”, which could
give us context on whether each passenger was male or female.
This utilises a regular expression to match a run of
letters followed by a “.”, e.g., we’re hoping to find “Mr.”, “Mrs.” etc.
This should create a new column named Title,
holding the information extracted from the Name field.
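A minimal sketch of this extraction with pandas is below. The pattern is an assumption about how the title could be matched (a space, then letters, then a period), and the names are made up in the same "Surname, Title. First name" style as the dataset.

```python
import pandas as pd

# Made-up names in the same "Surname, Title. First name" style as the dataset.
df = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Mrs. Jane",
        "Poe, Miss. Ann",
        "Low, Master. Tim",
    ]
})

# Match a space, capture one or more letters, then require a literal period.
# expand=False returns a Series, which can be assigned directly to a new column.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(df)
```

Names that do not match the pattern would get a missing value in Title, so it is worth checking for those afterwards.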
We can check whether this has been created successfully using
the head function, and it has.
We can then take a further look into these values, to understand what the dataset contains, using the unique function.
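Both checks could look something like the sketch below. The rows are again made up, so the titles shown are not the full set you would see in the real dataset.

```python
import pandas as pd

# Made-up rows with a Title column already extracted from Name.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Low, Mr. Tim"],
    "Title": ["Mr", "Mrs", "Miss", "Mr"],
})

# head shows the first rows, confirming the new column sits alongside Name.
print(df[["Name", "Title"]].head())

# unique lists each distinct title once, in order of first appearance.
titles = df["Title"].unique()
print(titles)
```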
We can see we have a variety of titles, which may help identify types of passengers and ultimately provide a richer dataset for us to work with.