In my years as a Data Scientist, and working with data in general, I’ve picked up a lot of tips and tricks along the way. My latest blog series shares some of the ones I’ve found particularly helpful, in the hope that you can apply them to make your day-to-day work life a little easier.
The first blog looked back to the beginning of any data science project: data exploration using Oracle ADS. You can read it here.
My second blog in the series will take you through the next stage, Data Preparation.
Data Preparation is where we look to enrich our dataset and transform it to suit our needs. This might mean dropping certain fields, cleaning the data to meet our requirements, transforming it (e.g., one-hot encoding), or creating new features.
I wanted to share some functions and tips that I’ve found helpful for this stage.
Load in our data
For this example, I will use the popular Titanic dataset, which you can download from Kaggle here. For the purposes of testing, we’ll just use the test dataset in this example.
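A minimal sketch of loading the data with pandas, assuming the file has been saved as test.csv in the working directory:

```python
import pandas as pd

# Load the Titanic test dataset downloaded from Kaggle
df = pd.read_csv("test.csv")

# Quick sanity check of the first few rows
print(df.head())
```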
Understanding missing values
In the Data Exploration stage, we identified that a number
of fields contain missing values – we’ll want to make sure to handle these appropriately
depending on what our requirements are.
A quick way to understand missing data is to draw up a table, which we can do using some code similar to the below:
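Something along these lines, where missing_values_table is a name I’ve assumed for the function:

```python
import pandas as pd

def missing_values_table(df):
    # Sum the nulls in each column of the dataframe
    nulls = df.isnull().sum()
    # Express those counts as a percentage of all records
    pct = 100 * nulls / len(df)
    # Combine counts and percentages into a single table,
    # sorted in ascending order of missing values
    table = pd.concat([nulls, pct], axis=1)
    table.columns = ["Missing Values", "% of Total Values"]
    return table.sort_values("Missing Values")
```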
This table references our Titanic dataframe, sums the nulls for each variable in the dataset, and sorts them in ascending order.
This also includes a percentage of the entire dataset, to
provide some context to our raw figures. This again may help us to decide on
what to do with the missing data.
As we’ve created this as a function, it’s simple to apply
and see the results.
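For example, using the sketch above:

```python
missing_values_table(df)
```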
We find that around 77% of records in our dataset have a null value for Cabin, and that almost 20% of records have a null value for Age.
This tallies with the values Oracle ADS presented in “Warnings” in the previous blog, so it provides additional confidence that the function is working correctly.
Dropping missing values
To drop variables from the dataset, we can use the drop
function from Pandas as below.
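A sketch of the call (here assigning the result back to the dataframe, rather than using inplace=True):

```python
# Drop the Cabin column; axis=1 specifies a column rather than a row
df = df.drop("Cabin", axis=1)
```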
It is possible to drop either a row or a column; in this case we’re dropping the column Cabin. To make sure of this, we include the axis=1 argument, which specifies that we’re dropping a column (and not a row).
It’s possible to remove multiple columns at once by using the drop function with square brackets. You’ll notice that it’s very similar to the above, but specifies two fields rather than one.
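A sketch of this, assuming Name and Ticket as the two example fields:

```python
# Drop two columns at once by passing a list in square brackets
df = df.drop(["Name", "Ticket"], axis=1)
```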
Best practice would be to run a quick check to make sure
the columns have been deleted as we expected – and they have.
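One quick way to do this:

```python
# Confirm the dropped columns are no longer in the dataframe
print(df.columns)
```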
Imputing missing values
Imputing values can be useful in some cases, where you make an “educated estimate” of what the missing value should be, based on information that you do know.
A simple example of this is imputing Age.
We can impute values in various ways; we could simply provide a random number, but we can improve on this.
We could provide the mean age (average) or median age
(middle point of the dataset). Which one we use can be determined by the
business problem or question we’re looking to solve.
We can use the describe function to provide some of
this information for us:
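For example:

```python
# Summary statistics for Age, including the mean and median (50%)
df["Age"].describe()
```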
We can see here that the mean age is 30, so we could simply apply the mean age to all missing ages.
To apply this change, we can utilise another function.
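A minimal sketch, assuming pandas’ fillna is the function used:

```python
# Replace missing ages with the mean age of the whole dataset
df["Age"] = df["Age"].fillna(df["Age"].mean())
```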
Now we can check to make sure Age has no missing values, and it doesn’t, so we have successfully imputed those values.
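For example:

```python
# Count the remaining nulls in Age; this should now return 0
df["Age"].isnull().sum()
```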
We could take this one step further and look to impute ages
based on a group of “similar” passengers.
Within the Titanic dataset, there are three class types (1, 2 and 3). We could impute ages by calculating the mean age of each class and applying that to the appropriate class.
This utilises a group by: we can group by Pclass and then calculate the mean Age for each group.
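A sketch of both steps; using transform to broadcast each class’s mean age onto its missing rows is my assumption of how the means could be applied (it would take the place of the overall-mean fill shown earlier):

```python
# Mean age within each passenger class
print(df.groupby("Pclass")["Age"].mean())

# Fill missing ages with the mean age of that passenger's class
# (assumed approach, using transform to align the means row by row)
df["Age"] = df["Age"].fillna(df.groupby("Pclass")["Age"].transform("mean"))
```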
This shows there is variation in age by class: those in class 1 tend to be older than those in classes 2 and 3. That’s additional context in our data, which we would miss by just applying the mean age of the entire dataset.
This is just one example of enriching our dataset to make it more detailed and thorough, which in turn we hope will return better outcomes.