Data Science Toolkit - Part 2: Data Preparation

In my years as a Data Scientist, and working with data in general, I’ve picked up a lot of tips and tricks along the way. My latest blog series looks to share some of those I’ve found particularly helpful, in the hope that you can apply them to make your day-to-day work life a little easier.

The first blog looked back at the beginning of any data science project, data exploration using Oracle ADS; you can read it here.

My second blog in the series will take you through the next stage, Data Preparation.

Data Preparation is where we look to enrich our dataset and transform it to suit our needs. We might drop certain fields, clean the data to meet our requirements, transform the data (e.g., one-hot encoding), or create new features.

I wanted to share some functions and tips that I’ve found helpful for this stage.

Loading our data

For this example, I will use the popular Titanic dataset, which you can download from Kaggle here. For the purposes of this walkthrough, we’ll just use the test dataset.
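As a quick sketch (assuming the test.csv file has been downloaded from Kaggle, and that we call our dataframe df), loading the data might look like this:

    import pandas as pd

    # Load the Titanic test set (assumes test.csv is in the working directory)
    df = pd.read_csv("test.csv")

    # Preview the first few rows
    df.head()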



Understanding missing values

In the Data Exploration stage, we identified that a number of fields contain missing values; we’ll want to make sure we handle these appropriately, depending on our requirements.

A quick way to understand missing data is to draw a table; we can do so using some code similar to the below.
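A minimal sketch of such a function, again assuming our dataframe is named df:

    def missing_values_table(df):
        # Sum the nulls in each column, sorted in ascending order
        total = df.isnull().sum().sort_values(ascending=True)
        # Express those counts as a percentage of all rows, for context
        percent = (df.isnull().sum() / len(df) * 100).sort_values(ascending=True)
        return pd.concat([total, percent], axis=1, keys=["Total", "Percent"])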


This table will reference our Titanic dataframe, sum the nulls for each variable within the dataset, and sort the results in ascending order.

This also includes a percentage of the entire dataset, to provide some context to the raw figures. This again may help us decide what to do with the missing data.

As we’ve created this as a function, it’s simple to apply and see the results.
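For example:

    # Apply the function to our Titanic dataframe
    missing_values_table(df)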


We find that around 77% of records in our dataset have a null value for Cabin, and that almost 20% of records have a null value for Age.

This tallies with the values Oracle ADS presented under “Warnings” in the previous blog, which provides additional confidence that the function is working correctly.


Dropping missing values

To drop variables from the dataset, we can use the drop function from Pandas as below.

It is possible to drop either a row or a column; in this case, we’re dropping the Cabin column. To make this explicit, we include the axis=1 argument, which specifies that we’re dropping a column (and not a row).
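A sketch of this, assuming our dataframe is still named df:

    # Drop the Cabin column (axis=1 means columns; axis=0 would mean rows)
    df = df.drop("Cabin", axis=1)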


It’s possible to remove multiple columns at once by passing the drop function a list in square brackets. You’ll notice that it’s very similar to the above, but specifies two fields rather than one.
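For example (the choice of Name and Ticket here is purely illustrative):

    # Drop two columns at once by passing a list
    # (Name and Ticket are hypothetical choices for this example)
    df = df.drop(["Name", "Ticket"], axis=1)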

Best practice would be to run a quick check to make sure the columns have been deleted as we expected – and they have.
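A quick way to do that:

    # Confirm the dropped columns no longer appear
    print(df.columns)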


Imputing missing values

Imputing values can be useful in some cases: you make an “educated estimate” of what the missing value should be, based on the information that you do know.

A simple example of this is imputing Age.   

We can impute values in various ways. We could simply provide a random number, but we can improve on this.

We could provide the mean age (the average) or the median age (the middle value of the dataset). Which one we use can be determined by the business problem or question we’re looking to solve.

We can use the describe function to provide some of this information for us:
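For example:

    # Summary statistics for Age, including the mean and the median (50%)
    df["Age"].describe()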

We can see here that the mean age is around 30, so we could simply apply the mean age to all missing ages.

To apply this change, we can utilise another Pandas function, as sketched below.
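Assuming fillna is the function used here, this might look like:

    # Fill missing ages with the mean age across the whole dataset
    df["Age"] = df["Age"].fillna(df["Age"].mean())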

Now we can check to make sure Age has no missing values, and it doesn’t, so we have successfully imputed those values.
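For example:

    # Should now return 0 missing values for Age
    df["Age"].isnull().sum()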


We could take this one step further and look to impute ages based on a group of “similar” passengers.

Within the Titanic dataset, there are three class types (1, 2 and 3). We could impute ages by calculating the mean age of each class and applying that to the corresponding class.

This utilises a group by: we can group by Pclass and then calculate the mean Age for each group.
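A sketch of the group by:

    # Mean age within each passenger class
    df.groupby("Pclass")["Age"].mean()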


This shows there is variation in ages by class: those in class 1 tend to be older than those in classes 2 and 3. That’s additional context in our data, which we would miss by just applying the mean age of the entire dataset.
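Putting this into practice, a group-wise version of the imputation might look like the following sketch (assuming we use this instead of the overall-mean fill from earlier, while Age still contains missing values):

    # Fill each missing age with the mean age of that passenger's class
    df["Age"] = df["Age"].fillna(df.groupby("Pclass")["Age"].transform("mean"))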

This is just one example of enriching our dataset to make it more detailed and thorough, which in turn we hope will return better outcomes.
