In my years as a Data Scientist, and working with data in general, I’ve picked up a lot of tips and tricks along the way.
My latest blog series shares some of the ones I've found particularly helpful, in the hope that you can use them to make your day-to-day work life a little easier.
For this first blog in the series, we'll go back to the beginning of any data science project: data exploration and data preparation, using the Oracle Accelerated Data Science (ADS) SDK.
Data exploration is key: if we don't know our data, how can we even begin to create useful machine learning models? This step helps us describe, and therefore understand, the data. We can assess the quality of the dataset, which in turn determines whether it's appropriate for us to start training models on.
Some key points to think about:
- Size of the dataset – how many attributes and rows do we have?
- Completeness – do we have any missing/NULL fields?
- Data types – are there any data types that are incorrect?
- Duplicates – do we have any duplicate records that we need to handle?
Data Preparation is where we enrich our dataset and transform it to suit our needs. This might mean dropping certain fields, cleaning the data according to our requirements, transforming the data (e.g., one-hot encoding), or creating new features.
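As a quick illustration of the one-hot encoding step, here's a minimal sketch using plain pandas (not ADS), with a made-up column standing in for a Titanic categorical feature like Embarked:

```python
import pandas as pd

# A toy stand-in for a categorical feature such as the Titanic "Embarked" column.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One-hot encode: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked")
print(encoded.columns.tolist())
# ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```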
All of this takes time, in some cases a lot of time, but one package I've found really helps with this is Oracle ADS (Accelerated Data Science SDK).
What is Oracle Accelerated Data Science (ADS)?
It is a Python package that I've been using in the Oracle Data Science platform to help data scientists through the entire end-to-end data science workflow.
How to install?
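If you're working in an Oracle Data Science notebook session, ADS comes pre-installed. Elsewhere, it can be installed from PyPI; at the time of writing the package is published under the name `oracle-ads`:

```shell
# Install the ADS SDK from PyPI (published as "oracle-ads")
pip install oracle-ads
```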
Load in our data
For this example, I will use the popular Titanic dataset, which you can download from Kaggle. For the purposes of this walkthrough, we'll just use the test dataset.
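Loading the file is standard pandas. In practice you'd point `read_csv` at the `test.csv` you downloaded from Kaggle (the filename is an assumption here); to keep this sketch self-contained, a few hand-made rows with the same schema stand in for it:

```python
import pandas as pd

# Real usage: df = pd.read_csv("test.csv")  # the file downloaded from Kaggle
# Self-contained stand-in with a few Titanic-style rows:
df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, 47.0, None],   # note the missing value, as in the real data
    "Fare": [7.8292, 7.0, 9.6875],
})
print(df.shape)  # (3, 5)
```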

What functions are available?
There are several functions available in Oracle ADS, given that it can be used for everything from returning data from object storage all the way through to machine learning model deployment.
The focus of this blog will be on two functions for data exploration:
- `show_in_notebook`
- `suggest_recommendations`
`show_in_notebook`
The Oracle ADS `show_in_notebook` function creates a preview of all the basic information about the dataset.
It gives a great overview of the data: the number of rows and columns, the data types/feature types of each column, visualisations of each column, correlations, and warnings about columns. These warnings flag things like sparsely populated or highly skewed columns, for example.
- A Summary expressing the overall features of the dataset.
- For each Feature, a dedicated chart representing the distribution.
- A Correlation tab showing the similarity between features.
- A Data tab showing examples of the dataset.
Note: in order to use `show_in_notebook`, you first need to load your data as an `ads.dataset` rather than a plain DataFrame.
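A minimal sketch of that conversion, assuming `oracle-ads` is installed and a `test.csv` file is present (both assumptions): in the version I've used, `DatasetFactory.open` does the loading, and `show_in_notebook` renders the report inside a notebook cell. The `explore` wrapper is just a hypothetical name for this sketch:

```python
import importlib.util

def explore(csv_path="test.csv"):
    """Load a CSV as an ADS dataset and render the exploration report."""
    from ads.dataset.factory import DatasetFactory  # ADS dataset loader
    ds = DatasetFactory.open(csv_path)  # returns an ADS dataset, not a plain DataFrame
    ds.show_in_notebook()               # summary, features, correlations, warnings
    return ds

# ADS is notebook-oriented, so only call explore() where the package is available.
ads_installed = importlib.util.find_spec("ads") is not None
```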
Summary
Features
A chart is produced for each attribute, explaining the distribution along with some useful summary statistics. In our example data, we can quickly explore the Survived attribute, or Fare, amongst many others within the dataset.
Correlations
We can understand which features are highly correlated (or not), and Oracle ADS provides a number of views here, including:
- Continuous vs Continuous
- Category vs Category
- Continuous vs Category
Another handy option is Warnings, which highlights key points to be aware of in your dataset. These can include missing values and high or low cardinality, all of which can play a role in your decision to clean the data.
Warnings
`suggest_recommendations`
This function flags any issues that have been identified in the dataset and recommends changes/remediations to fix them.
As the screenshot below shows, a number of options have been suggested for Age, which contains a number of missing values. The recommendation is to fill the missing values with the mean, which could well be suitable for what we need. However, as always, you'll need to apply your domain knowledge to these suggestions; they are certainly useful first steps to take.
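Calling it follows the same pattern as before, assuming `ds` is an ADS dataset loaded with `oracle-ads` as above (the `recommend` wrapper name is hypothetical). In the version I've used there is also an `auto_transform()` method that applies the suggested fixes for you, though I'd review the suggestions first:

```python
import importlib.util

def recommend(ds):
    """Render the suggested remediations (e.g. fill missing Age with the mean)."""
    # Shows a table of detected issues and recommended fixes for the ADS dataset.
    ds.suggest_recommendations()

# Guard for environments where the ADS package isn't installed.
ads_installed = importlib.util.find_spec("ads") is not None
```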
The best part of these functions is, as you've likely noticed already, that each runs in a single line of code and can cut your data exploration time down from hours to a matter of minutes.
Data Preparation then means taking some of the suggestions above, combining them with your domain knowledge and business requirements, and applying them to your dataset.
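Applying the Age recommendation by hand, for instance, is a one-liner in plain pandas. A sketch with made-up values standing in for the Titanic data:

```python
import pandas as pd

# Made-up Age column with a gap, standing in for the Titanic data.
df = pd.DataFrame({"Age": [22.0, None, 38.0, 26.0]})

# Fill missing values with the column mean, as the recommendation suggests.
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df["Age"].tolist())
# [22.0, 28.666666666666668, 38.0, 26.0]
```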