Data Science Toolkit – Part 1: Data Exploration with Oracle ADS

In my years as a Data Scientist, and working with data in general, I’ve picked up a lot of tips and tricks along the way.

My latest blog series looks to share some of those I’ve found particularly helpful, in the hope that you can use some of them to make your day-to-day work life a little easier.

For this first blog in the series, we’ll go back to the beginning of any data science project: data exploration and data preparation, using the Oracle Accelerated Data Science (ADS) SDK.

 

Data Exploration is key. If we don’t know our data, how can we even begin to create any useful Machine Learning models? This step helps us to describe, and therefore understand, the data. It also tells us about the quality of the dataset, which in turn determines whether it’s suitable for us to start training models on.

Some key points to think about:

  • Size of the dataset – how many attributes and rows do we have?
  • How complete is the dataset – do we have any missing/NULL fields?
  • Data types – are there any data types that are incorrect?
  • Duplicates – do we have any duplicate records that we need to handle?
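
To make the checklist above concrete, here is a minimal sketch of those four checks in plain Python. It uses only the standard library and a tiny made-up sample (the values are illustrative, not real Titanic records), and it assumes every column is meant to be numeric. The `profile` and `is_number` helpers are my own names for this example, not part of ADS.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a CSV file; the values are illustrative only.
SAMPLE = """PassengerId,Age,Fare
1,22,7.25
2,,71.28
3,26,abc
1,22,7.25
"""


def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile(rows):
    """Summarise size, missing values, suspicious types and duplicate rows."""
    columns = list(rows[0].keys()) if rows else []
    missing = {c: sum(1 for r in rows if r[c] == "") for c in columns}
    # Non-empty values that do not parse as numbers hint at a wrong data type.
    non_numeric = {
        c: sum(1 for r in rows if r[c] != "" and not is_number(r[c]))
        for c in columns
    }
    # A row counts as a duplicate if an identical row appeared earlier.
    seen = Counter(tuple(r[c] for c in columns) for r in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {
        "rows": len(rows),
        "columns": len(columns),
        "missing": missing,
        "non_numeric": non_numeric,
        "duplicates": duplicates,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile(rows)
print(report)
```

If you work in pandas, the same questions are usually answered with `df.shape`, `df.isna().sum()`, `df.dtypes` and `df.duplicated().sum()`.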

 

Data Preparation is where we look to enrich our dataset and transform it to suit our needs. We might drop certain fields, clean the data depending on our requirements, transform the data (e.g., one-hot encoding), or create new features.
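
If one-hot encoding or feature creation is new to you, the sketch below shows both on two made-up passengers. `Sex`, `SibSp` and `Parch` are columns in the Titanic data, but `FamilySize` is a feature I have invented for illustration, and `one_hot` is a hand-written helper rather than a library function.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Two made-up passengers; not real records from the dataset.
passengers = [
    {"Sex": "male", "SibSp": 1, "Parch": 0},
    {"Sex": "female", "SibSp": 0, "Parch": 2},
]

# New feature: siblings/spouses + parents/children + the passenger themselves.
for passenger in passengers:
    passenger["FamilySize"] = passenger["SibSp"] + passenger["Parch"] + 1

prepared = one_hot(passengers, "Sex")
print(prepared)
```

In pandas, `pandas.get_dummies` does the encoding step for you.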

All of this takes time, in some cases a lot of time, but one package I’ve found that has really helped with this is Oracle ADS (Accelerated Data Science SDK).

 

What is Oracle Accelerated Data Science (ADS)?

It is a Python package, which I’ve been using in the Oracle Data Science platform, designed to help data scientists through the entire end-to-end data science workflow.

 

How to install?

ADS is published on PyPI as `oracle-ads`, so it can be installed with `pip install oracle-ads`. If you’re working in an OCI Data Science notebook session, it typically comes preinstalled in the provided conda environments.

Load in our data

For this example, I will use the popular Titanic dataset, which you can download from Kaggle here. For the purposes of testing, we’ll just use the test dataset.
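
As a rough sketch of the loading step, the helper below opens a CSV as an ADS dataset. It assumes an ADS release that still ships `DatasetFactory` in `ads.dataset.factory` (newer releases may have moved or deprecated it), and that your file contains a `Survived` column to use as the target. The `load_dataset` name and the `titanic.csv` path are placeholders of mine.

```python
def load_dataset(path, target="Survived"):
    """Open a CSV as an ADS dataset with `target` marked as the column to predict."""
    # Imported lazily so this file can still be imported where ADS is not installed.
    from ads.dataset.factory import DatasetFactory

    return DatasetFactory.open(path, target=target)


# Example (hypothetical local file name; requires oracle-ads):
# ds = load_dataset("titanic.csv")
```

Note that Kaggle’s `test.csv` does not include `Survived`, so point this at a file that does if you want to explore that attribute.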


What functions are available?

Oracle ADS offers a wide range of functions, covering everything from reading data from object storage all the way through to Machine Learning model deployment.

The focus of this blog will be on two functions for data exploration:

  • `show_in_notebook`
  • `suggest_recommendations`

 

show_in_notebook

The Oracle ADS `show_in_notebook` function creates a preview of all the basic information about the dataset.

It gives a great overview of the data: the number of rows and columns, the data type/feature type of each column, a visualisation of each column, correlations, and warnings about columns, such as those that are sparsely populated or highly skewed.

  • A Summary expressing the overall features of the dataset.
  • For each Feature, a dedicated chart representing the distribution.
  • A Correlation tab showing the similarity between features.
  • A Data tab showing examples of the dataset.

 

Note: in order to use `show_in_notebook`, you need to transform your data into an `ads.dataset`; the data-loading step above should help.
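
A small sketch of that call is below. The `explore` wrapper is a hypothetical helper of mine: all it adds is a friendlier error if you accidentally pass something that is not an ADS dataset (a plain list or DataFrame, for example).

```python
def explore(ds):
    """Render the ADS overview for a dataset inside a notebook cell."""
    if not hasattr(ds, "show_in_notebook"):
        raise TypeError(
            "expected an ADS dataset with show_in_notebook(), got "
            + type(ds).__name__
        )
    ds.show_in_notebook()


# Example (requires oracle-ads and a notebook session):
# explore(ds)
```

In a notebook you would normally just call `ds.show_in_notebook()` directly; the wrapper is only there to make the requirement in the note explicit.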

 

Summary


Features

A chart is produced for each attribute showing its distribution, alongside some useful summary statistics.

In our example data, we can quickly explore the Survived attribute, or Fare, amongst many others within the dataset.




Correlations

We can understand which features are highly correlated (or not), and Oracle ADS lets us view a number of correlation types here, including:

  • Continuous vs Continuous
  • Category vs Category
  • Continuous vs Category
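
To give a feel for what sits behind those views, here is a hand-rolled sketch of two common measures: Pearson correlation for continuous vs continuous, and the correlation ratio for continuous vs category. As I understand the ADS documentation, the three views are based on Pearson, Cramér’s V and the correlation ratio respectively, but treat that mapping as an assumption and check the docs for your version. The numbers are made up and the functions are my own, not ADS code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_ratio(categories, values):
    """Correlation ratio (eta): 0 means the category explains nothing, 1 everything.

    Assumes `values` is not constant.
    """
    overall = sum(values) / len(values)
    groups = {}
    for category, value in zip(categories, values):
        groups.setdefault(category, []).append(value)
    # Variation of the group means around the overall mean, weighted by group size.
    between = sum(
        len(g) * (sum(g) / len(g) - overall) ** 2 for g in groups.values()
    )
    total = sum((v - overall) ** 2 for v in values)
    return math.sqrt(between / total)


# Made-up values: y is exactly twice x, so the relationship is perfectly linear.
linear = pearson([1, 2, 3, 4], [2, 4, 6, 8])
# Made-up values: the category fully determines the value in the first case only.
explained = correlation_ratio(["a", "a", "b", "b"], [1, 1, 5, 5])
unexplained = correlation_ratio(["a", "b", "a", "b"], [1, 1, 5, 5])
print(linear, explained, unexplained)
```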


Another handy option is Warnings, which explains key points to be aware of regarding your dataset. These can include missing values and high or low cardinality, all of which can play a role in your decision to clean the data.

 

Warnings


suggest_recommendations

This function flags any issues that have been identified with the dataset and recommends changes/remediations to fix them.

As the screenshot below shows, a number of options have been suggested for Age, which contains a number of missing values. The recommendation is to fill missing values with the mean, which could well be suitable for what we need. As always, you’ll need to apply your own domain knowledge to these suggestions, but they are certainly a useful first step.
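
To show what a "fill missing values with mean" suggestion amounts to, here is a minimal plain-Python sketch on some made-up ages. `fill_with_mean` is my own helper, not an ADS function; the ADS call itself is shown only as a comment because it needs the package and a loaded dataset.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to average, leave the column as-is
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up ages with two gaps; not taken from the real dataset.
ages = [22, None, 38, None, 30]
filled_ages = fill_with_mean(ages)
print(filled_ages)

# With an ADS dataset `ds` (requires oracle-ads, so not run here):
#   ds.suggest_recommendations()
```

The mean is sensitive to outliers, so for a skewed column you may decide the median is the better fill. That is exactly the kind of call where domain knowledge comes in.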



The best part of these functions is that, as you’ve likely noticed already, they each run in a single line of code and can cut your data exploration time from hours to a matter of minutes.

Data Preparation then takes some of these suggestions, combines them with your domain knowledge and business requirements, and applies them to your dataset.
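
If you are happy to accept every suggestion, my understanding is that ADS datasets also expose an `auto_transform()` method that applies them all and returns a transformed dataset. The sketch below assumes both `suggest_recommendations()` and `auto_transform()` exist in your ADS version; `prepare` is a hypothetical helper name of mine.

```python
def prepare(ds, show_suggestions=True):
    """Optionally print the ADS suggestions, then apply all of them in one go."""
    if show_suggestions:
        print(ds.suggest_recommendations())
    # auto_transform() is assumed to return a new, transformed dataset.
    return ds.auto_transform()


# Example (requires oracle-ads and a loaded dataset):
# transformed = prepare(ds)
```

Where a suggestion does not fit your business requirements, skip the automatic route and apply the individual changes you do want by hand.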
