Buying a used car – the Data Science way: Part 2

Aside from buying a home, a car is the second-most expensive purchase you will ever make in your life. With that in mind, it’s a crucial decision to get right!

This is a presentation I’d created last year, and I wanted to see if could combine my day-to-day work (Data Science) to understand if data and machine learning could help me decide which car to buy next.

It seems the perfect fit for a blog series on Data Science and Machine Learning, as it covers a range of analytics techniques and thought processes you could apply to several use cases.


If you missed Part 1 of the series, we covered identifying data and data quality checks, you can find it here

But for this blog we’re moving onto Data Enrichments and Exploratory Data Analysis

 

Step 3: Data Enrichments

The previous blog identified a dataset, which we have explored for any data quality issues, with any issues resolved, we now want to enrich the dataset. What do we mean by enrichment? Essentially, it’s making the raw data more useful – this could be imputing sparsely populated values, or creating features, or it could even be brining in additional datasets to supplement the data we already have.   

 

Impute missing values

In the first blog, we found there were some missing values that could impact our analysis.


There are missing values for `Tax` and `MPG` which all relate to Mercedes C Class. The good news is we have some historic data so we can impute those missing values with an average figure from known data.

We can use SQL to calculate an average MPG and TAX based on a partition of:

        Make

        Model

        Year

       Transmission  



The “Partition by” clause is used to break the data into small partitions and is separated by a boundary, in this case by Make (Mercedes), Model (C Class), Year and Transmission.

When we check for NULLS again, we will see all fields are now populated. 


Creating features

We can create new features within the dataset, which could be other attributes that could help us with our prediction of which used car to buy next.

There are four new attributes I’ve added:

  • Age of car (Current date – Year)
  • Mileage bands (to standardise mileage into useful bands
  • Record ID (to provide a unique identifier)
  • Randomly create a car colour

 

Step 4: Exploratory Data Analysis

At this stage, we can now start exploring some of the data and attributes using the built-in charts within OML Notebooks.


 Ford, Mercedes and VW are the most popular makes in the dataset.

This is useful, but it is difficult to understand proportions. We can do this in Oracle SQL utilizing RATIO_TO_REPORT Analytical Function.



We now can see that those top 3 make up over 50% of the dataset.


MPG varies, but there are clear clusters up to around 60 MPG.

There is an outlier at 470MPG, we may want to remove this (or investigate this in more detail considering the current fuel prices!)


Price summary statistics (minimum, median, maximum, and average price) suggests our £15k budget will likely limit us to the more economical brands e.g., Vauxhall, Toyota, Ford, Hyundai and Skoda.

To recap on this blog series:

  • we have found some historic data
  • we have worked through some data quality checks to make sure this is suitable.
  • we have improved the dataset by adding some data enrichments and created features
  • we have begun to delve into the dataset performing some exploratory data analysis

Join us next time as we apply some Machine Learning techniques, to apply to our used car dataset using built-in Oracle Data Mining functionality.


Comments