Buying a used car – the Data Science way: Part 1

After a home, a car is likely to be the second-most expensive purchase you will ever make. With that in mind, it’s a crucial decision to get right!

This started as a presentation I created last year. I wanted to see if I could draw on my day-to-day work in Data Science to understand whether data and machine learning could help me decide which car to buy next.

It seems the perfect fit for a blog series on Data Science and Machine Learning, as it covers a range of analytics techniques and thought processes you could apply to several use cases.

Step 1 – Identifying data

For my problem, the first thing I needed was to identify some suitable data – for this I wanted some historic data on second-hand cars.

The good news is that there is a wealth of data available online, and one great place to start is Kaggle.

Kaggle is an online community platform for data scientists and machine learning enthusiasts. It allows users to collaborate with each other, find and publish datasets, and compete to solve data science challenges (and sometimes there’s money involved!).

The real key here is that the data is open and freely available, and it’s a dataset all about used cars in the UK.

The data includes 100,000 records of UK second-hand cars and has several key attributes that could be useful, including:

  • Make
  • Model
  • Engine Size
  • MPG
  • Price
  • etc.

The data can be downloaded as CSV files or accessed through the Kaggle API.
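
As a minimal sketch of the CSV route, the snippet below reads every CSV file in a folder into a single pandas DataFrame. The folder name `used_cars`, the helper `load_used_cars` and the extra `source_file` column are all assumptions for illustration; the actual file layout depends on how the dataset is packaged when you download it.

```python
from pathlib import Path

import pandas as pd


def load_used_cars(folder):
    """Read every CSV file in `folder` and stack them into one DataFrame."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # Hypothetical helper column: remember which file each row came from.
        frame["source_file"] = csv_path.stem
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No CSV files found in {folder}")
    return pd.concat(frames, ignore_index=True)
```

With the files unzipped into a local folder, `cars = load_used_cars("used_cars")` would give one table to work with in the steps that follow.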

 

Step 2 – Data Quality Checks

As with any new dataset, a good place to start is to perform some checks for data quality.

This allows us to understand the data, which is important in determining if it’s suitable for what we need. If it’s not, then we may need to find an alternative.

Some key points to think about:

  • Size of the dataset – how many attributes and rows do we have?
  • Completeness – do we have any missing/NULL fields?
  • Data Types – are there any data types that are incorrect?
  • Duplicates – do we have any duplicate records that we need to handle?
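
The four checks above can be sketched in a few lines of pandas. This is only one way to do it: the `quality_report` helper and the tiny sample table are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd


def quality_report(df):
    """Summarise size, completeness, data types and duplicates for a DataFrame."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Missing (NaN/None) values per column, keeping only columns with gaps.
        "missing": {col: int(n) for col, n in df.isna().sum().items() if n > 0},
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        # Rows that are exact copies of an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny made-up sample, purely to show the shape of the output.
sample = pd.DataFrame(
    {
        "model": ["Focus", "Focus", "C Class"],
        "price": [9000, 9000, 15000],
        "mpg": [57.7, 57.7, None],
    }
)
report = quality_report(sample)
print(report["missing"])         # {'mpg': 1}
print(report["duplicate_rows"])  # 1
```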

 

Size – we already know this should be 100,000 rows, and a quick check confirms that this is correct. We also have 13 attributes.

Completeness – there are missing values for `Tax` and `MPG`, all of which relate to the Mercedes C Class, so this could be a data loading error. Checking back against the source data would be a sensible next step. There is also an opportunity for data enrichment here, but we’ll come back to that.
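
One way to see which cars the gaps belong to is to count missing values per model. The sketch below assumes columns named `Model`, `Tax` and `MPG` (the real column names may be spelled differently), and the `missing_by_group` helper is hypothetical.

```python
import pandas as pd


def missing_by_group(df, group_col, value_cols):
    """Count missing values in `value_cols` for each value of `group_col`."""
    counts = df[value_cols].isna().groupby(df[group_col]).sum()
    # Keep only the groups that actually have gaps.
    return counts[counts.sum(axis=1) > 0]


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "Model": ["Focus", "C Class", "C Class", "A1"],
        "Tax": [150, None, None, 30],
        "MPG": [57.7, None, 45.0, 60.1],
    }
)
gaps = missing_by_group(sample, "Model", ["Tax", "MPG"])
print(gaps)
```

If every row in the result belongs to a single model, as it does in this sample, that points towards a loading problem for that model rather than randomly scattered gaps.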


Data Types – all data types look to be as expected, but we also need to think ahead to applying machine learning techniques, as certain ML models expect certain data types (many, for example, need text attributes to be converted to numbers first).
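
As a hedged illustration of the data type point, the sketch below turns text columns into numeric indicator columns using one-hot encoding. One-hot encoding is just one common option and not necessarily the approach this series goes on to use; the `encode_text_columns` helper and the column names are assumptions.

```python
import pandas as pd


def encode_text_columns(df):
    """One-hot encode non-numeric columns so every feature is numeric."""
    text_cols = df.select_dtypes(exclude="number").columns.tolist()
    return pd.get_dummies(df, columns=text_cols, dtype=int)


# Made-up rows for illustration only.
sample = pd.DataFrame({"Make": ["Ford", "Audi"], "Price": [9000, 12500]})
encoded = encode_text_columns(sample)
print(list(encoded.columns))  # ['Price', 'Make_Audi', 'Make_Ford']
```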

Duplicates – the dataset had already been pre-cleaned, and a quick check confirms that there are no duplicate rows to worry about.


So far, we have found some historic data and worked through some data quality checks to make sure it is suitable.

Join us next time for the next blog in the series, where we’ll improve the dataset we’re working with by adding some data enrichments.

We’ll also perform some exploratory data analysis, to get us ready to apply some Machine Learning techniques.
