Aside from a home, a car is probably the most expensive purchase you will ever make. With that in mind, it’s a crucial decision to get right!
This post is based on a presentation I created last year. I wanted to see if I could apply my day-to-day work in Data Science to understand whether data and machine learning could help me decide which car to buy next.
It seems the perfect fit for a blog series on Data Science and Machine Learning, as it covers a range of analytics techniques and thought processes you could apply to several use cases.
Step 1 – Identifying data
For my problem, the first thing I needed was to identify
some suitable data – for this I wanted some historic data on second-hand
cars.
The good news here is that there is a wealth of data available online, and one great place to start is Kaggle.
Kaggle is an online community platform for data scientists and machine learning enthusiasts. It allows users to collaborate with other users, find and publish datasets, and compete with other data scientists to solve data science challenges (and sometimes there’s money involved!).
The real key here is that the data is open-source and freely available, and it includes a dataset all about Used Cars in the UK.
The data includes 100,000 records of UK second-hand cars and has several key attributes that could be useful, including:
- Make
- Model
- Engine Size
- MPG
- Price
- etc.
The data can be downloaded as CSV files, or fetched using the Kaggle API.
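Once the CSV files are on disk, they need to be read into one place. The sketch below is one possible way to do that using only Python's standard library `csv` module. It assumes the files sit together in a single folder and share the same columns; the function name `load_csv_folder` and the extra `source_file` column are hypothetical, added here purely for illustration.

```python
import csv
from pathlib import Path


def load_csv_folder(folder):
    """Read every .csv file in `folder` into one list of dict rows."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Hypothetical extra column: remember which file the row came from.
                row["source_file"] = path.stem
                rows.append(row)
    return rows
```

Calling `load_csv_folder("used_cars_data")`, where `used_cars_data` is a placeholder for wherever the files were saved, would return every row as a dictionary keyed by column name. Note that `csv.DictReader` reads every value as a string, so numeric columns still need converting before any analysis.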
Step 2 – Data Quality Checks
As with any new dataset, a good place to start is to
perform some checks for data quality.
This allows us to understand the data, which is important in determining if it’s suitable for what we need. If it’s not, then we may need to find an alternative.
Some key points to think about:
- Size of the dataset – how many attributes and rows do we have?
- How complete is the dataset – do we have any missing/NULL fields?
- Data Types – are there any data types that are incorrect?
- Duplicates – do we have any duplicate records that we need to handle?
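The four checks above can be sketched in a few lines of Python. This is a minimal illustration rather than the exact code used for this post: it assumes the data has already been read into a list of dictionaries (one per row, with string values), and the function names `quality_report` and `non_numeric` are hypothetical. The sample rows are made up purely to show the shape of the output. The same checks are commonly done with a dataframe library; this sketch sticks to the standard library.

```python
from collections import Counter


def quality_report(rows):
    """Summarise size, missing values and duplicates for a list of dict rows."""
    columns = list(rows[0].keys()) if rows else []
    # Completeness: count empty or None values per column.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }
    # Duplicates: rows whose values exactly match an earlier row.
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {
        "rows": len(rows),
        "columns": len(columns),
        "missing": {col: n for col, n in missing.items() if n},
        "duplicates": duplicates,
    }


def non_numeric(rows, column):
    """Data types: return non-empty values in `column` that do not parse as numbers."""
    bad = []
    for row in rows:
        value = row.get(column)
        if value in (None, ""):
            continue
        try:
            float(value)
        except ValueError:
            bad.append(value)
    return bad


# Tiny made-up sample, only to demonstrate the checks.
sample = [
    {"model": "C Class", "price": "17000", "mpg": ""},
    {"model": "A1", "price": "12500", "mpg": "55.4"},
    {"model": "A1", "price": "12500", "mpg": "55.4"},
    {"model": "Fiesta", "price": "n/a", "mpg": "57.7"},
]
report = quality_report(sample)
print(report)
print(non_numeric(sample, "price"))
```

Keeping the report as a plain dictionary makes it easy to compare against what we expect, for example the row count quoted on the dataset page.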
Size – we already know this is 100,000 rows, and a quick check confirms it. We have 13 attributes.
Completeness – there are missing values for `Tax` and `MPG`, all of which relate to the Mercedes C Class, so this could be a data loading error. Checking back against the source data would be a sensible next step. There is also an opportunity for data enrichment here, but we’ll come back to that.
Data Types – all data types look to be as expected. We also need to keep the later machine learning steps in mind, as certain ML models expect certain data types.
Duplicates – the dataset had been pre-cleaned, and a quick check confirms that there are no duplicate rows to worry about.
So far, we have found some historic data and worked through some data quality checks to make sure it is suitable.
Join us next time for the next blog of the series, where we’ll improve the dataset we’re working with by adding some data enrichments.
We’ll also perform some exploratory data analysis,
to get us ready to apply some Machine Learning techniques.