Aside from buying a home, a car is the second-most expensive purchase you will ever make in your life. With that in mind, it’s a crucial decision to get right!
This is a presentation I’d created last year, and I wanted
to see if could combine my day-to-day work (Data Science) to understand if data
and machine learning could help me decide which car to buy next.
It seems the perfect fit for a blog series on Data Science
and Machine Learning, as it covers a range of analytics techniques and thought
processes you could apply to several use cases.
If you missed Part 1 of the series, we covered identifying
data and data quality checks, you can find it here
But for this blog we’re moving onto Data Enrichments
and Exploratory Data Analysis
Step 3: Data
Enrichments
The previous blog identified a dataset, which we have
explored for any data quality issues, with any issues resolved, we now want to
enrich the dataset. What do we mean by enrichment? Essentially, it’s making the
raw data more useful – this could be imputing sparsely populated values,
or creating features, or it could even be brining in additional datasets
to supplement the data we already have.
Impute missing values
In the first blog, we found there were some missing values
that could impact our analysis.
There are missing values for `Tax` and `MPG` which all
relate to Mercedes C Class. The good news is we have some historic data so we
can impute those missing values with an average figure from known data.
We can use SQL to calculate an average MPG and TAX based on
a partition of:
•
Make
•
Model
•
Year
• Transmission
The “Partition by” clause is used to break the data into small partitions and is
separated by a boundary, in this case by Make (Mercedes), Model (C Class), Year
and Transmission.
When we check for NULLS again, we will see all fields are
now populated.
Creating features
We can create new features within the dataset, which could
be other attributes that could help us with our prediction of which used car to
buy next.
There are four new attributes I’ve added:
- Age of car (Current date – Year)
- Mileage bands (to standardise mileage into useful bands
- Record ID (to provide a unique identifier)
- Randomly create a car colour
Step 4:
Exploratory Data Analysis
At this stage, we can now start exploring some of the data and attributes using the built-in charts within OML Notebooks.
This is useful, but it is difficult to understand
proportions. We can do this in Oracle SQL utilizing RATIO_TO_REPORT
Analytical Function.
We now can see that those top 3 make up over 50% of the dataset.
MPG varies, but there are clear
clusters up to around 60 MPG.
There is an outlier at 470MPG, we may want to remove this
(or investigate this in more detail considering the current fuel prices!)
Price summary statistics (minimum, median, maximum, and average price)
suggests our £15k budget will likely limit us to the more economical brands
e.g., Vauxhall, Toyota, Ford, Hyundai and Skoda.
To recap on this blog series:
- we have found some historic data
- we have worked through some data quality checks to make sure this is suitable.
- we have improved the dataset by adding some data enrichments and created features
- we have begun to delve into the dataset performing some exploratory data analysis
Join us next time as we apply some Machine Learning
techniques, to apply to our used car dataset using built-in Oracle Data Mining
functionality.






Comments
Post a Comment