Buying a used car – the Data Science way: Part 3

A car is likely to be the second-most expensive purchase you will ever make, after a home. With that in mind, it’s a crucial decision to get right!

This is a presentation I created last year. I wanted to see if I could apply my day-to-day work (Data Science) to understand whether data and machine learning could help me decide which car to buy next.

It seems the perfect fit for a blog series on Data Science and Machine Learning, as it covers a range of analytics techniques and thought processes you could apply to several use cases.

If you missed any of the previous blogs, don’t worry!

·       Part 1 of the series covered identifying data and data quality checks; you can find it here.

·       Part 2 of the series covered Data Enrichments and Exploratory Data Analysis; you can find it here.

In this blog we’re moving on to Machine Learning, using the built-in Oracle Data Mining in an Oracle Machine Learning Notebook.


Step 5: Machine Learning

Machine Learning can be confusing, so it is helpful to begin by clearly defining the term. As defined by IBM, machine learning is:

“a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.”

Machine learning is an important component of the growing field of data science. Using statistical methods, algorithms can be trained to make classifications or predictions to uncover key insights in data mining projects.

Many services we use daily rely heavily on Machine Learning, from personalised recommendations on websites such as Netflix to chatbots acting as a first port of call for resolving customer queries.

These insights subsequently drive decision-making within applications and businesses, ideally impacting key growth metrics.

Oracle Database has over 30 fully scalable algorithms commonly used by Data Scientists, covering techniques including:

  •       regression
  •       classification
  •       time series
  •       clustering
  •       feature extraction
  •       anomaly detection

For the used car data, there are two use cases we might want to explore.

·       Feature Extraction (Attribute Importance) – to determine which attributes (or fields) are most likely to be predictors of price of a used car 

·       Time Series – to predict the prices of these used cars in the future, to try and determine if they are likely to hold their value


Feature Extraction - Attribute Importance

Oracle Data Mining supports the attribute importance mining function, which ranks attributes according to their importance in predicting a target. In this case our target will be Price.

We provide the model with several settings, including:

·       a MINING_FUNCTION – which Oracle Machine Learning mining function to use; in this case it’s ATTRIBUTE_IMPORTANCE

·       a CASE_ID_COLUMN_NAME – the column that uniquely identifies each record; in this case we use one of our data enrichments (RECORD_ID)

·       a TARGET_COLUMN_NAME – the column we want to predict, and against which each attribute’s importance is ranked; in this case Price
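
To make the idea of ranking attributes against a target more concrete, here is a minimal Python sketch. It does not call Oracle Data Mining and does not reproduce its attribute importance algorithm. As a stand-in, it scores each attribute by the share of price variance explained when rows are grouped by that attribute's value (the correlation ratio). The rows, the column names other than RECORD_ID and the `rank_attributes` helper are hypothetical, made up for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows; NOT taken from the used car dataset.
ROWS = [
    {"RECORD_ID": 1, "MODEL": "Fiesta", "COLOUR": "Red",  "PRICE": 10000},
    {"RECORD_ID": 2, "MODEL": "Fiesta", "COLOUR": "Blue", "PRICE": 11000},
    {"RECORD_ID": 3, "MODEL": "Focus",  "COLOUR": "Red",  "PRICE": 15000},
    {"RECORD_ID": 4, "MODEL": "Focus",  "COLOUR": "Blue", "PRICE": 16000},
]

def variance_explained(rows, attribute, target):
    """Share of target variance explained by grouping on one attribute (0 to 1)."""
    overall = mean(r[target] for r in rows)
    total = sum((r[target] - overall) ** 2 for r in rows)
    if total == 0:
        return 0.0
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r[target])
    between = sum(len(v) * (mean(v) - overall) ** 2 for v in groups.values())
    return between / total

def rank_attributes(rows, case_id, target):
    """Rank every column except the case id and the target, highest score first."""
    attributes = [c for c in rows[0] if c not in (case_id, target)]
    scores = {a: variance_explained(rows, a, target) for a in attributes}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_attributes(ROWS, case_id="RECORD_ID", target="PRICE")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data, price differs far more between models than between colours, so MODEL ranks first. The case id column is excluded from the ranking because it only identifies rows.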


The most important attributes determined by the algorithm are:

        Model

        Engine Size

        Age of Car (data enrichment)

        MPG Enriched (data enrichment)


What if we considered a newer car?

From our Exploratory Data Analysis, we know that 2020 is the latest year in our dataset, so we can use 2020 data to help limit our options down for which car to select.

The Attribute Importance model identified the key attributes as Model, Engine Size, Age of Car and MPG Enriched.

Let us look at Ford cars and compare each of these attributes with Price, to see what we can get for our £15k budget:
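
This comparison can be sketched in Python as a filter followed by a group-by. The listings below are hypothetical placeholders, not rows from the dataset, and field names such as `make`, `model`, `year` and `price` are assumptions about how the data might be laid out.

```python
from collections import defaultdict

BUDGET = 15_000  # the £15k budget from the article, in GBP

# Hypothetical listings for illustration only.
listings = [
    {"make": "Ford", "model": "Fiesta", "year": 2020, "price": 13_500},
    {"make": "Ford", "model": "Focus",  "year": 2020, "price": 14_900},
    {"make": "Ford", "model": "Kuga",   "year": 2020, "price": 22_000},
    {"make": "Ford", "model": "Fiesta", "year": 2017, "price": 9_000},
    {"make": "Audi", "model": "A3",     "year": 2020, "price": 21_000},
]

def cheapest_by_model(rows, make, year, budget):
    """Return {model: cheapest price} for one make and year, within budget."""
    prices = defaultdict(list)
    for row in rows:
        if row["make"] == make and row["year"] == year and row["price"] <= budget:
            prices[row["model"]].append(row["price"])
    return {model: min(values) for model, values in prices.items()}

options = cheapest_by_model(listings, make="Ford", year=2020, budget=BUDGET)
print(options)
```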


Model vs Price

With a £15k budget for a newer (2020) car, our choices look to be limited to the Fiesta, Focus and KA.


Engine Size vs Price


Our choices look to be limited to smaller engines – between 1L and 1.1L


Age vs Price 

There are plenty of choices that come within our budget, but in this dataset, options have been limited to Ford Fiesta or Focus.


MPG vs Price


MPG for the most part seems to be similar (around 55mpg). It appears that MPG generally decreases as the price increases.

This is much better than my current MPG of around 37MPG!



What if we predicted the future price of a newer Ford Fiesta or Focus?

We could use a TIME SERIES Forecast to predict the prices of Focus and Fiestas, to determine if they are likely to hold value in the future.

To do this, we need to use historic data to learn from:

        Limit the dataset to cars £15k and under, to match our budget

        Include 5 years of historic data to learn from (2015-20)

        Predict 4 years into the future of what the value could be
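
The steps above do not depend on one particular forecasting algorithm, and the settings used for the forecast are not listed here, so what follows is only a minimal Python sketch of one common approach: Holt's linear-trend exponential smoothing applied to yearly average prices. The price series, the smoothing parameters `alpha` and `beta`, and the `holt_forecast` helper are all hypothetical.

```python
def holt_forecast(series, horizon, alpha=0.5, beta=0.5):
    """Forecast `horizon` steps ahead with Holt's linear-trend smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise the level with the first value and the trend with the first change.
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Each future step extends the last level by the last trend.
    return [level + step * trend for step in range(1, horizon + 1)]

# Hypothetical yearly average prices (GBP) for 2015-2020; not real data.
history = [9_000, 9_500, 10_000, 10_500, 11_000, 11_500]

# Predict four years ahead, matching the final step above.
forecast = holt_forecast(history, horizon=4)
print(forecast)
```

A forecast that stays flat or rises would suggest the car holds its value, while a falling forecast would suggest depreciation.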



The Ford Fiesta looks to be slightly cheaper than the Focus, which may give us more options to choose from, and the forecast suggests it will hold its value, so this will be our choice!


Actions

Now that we have undertaken some EDA and Machine Learning, we want to act on these insights.

Considering my budget is £15k, the analysis suggests I will be looking for:

        a MAKE in more economical car ranges - this will be Ford

        an ENGINE SIZE of 1L or 1.1L

        MILEAGE should be under 10k miles

        a MODEL of Fiesta, using a forecast of future prices 
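
As a sketch, the criteria above can be written as a single Python predicate applied to candidate listings. The `matches_criteria` helper, the field names and the two example listings are hypothetical.

```python
BUDGET = 15_000            # GBP
MAX_MILEAGE = 10_000       # miles
ENGINE_SIZES = {1.0, 1.1}  # litres

def matches_criteria(car):
    """True when a listing satisfies every criterion from the list above."""
    return (
        car["make"] == "Ford"
        and car["model"] == "Fiesta"
        and car["engine_size"] in ENGINE_SIZES
        and car["mileage"] < MAX_MILEAGE
        and car["price"] <= BUDGET
    )

# Hypothetical listings for illustration only.
candidates = [
    {"make": "Ford", "model": "Fiesta", "engine_size": 1.0, "mileage": 6_000, "price": 14_000},
    {"make": "Ford", "model": "Focus", "engine_size": 1.5, "mileage": 4_000, "price": 14_500},
]
shortlist = [car for car in candidates if matches_criteria(car)]
print(shortlist)
```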


Outcome

The final stage of this project is to plug the data-driven actions we have identified into AutoTrader to help me pick out a used car.




After inputting those attributes into AutoTrader, we are returned a Ford Fiesta at £14,200, which leaves us with £800 left over.

This is just one example of a data-driven result using Oracle Machine Learning Notebooks and Data Science to help with a real-life problem.
