Concept of Overfitting

Jay Prakash Thakur
Apr 3, 2021

A demonstration of overfitting using a Polynomial Linear Regression example

Objective

To understand the concept of Overfitting using Linear Regression with Polynomial Features.

So let’s first understand: what is Regression?

Have you ever wondered how we can predict the price of a house or a car using Machine Learning? Well, the Regression technique is used.

Regression is used to predict a continuous value. Some of the common Regression techniques are:

1. Simple Linear Regression

2. Multiple Linear Regression

3. Polynomial Linear Regression

Now let’s briefly understand what Overfitting is.

Let’s suppose we have created a model & we want to check how well it works on unseen data. Sometimes a model performs poorly due to Overfitting or Underfitting.

When a model gives high accuracy on the train dataset but performs poorly on unseen data, we call it an Overfitted model.

Underfitting is when a model performs poorly even on the training dataset. Underfitted models are unable to find the relationship between input & target.

In this article, we will learn the concept of Overfitting using Linear Regression with Polynomial Features.

Let’s start

We will create 20 random uniformly distributed values & then use the sin function to generate the targets. We will fit linear regression with polynomial features of degree [0, 1, 3, 9].
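
A minimal sketch of this setup (the noise level & the exact target, sin(2πx) plus Gaussian noise, are my assumptions; the original uses 20 uniform samples & a sin function):

import numpy as np

rng = np.random.RandomState(0)

n_samples = 20
X = rng.uniform(0, 1, n_samples)  # 20 uniformly distributed inputs in [0, 1]
# targets: a sin curve plus Gaussian noise (the noise scale is an assumption)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=n_samples)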

Now that we have our dataset X, y, let’s draw it.
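
Something like the following draws the raw points (the plotting details are my own):

import matplotlib.pyplot as plt

plt.scatter(X, y)  # the 20 noisy samples
plt.xlabel("x")
plt.ylabel("y")
plt.show()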

Let’s divide our dataset into train & test sets using sklearn.
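
A sketch of the split (the 50/50 ratio is an assumption, matching the test_size used later for the larger dataset):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)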

Now let’s define our model & plot the fitted curve for degrees 0, 1, 3, 9.
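
A sketch of the model, assuming polynomial feature expansion followed by ordinary least squares, wired together with sklearn’s Pipeline (mentioned in the Challenges section below):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

degrees = [0, 1, 3, 9]
x_line = np.linspace(0, 1, 100).reshape(-1, 1)  # grid for drawing each fitted curve

plt.figure(figsize=(10, 8))
for i, degree in enumerate(degrees):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train.reshape(-1, 1), y_train)

    plt.subplot(2, 2, i + 1)
    plt.scatter(X_train, y_train, s=15)
    plt.plot(x_line, model.predict(x_line), color="r")
    plt.title(f"degree = {degree}")
plt.show()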

We will get one graph per degree: the low-degree fits are too simple to capture the curve, while the degree-9 fit passes through almost every training point.

Let’s display the weights in tabular form.
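
One way to tabulate the learned coefficients per degree (pandas is my choice here; the original table layout may differ):

import pandas as pd

weights = {}
for degree in degrees:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train.reshape(-1, 1), y_train)
    # row i holds the weight on x^i (row 0 is the bias column, absorbed by the intercept)
    weights[f"degree {degree}"] = pd.Series(model.named_steps["linear"].coef_)

print(pd.DataFrame(weights).round(2))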

We have trained our model. Now let’s evaluate it by calculating the train & test error.
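
A sketch of that computation, assuming one model per degree from 1 to 9 & mean squared error as the metric (both are assumptions; the printed lists below have one entry per model):

from sklearn.metrics import mean_squared_error

train_errors, test_errors = [], []
for degree in range(1, 10):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train.reshape(-1, 1), y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train.reshape(-1, 1))))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test.reshape(-1, 1))))

print(train_errors, test_errors)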

Train errors (one per model):
[0.208575632499395, 0.20178321640091842, 0.15247400351622362, 0.10418786631408623, 0.09688701939648986, 0.09263963531131172, 0.08283677775295668, 0.06327629715761585, 0.06147112825631159]

Test errors (one per model):
[0.7076444970946971, 0.7023260466949512, 1.7102279649595118, 8.775219946391115, 26.828407071463392, 117.45559764444442, 2039.7780917210393, 68181.7212153289, 714896.6962217717]

We can see the train & test error in the graph below: the test error is huge, which means our model is overfitted.
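
A sketch of how that graph can be drawn (the log scale is my choice, since the test error spans several orders of magnitude):

plt.plot(range(1, 10), train_errors, marker="o", label="train error")
plt.plot(range(1, 10), test_errors, marker="o", label="test error")
plt.yscale("log")  # the test error grows by orders of magnitude with degree
plt.xlabel("polynomial degree")
plt.ylabel("error")
plt.legend()
plt.show()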

So how do we prevent this Overfitting?

Overfitting can be prevented by

  1. Increasing Dataset
  2. Regularisation

Increasing Dataset
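
With more training points, a high-degree polynomial has less room to chase noise. A minimal sketch of generating the larger sample the same way as before (the size of 100 & the names X_new, y_new are assumptions matching the split code below):

X_new = rng.uniform(0, 1, 100)
y_new = np.sin(2 * np.pi * X_new) + rng.normal(scale=0.2, size=100)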

# divide the new dataset into train & test
# (train_test_split returns X_train, X_test, y_train, y_test, in that order)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.5)

Regularisation

The next approach is to use a Regularisation technique.

In simple words, Regularization is used to prevent overfitting.

There are many types of regularisation. We will use L2 Regularisation, also called Ridge Regularisation.

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. It has uses in fields including econometrics, chemistry, and engineering.
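
In symbols (my own summary, not from the original), ridge regression minimizes the usual squared error plus an L2 penalty on the weights, ||y - Xw||² + λ||w||², where a larger λ shrinks the coefficients more.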

Now let’s draw the test & train error for lambda = 1, 1/10, 1/100, 1/1000, 1/10000, 1/100000.
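
A sketch of that experiment, assuming a degree-9 model; note that sklearn’s Ridge calls the penalty strength alpha, which plays the role of lambda here:

from sklearn.linear_model import Ridge

lambdas = [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
ridge_train_errors, ridge_test_errors = [], []
for lam in lambdas:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=9)),
        ("ridge", Ridge(alpha=lam)),  # alpha is sklearn's name for lambda
    ])
    model.fit(X_train.reshape(-1, 1), y_train)
    ridge_train_errors.append(mean_squared_error(y_train, model.predict(X_train.reshape(-1, 1))))
    ridge_test_errors.append(mean_squared_error(y_test, model.predict(X_test.reshape(-1, 1))))

plt.plot(lambdas, ridge_train_errors, marker="o", label="Train error using L2")
plt.plot(lambdas, ridge_test_errors, marker="o", label="Test error using L2")
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("error")
plt.legend()
plt.show()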

Best Model (According to test performance)

As the “Train error using L2” & “Test error using L2” graphs show, for each lambda value our train error is almost the same, but the test error differs. We can see that for lambda = 1 there is some test error, whereas for lambda = 1/100000 the test error is huge. So according to this graph, lambda = 1/100 gives the best model.

My Contribution

I went through various tutorials, understood the code & implemented this on my own. I added data points & experimented with multiple degrees, as well as captured the train & test error. I also plotted the graphs.

Challenges

The first challenge was to fit models with many degrees; I used the Pipeline module from sklearn to fix this.

The next was to prevent overfitting; I increased the data & used L2 Regularisation to fix this.

Experiments & Findings

Experiment: tried many additional lambda values (1/1000000, 1/10000000) to see whether the train & test error increase or decrease.

Finding: as lambda gets smaller, the test error keeps increasing.

What’s Next

Ensemble Technique

You can read more about it here. Find the notebook here.

References

https://medium.datadriveninvestor.com/regression-in-machine-learning-296caae933ec

https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning

https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

https://medium.com/@minions.k/ridge-regression-l1-regularization-method-31b6bc03cbf

https://medium.com/all-about-ml/lasso-and-ridge-regularization-a0df473386d5

https://en.wikipedia.org/wiki/Ridge_regression

Kindly let me know your feedback in the comments section.
