Concept of Overfitting

Jay Prakash Thakur
Apr 3, 2021

A demonstration of overfitting using a Polynomial Linear Regression example

Objective

To understand the concept of Overfitting using Linear Regression with Polynomial Features.

So let’s first understand: what is Regression?

Have you ever wondered how we can predict the price of a house or a car using Machine Learning? Well, the Regression technique is used.

Regression is used to predict a continuous value. Some of the common Regression techniques are:

1. Simple Linear Regression

2. Multiple Linear Regression

3. Polynomial Linear Regression

Now let’s briefly understand what Overfitting is.

Let’s suppose we have created a model & we want to check how well it works on unseen data. Sometimes a model performs poorly due to Overfitting or Underfitting.

When a model gives high accuracy on the train dataset but performs poorly on unseen data, we call it an Overfitted model.

Underfitting is when a model performs poorly even on the training dataset. Underfitted models are unable to find the relationship between input & target.

In this article, we will learn the concept of Overfitting using Linear Regression with Polynomial Features.

Let’s start

We will create 20 random uniformly distributed values & then use the sin function to generate the targets. We will fit linear regression with polynomial features of degree [0, 1, 3, 9].
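
A minimal sketch of this setup (the noise level & the exact target, sin(2πx) plus Gaussian noise, are my assumptions; the original uses 20 uniform samples & a sin function):

import numpy as np

rng = np.random.RandomState(0)

n_samples = 20
X = rng.uniform(0, 1, n_samples)  # 20 uniformly distributed inputs in [0, 1]
# targets: a sin curve plus Gaussian noise (the noise scale is an assumption)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=n_samples)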

Now that we have our dataset X, y, let’s draw it.
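
Something like the following draws the raw points (the plotting details are my own):

import matplotlib.pyplot as plt

plt.scatter(X, y)  # the 20 noisy samples
plt.xlabel("x")
plt.ylabel("y")
plt.show()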

Let’s divide our dataset into train & test sets using sklearn.
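
A sketch of the split (the 50/50 ratio is an assumption, matching the test_size used later for the larger dataset):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)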

Now let’s define our model & plot the fitted curve for degrees 0, 1, 3, 9.
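
A sketch of the model, assuming polynomial feature expansion followed by ordinary least squares, wired together with sklearn’s Pipeline (mentioned in the Challenges section below):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

degrees = [0, 1, 3, 9]
x_line = np.linspace(0, 1, 100).reshape(-1, 1)  # grid for drawing each fitted curve

plt.figure(figsize=(10, 8))
for i, degree in enumerate(degrees):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train.reshape(-1, 1), y_train)

    plt.subplot(2, 2, i + 1)
    plt.scatter(X_train, y_train, s=15)
    plt.plot(x_line, model.predict(x_line), color="r")
    plt.title(f"degree = {degree}")
plt.show()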

We will get one graph per degree: the low-degree fits are too simple to capture the curve, while the degree-9 fit passes through almost every training point.

Let’s display the weights in tabular form.
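
One way to tabulate the learned coefficients per degree (pandas is my choice here; the original table layout may differ):

import pandas as pd

weights = {}
for degree in degrees:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train.reshape(-1, 1), y_train)
    # row i holds the weight on x^i (row 0 is the bias column, absorbed by the intercept)
    weights[f"degree {degree}"] = pd.Series(model.named_steps["linear"].coef_)

print(pd.DataFrame(weights).round(2))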

We have trained our model. Now let’s evaluate it by calculating the train & test error.
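
A sketch of that computation, assuming one model per degree from 1 to 9 & mean squared error as the metric (both are assumptions; the printed lists below have one entry per model):

from sklearn.metrics import mean_squared_error

train_errors, test_errors = [], []
for degree in range(1, 10):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train.reshape(-1, 1), y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train.reshape(-1, 1))))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test.reshape(-1, 1))))

print(train_errors, test_errors)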

Train errors (one per model):
[0.208575632499395, 0.20178321640091842, 0.15247400351622362, 0.10418786631408623, 0.09688701939648986, 0.09263963531131172, 0.08283677775295668, 0.06327629715761585, 0.06147112825631159]

Test errors (one per model):
[0.7076444970946971, 0.7023260466949512, 1.7102279649595118, 8.775219946391115, 26.828407071463392, 117.45559764444442, 2039.7780917210393, 68181.7212153289, 714896.6962217717]

We can see the train & test error in the graph below: the test error is huge, which means our model is overfitted.
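
A sketch of how that graph can be drawn (the log scale is my choice, since the test error spans several orders of magnitude):

plt.plot(range(1, 10), train_errors, marker="o", label="train error")
plt.plot(range(1, 10), test_errors, marker="o", label="test error")
plt.yscale("log")  # the test error grows by orders of magnitude with degree
plt.xlabel("polynomial degree")
plt.ylabel("error")
plt.legend()
plt.show()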

So how do we prevent this Overfitting?

Overfitting can be prevented by

  1. Increasing Dataset
  2. Regularisation

Increasing Dataset
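
With more training points, a high-degree polynomial has less room to chase noise. A minimal sketch of generating the larger sample the same way as before (the size of 100 & the names X_new, y_new are assumptions matching the split code below):

X_new = rng.uniform(0, 1, 100)
y_new = np.sin(2 * np.pi * X_new) + rng.normal(scale=0.2, size=100)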

# divide the new dataset into train & test
# (train_test_split returns X_train, X_test, y_train, y_test, in that order)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.5)

Regularisation

The next approach is to use a Regularisation technique.

In simple words, Regularization is used to prevent overfitting.

There are many types of regularisation. We will use L2 Regularisation, also called Ridge Regularisation.

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. It has uses in fields including econometrics, chemistry, and engineering.
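
In symbols (my own summary, not from the original), ridge regression minimizes the usual squared error plus an L2 penalty on the weights, ||y - Xw||² + λ||w||², where a larger λ shrinks the coefficients more.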

Now let’s draw the test & train error for lambda = 1, 1/10, 1/100, 1/1000, 1/10000, 1/100000.
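
A sketch of that experiment, assuming a degree-9 model; note that sklearn’s Ridge calls the penalty strength alpha, which plays the role of lambda here:

from sklearn.linear_model import Ridge

lambdas = [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
ridge_train_errors, ridge_test_errors = [], []
for lam in lambdas:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=9)),
        ("ridge", Ridge(alpha=lam)),  # alpha is sklearn's name for lambda
    ])
    model.fit(X_train.reshape(-1, 1), y_train)
    ridge_train_errors.append(mean_squared_error(y_train, model.predict(X_train.reshape(-1, 1))))
    ridge_test_errors.append(mean_squared_error(y_test, model.predict(X_test.reshape(-1, 1))))

plt.plot(lambdas, ridge_train_errors, marker="o", label="Train error using L2")
plt.plot(lambdas, ridge_test_errors, marker="o", label="Test error using L2")
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("error")
plt.legend()
plt.show()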

Best Model (According to test performance)

As the “Train error using L2” & “Test error using L2” graphs show, for each lambda value our train error is almost the same, but the test error differs. We can see that for lambda = 1 there is some test error, whereas for lambda = 1/100000 the test error is huge. So according to this graph, lambda = 1/100 gives the best model.

My Contribution

I went through various tutorials, understood the code & implemented this on my own. I added data points & experimented with multiple degrees, as well as captured the train & test error. I also plotted the graphs.

Challenges

The first challenge was to fit models with many degrees; I used the Pipeline module from sklearn to fix this.

The next was to prevent overfitting; I increased the data & used L2 Regularisation to fix this.

Experiments & Findings

Experiment: tried many additional lambda values (1/1000000, 1/10000000) to see whether the train & test error increase or decrease.

Finding: as lambda gets smaller, the test error keeps increasing.

What’s Next

Ensemble Technique

You can read more about it here. Find the notebook here.

References

https://medium.datadriveninvestor.com/regression-in-machine-learning-296caae933ec

https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning

https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

https://medium.com/@minions.k/ridge-regression-l1-regularization-method-31b6bc03cbf

https://medium.com/all-about-ml/lasso-and-ridge-regularization-a0df473386d5

https://en.wikipedia.org/wiki/Ridge_regression

Kindly let me know your feedback in the comments section.
