Demonstration of overfitting using a Polynomial Linear Regression example
Understanding the concept of Overfitting using Linear Regression with Polynomial Features.
So let’s first understand: what is Regression?
Have you ever wondered how we can predict the price of a house or a car using Machine Learning? This is where Regression comes in.
Regression is used to predict a continuous value. Some of the common Regression techniques are -
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Linear Regression
Now let’s briefly understand what Overfitting is.
Suppose we have created a model and want to check how well it works on unseen data. Sometimes a model performs poorly due to Overfitting or Underfitting.
When a model gives high accuracy on the training dataset but performs poorly on unseen data, we call it an Overfitted model.
Underfitting is when a model performs poorly even on the training dataset. Underfitted models are unable to find the relationship between the input and the target.
In this article, we will explore the concept of Overfitting using Linear Regression with Polynomial Features.
We will create 20 random uniformly distributed values, generate targets using the sin function, and fit linear regression models of degree 0, 1, 3 and 9.
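A minimal sketch of generating such a dataset, assuming NumPy; the noise level (0.2) and the random seed are my own choices, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 inputs drawn uniformly from [0, 1]
X = rng.uniform(0, 1, size=20)
X.sort()

# targets: sin(2*pi*x) plus Gaussian noise (the noise scale 0.2 is an assumption)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=20)
```

Sorting X is only for convenience when plotting the curve later.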
Now that we have our dataset X, y, let’s plot it.
Let’s divide our dataset into train & test sets using sklearn.
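A sketch of the split, assuming a 50/50 ratio (the article does not state one) and the same synthetic data as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))  # inputs as a column vector for sklearn
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)

# hold out half the points as unseen data (the 50/50 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
```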
Now let’s define our model and plot the fitted curves for degrees 0, 1, 3 and 9. We will get graphs like this.
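One way to sketch these models, using sklearn's Pipeline as the article does later; the data-generation details are my assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)

models = {}
for degree in [0, 1, 3, 9]:
    # PolynomialFeatures expands x into [1, x, x^2, ..., x^degree];
    # LinearRegression then fits one weight per feature
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    models[degree] = model
```

Each fitted pipeline can then be evaluated on a dense grid of x values to draw its curve.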
Let’s display the weights in tabular form.
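A rough sketch of extracting and tabulating the weights from the fitted pipelines (data setup and formatting are my assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)

weights = {}
for degree in [0, 1, 3, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    lr = model.named_steps["linearregression"]
    # weight vector: [intercept, w1, ..., w_degree]
    weights[degree] = np.concatenate([[lr.intercept_], lr.coef_[1:]])

# print one column per degree, leaving blanks where a model has no such weight
for i in range(10):
    row = [f"{weights[d][i]:12.3f}" if i < len(weights[d]) else " " * 12
           for d in [0, 1, 3, 9]]
    print(f"w{i}:", *row)
```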
We have trained our model. Now let’s evaluate it by calculating the train & test error.
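A sketch of how these errors could be computed, one mean squared error per polynomial degree; the exact degree range and data setup are my assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

train_err, test_err = [], []
for degree in range(10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # error on the data the model saw vs. the data it did not
    train_err.append(mean_squared_error(y_train, model.predict(X_train)))
    test_err.append(mean_squared_error(y_test, model.predict(X_test)))
```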
Train error: [0.208575632499395, 0.20178321640091842, 0.15247400351622362, 0.10418786631408623, 0.09688701939648986, 0.09263963531131172, 0.08283677775295668, 0.06327629715761585, 0.06147112825631159]
Test error: [0.7076444970946971, 0.7023260466949512, 1.7102279649595118, 8.775219946391115, 26.828407071463392, 117.45559764444442, 2039.7780917210393, 68181.7212153289, 714896.6962217717]
We can see the train & test error in this graph. As the degree grows, the test error becomes huge, which means our model is overfitted.
So how do we prevent this Overfitting?
Overfitting can be prevented by
- Increasing the dataset size
# divide the new dataset into train & test
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.5)
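A self-contained sketch of this step; the enlarged sample size (100) and the data-generation details are my assumptions. Note that train_test_split returns the splits in the order X_train, X_test, y_train, y_test:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# a larger sample of 100 points from the same sin-plus-noise process
X_new = rng.uniform(0, 1, size=(100, 1))
y_new = np.sin(2 * np.pi * X_new).ravel() + rng.normal(scale=0.2, size=100)

# divide the new dataset into train & test
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
    X_new, y_new, test_size=0.5
)
```

With more training points, the degree-9 model has far less room to chase individual noisy points.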
The next approach to minimize the loss is to use a Regularization technique.
In simple words, Regularization is used to prevent overfitting.
There are many types of regularization. We will use L2 Regularisation, also called Ridge Regularisation.
Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. It has uses in fields including econometrics, chemistry, and engineering.
Now let’s draw the test and train error for lambda = 1, 1/10, 1/100, 1/1000, 1/10000 and 1/100000.
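The sweep over lambda could be sketched like this, under the same assumptions as before (seed, noise level, degree-9 features); note that scikit-learn calls the penalty strength alpha:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

lambdas = [1, 1/10, 1/100, 1/1000, 1/10000, 1/100000]
train_err, test_err = [], []
for lam in lambdas:
    # degree-9 features with an L2 penalty; sklearn's Ridge names lambda "alpha"
    model = make_pipeline(PolynomialFeatures(9), Ridge(alpha=lam))
    model.fit(X_train, y_train)
    train_err.append(mean_squared_error(y_train, model.predict(X_train)))
    test_err.append(mean_squared_error(y_test, model.predict(X_test)))

best_lambda = lambdas[int(np.argmin(test_err))]  # lowest test error wins
```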
Best Model (according to test performance)
As the Train error using L2 and Test error using L2 graphs show, the train error is almost the same for each lambda value, but the test error differs. For lambda = 1 there is some test error, whereas for lambda = 1/100000 the test error is huge. According to these graphs, lambda = 1/100 gives the best model.
I went through various tutorials, understood the code and implemented this on my own. I added data points, experimented with multiple degrees, captured the train & test errors, and plotted the graphs.
The first challenge was to fit models of many degrees; I used the Pipeline module from sklearn to fix this.
The next was to prevent overfitting; I increased the data and used L2 Regularisation to fix this.
Experiments & Findings
Experiment — tried many more lambda values (1/1000000, 1/10000000) to see whether the train & test error increase or decrease.
Finding — as lambda decreases further, the test error keeps increasing.
Kindly let me know your feedback in the comment section.