Concept of Overfitting
Demonstration of overfitting using a Polynomial Linear Regression example
Objective
To understand the concept of Overfitting using Linear Regression with Polynomial Features.
So let’s first understand: what is regression?
Have you ever wondered how we can predict the price of a house or a car using machine learning? Well, this is where the regression technique comes in.
Regression is used to predict a continuous value. Some of the common regression techniques are:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Linear Regression
Now let’s briefly understand what overfitting is.
Suppose we have created a model and we want to check how well it works on unseen data. Sometimes the model performs poorly due to overfitting or underfitting.
When a model gives high accuracy on the training dataset but performs poorly on unseen data, we call it an overfitted model.
Underfitting is when a model performs poorly even on the training dataset. Underfitted models are unable to find the relationship between the input and the target.
In this article, we will explore the concept of overfitting using linear regression with polynomial features.
Let’s start
We will create 20 uniformly distributed random values and use a noisy sin function as the target. We will then fit linear regression models with polynomial features of degree 0, 1, 3, and 9.
Now that we have our dataset X, y, let’s plot it.
Let’s divide our dataset into train & test sets using sklearn.
Now let’s define our model & plot the fitted curves for degrees 0, 1, 3, and 9.
We will get graphs like this.
Let’s display the weights in tabular form.
We have trained our models. Now let’s evaluate them by calculating the train & test error.
Train error: [0.208575632499395, 0.20178321640091842, 0.15247400351622362, 0.10418786631408623, 0.09688701939648986, 0.09263963531131172, 0.08283677775295668, 0.06327629715761585, 0.06147112825631159]
Test error: [0.7076444970946971, 0.7023260466949512, 1.7102279649595118, 8.775219946391115, 26.828407071463392, 117.45559764444442, 2039.7780917210393, 68181.7212153289, 714896.6962217717]
We can see the train & test error in this graph. The test error is huge, which means our model is overfitted.
So how can we prevent this overfitting?
Overfitting can be prevented by:
- Increasing the dataset
- Regularisation
Increasing Dataset
```python
# divide the new dataset into train & test
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
    X_new, y_new, test_size=0.5)
```
Regularisation
The next approach to minimising the loss is the regularisation technique.
In simple words, regularisation is used to prevent overfitting.
There are many types of regularisation. We will use L2 regularisation, also called ridge regularisation.
Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. It has uses in fields including econometrics, chemistry, and engineering.
Now let’s draw the test and train error for lambda = 1, 1/10, 1/100, 1/1000, 1/10000, 1/100000.
Best Model (According to test performance)
As the train error (L2) and test error (L2) graphs show, the train error is almost the same for every lambda value, but the test error differs. We can see that for lambda = 1 there is some test error, whereas for lambda = 1/100000 the test error is huge. So, according to this graph, lambda = 1/100 gives the best model.
My Contribution
I went through various tutorials, understood the code & implemented this on my own. I added data points & experimented with multiple degrees, as well as captured the train & test error. I also plotted the graphs.
Challenges
The first challenge was fitting the model with many degrees; I used the Pipeline module from sklearn to fix this.
The next was preventing overfitting; increasing the data & using L2 regularisation fixed this.
Experiments & Findings
Experiment — tried even smaller lambda values (1/1000000, 1/10000000) to see whether the train & test error increase or decrease.
Finding — as we can see, for these smaller lambda values the test error keeps increasing.
What’s Next
Ensemble Techniques
You can read more about them here. Find the notebook here.
References
https://medium.datadriveninvestor.com/regression-in-machine-learning-296caae933ec
https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning
https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
https://medium.com/@minions.k/ridge-regression-l1-regularization-method-31b6bc03cbf
https://medium.com/all-about-ml/lasso-and-ridge-regularization-a0df473386d5
https://en.wikipedia.org/wiki/Ridge_regression
Kindly let me know your feedback in the comment section.