# Concept of Overfitting

Demonstration of overfitting using a Polynomial Linear Regression example

**Objective**

To understand the concept of Overfitting using Linear Regression with Polynomial Features.

So let’s first understand **What is Regression**?

Have you ever wondered how we can predict the price of a house or a car using Machine Learning? Well, the Regression technique is used.

**Regression** is used to predict a continuous value. Some of the common Regression techniques are:

1. Simple Linear Regression

2. Multiple Linear Regression

3. Polynomial Linear Regression

Now let’s understand **what is Overfitting** briefly.

Let’s suppose we have created a model & we want to check how well it works on unseen data. Sometimes a model performs poorly due to **Overfitting** or **Underfitting**.

When a model gives high accuracy on the training dataset but performs poorly on unseen data, we call it an Overfitted model.

**Underfitting** is when a model performs poorly even on the training dataset. Underfitted models are unable to find the relationship between input & target.

In this article, we will learn the concept of Overfitting using Linear Regression with Polynomial Features.

*Let’s start*

We will create 20 `uniformly distributed` random values & then use the `sin` function to generate the target. We will then fit linear regression with polynomial features of order **[0, 1, 3, 9]**.
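A minimal sketch of the dataset generation. The original code isn’t shown, so the exact target `sin(2πx)`, the noise level, and the fixed seed are assumptions here:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (an assumption)

# 20 uniformly distributed random values on [0, 1]
X = rng.uniform(0, 1, 20)

# target: sin(2*pi*x) plus a little Gaussian noise (noise level is an assumption)
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 20)
```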

Now that we have our dataset X, y, let’s draw it.

Let’s divide our dataset into **train** & **test** sets using **sklearn**.
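A sketch of the split, assuming a 50/50 `test_size` (the exact split ratio is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 20)

# hold half of the 20 points out for testing (the 50/50 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
```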

Now let’s define our model & plot the fit for degrees `0, 1, 3, 9`.
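One way to define such a model is with a `Pipeline` that chains `PolynomialFeatures` with `LinearRegression` (a sketch; the data generation here repeats the assumptions above):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 20)

models = {}
for degree in [0, 1, 3, 9]:
    # expand x into [1, x, x^2, ..., x^degree], then fit ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X.reshape(-1, 1), y)
    models[degree] = model
```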

We will get graphs like this.

Let’s display the weights in tabular form.
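A sketch of collecting the fitted weights into a table, assuming pandas; lower-degree models simply have fewer weights, so the missing entries show as NaN:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 20)

rows = {}
for degree in [0, 1, 3, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    # the learned weight for each polynomial term of this degree's expansion
    coefs = model.named_steps["linearregression"].coef_
    rows[f"degree {degree}"] = pd.Series(coefs)

# one column per degree; NaN where a lower-degree model has no such weight
weights = pd.DataFrame(rows)
print(weights)
```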

We have trained our models. Now let’s evaluate them by calculating the train & test error.
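A sketch of the error computation. RMSE as the metric and sweeping every degree from 0 to 9 are assumptions; the article plots degrees 0, 1, 3, 9:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

train_errors, test_errors = [], []
for degree in range(10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # root-mean-squared error on each split (the exact metric is an assumption)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)) ** 0.5)
```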

Train errors: [0.208575632499395, 0.20178321640091842, 0.15247400351622362, 0.10418786631408623, 0.09688701939648986, 0.09263963531131172, 0.08283677775295668, 0.06327629715761585, 0.06147112825631159]

Test errors: [0.7076444970946971, 0.7023260466949512, 1.7102279649595118, 8.775219946391115, 26.828407071463392, 117.45559764444442, 2039.7780917210393, 68181.7212153289, 714896.6962217717]

We can see the train & test error through this graph. The train error keeps decreasing as the degree grows, but the test error blows up, which means our model is overfitted.

So **How to prevent this Overfitting ?**

Overfitting can be prevented by

- Increasing Dataset
- Regularisation

**Increasing Dataset**

```python
# divide the new, larger dataset into train & test
# (note: train_test_split returns X_train, X_test, y_train, y_test, in that order)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
    X_new, y_new, test_size=0.5)
```
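A minimal sketch of the “more data” fix, assuming the same sin-plus-noise generator; the larger sample size of 200 is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.uniform(0, 1, n).reshape(-1, 1)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, n)
    return X, y

def degree9_test_rmse(n):
    X, y = make_data(n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    model = make_pipeline(PolynomialFeatures(9), LinearRegression()).fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te)) ** 0.5

# with many more points, the degree-9 fit can no longer memorise the noise
small_n_error = degree9_test_rmse(20)
large_n_error = degree9_test_rmse(200)
```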

**Regularisation**

The next approach to minimize the loss is the **Regularization** technique.

In simple words, **Regularization** is used to prevent overfitting.

There are many types of regularization. We will use **L2 Regularisation**, also called **Ridge Regularisation**.

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. It has uses in fields including econometrics, chemistry, and engineering.
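A minimal Ridge sketch, assuming scikit-learn; note that sklearn’s `alpha` parameter plays the role of lambda here, and the data generation repeats the assumptions above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 20)

# same degree-9 features as before, but with an L2 penalty on the weights:
# Ridge minimises ||y - Xw||^2 + alpha * ||w||^2, shrinking the weights
model = make_pipeline(PolynomialFeatures(9), Ridge(alpha=1.0))
model.fit(X, y)

coefs = model.named_steps["ridge"].coef_
```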

Now let’s draw the test and train error for lambda = 1, 1/10, 1/100, 1/1000, 1/10000, 1/100000.
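The lambda sweep can be sketched like this (same assumed dataset; errors are RMSE, an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

lambdas = [1, 1/10, 1/100, 1/1000, 1/10000, 1/100000]
train_err, test_err = [], []
for lam in lambdas:
    # refit the degree-9 model with each strength of L2 penalty
    model = make_pipeline(PolynomialFeatures(9), Ridge(alpha=lam)).fit(X_tr, y_tr)
    train_err.append(mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5)
    test_err.append(mean_squared_error(y_te, model.predict(X_te)) ** 0.5)
```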

# Best Model (According to test performance)

As the `Train error using L2` & `Test error using L2` graphs show, the train error is almost the same for every lambda value, but the test error differs. For lambda = 1 there is some test error, whereas for lambda = 1/100000 the test error is huge. So according to this graph, `lambda = 1/100` gives the best model.

# My Contribution

I went through various tutorials, understood the code & implemented this on my own. I added data points & experimented with multiple degrees, captured the train & test error, and plotted the graphs.

# Challenges

The first challenge was to fit the model at many degrees; the `pipeline` module from `sklearn` fixed this.

The next was to prevent overfitting; increasing the data & using `L2 Regularisation` fixed this.

# Experiments & Findings

**Experiment** — tried additional lambda values (1/1000000, 1/10000000) to see whether the train & test error increase or decrease.

**Finding** — as we make lambda even smaller, the `test error` keeps increasing.

# What’s Next

**Ensemble Technique**

You can read more about it here. Find the notebook here.

# References

https://medium.datadriveninvestor.com/regression-in-machine-learning-296caae933ec

https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning

https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

https://medium.com/@minions.k/ridge-regression-l1-regularization-method-31b6bc03cbf

https://medium.com/all-about-ml/lasso-and-ridge-regularization-a0df473386d5

https://en.wikipedia.org/wiki/Ridge_regression

*Kindly let me know your feedback in comment section.*