## Overfitting using higher order Linear Regression

##### 8th April 2021
Poor predictions from a machine learning model usually come down to overfitting or underfitting. In this post, let's go over both scenarios.

## My Contribution

#### Underfitting

A model is said to be underfitting when it cannot generalize either on the training data or on unseen data. In this case the loss is high for both. An underfitting model is not a suitable model, and it is usually obvious because of its poor performance. Often this scenario can be overcome simply by tweaking the hyperparameters or trying a different machine learning model.
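As a minimal, self-contained sketch of the underfitting signature (variable names here are illustrative, not from the experiment below): a straight line fitted to clearly non-linear data leaves the loss high on both the training points and the held-out points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 40)

# A degree-1 model is too simple for a sinusoid: it underfits
model = LinearRegression().fit(x[:20], y[:20])
train_rmse = np.sqrt(mean_squared_error(y[:20], model.predict(x[:20])))
test_rmse = np.sqrt(mean_squared_error(y[20:], model.predict(x[20:])))
print(train_rmse, test_rmse)  # both errors stay high
```

Both RMSE values come out well above the noise level, which is the tell-tale sign of underfitting.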

#### Overfitting

This scenario refers to a model that performs well only on the training dataset but not on unseen data. It happens when a model learns the noise and details in the training data to the extent that it predicts poorly on new data. A developer can run into overfitting when the model is too complex for the training data, or when there is too little training data.
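A condensed sketch of that effect (assumed toy data, separate from the experiment below): a degree-9 polynomial can pass through all 10 training points almost exactly, so the training error collapses while the test error blows up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 10).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.2, 10)
x_test = rng.uniform(0, 1, 10).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.2, 10)

# 10 parameters for 10 points: the model memorizes the training set
poly = PolynomialFeatures(9)
model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(poly.transform(x_train))))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(poly.transform(x_test))))
print(train_rmse, test_rmse)  # near-zero train error, much larger test error
```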

#### How to solve the problem of overfitting?

• Increase training dataset
• Model regularization
• Reduce model complexity
• Reduce model complexity

#### Experiment and Results

Import Packages

```
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
```

Define true function to plot on the graph

```
def true_function():
    # Plot the noise-free target sin(2*pi*t) for reference
    t = np.linspace(0, 1, 100)
    s = np.sin(2 * np.pi * t)
    plt.plot(t, s, label='True function', color='black')
```

Generate 20 data pairs (X, Y) using y = sin(2\*pi\*X) + 0.2 \* N

• Use a uniform distribution between 0 and 1 for X
• Sample N from the standard normal (Gaussian) distribution
• Use 10 points for training and 10 for testing

```
x_data = np.random.uniform(0, 1, 20)
y_data = np.sin(2 * np.pi * x_data) + np.random.normal(0, 1, 20) * 0.2
plt.scatter(x_data, y_data, label='samples')
true_function()
plt.legend()

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.5, random_state=42)

x_train, y_train = zip(*sorted(zip(x_train, y_train)))
x_test, y_test = zip(*sorted(zip(x_test, y_test)))
```
Let's define a function to calculate the root mean squared error, since we will use it often.

```
def cal_rmse(true, pred):
    return np.sqrt(mean_squared_error(true, pred))
```

Fit model for degree = 0

```
polynomial_features = PolynomialFeatures(0)
x_reshape = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
model0 = LinearRegression()
model0.fit(x_reshape, y_train)

x = np.linspace(0, 1, 100)
x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
y_pred_plot = model0.predict(x_plot)

y_pred = model0.predict(x_reshape)
train_rmse = cal_rmse(y_train, y_pred)
print("Train RMSE : ", train_rmse)

plt.plot(x, y_pred_plot, label="Model")
plt.scatter(x_train, y_train, c='green', label='Samples')
true_function()
plt.title('M=0 Train')
plt.xlabel('x')
plt.ylim(-1.5 ,1.5)
plt.ylabel('y')
plt.legend()
plt.show()
```
Fit model for degree = 1

```
polynomial_features = PolynomialFeatures(1)
x_reshape= polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
model1 = LinearRegression()
model1.fit(x_reshape, y_train)

x = np.linspace(0, 1, 100)
x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
y_pred_plot = model1.predict(x_plot)

y_pred = model1.predict(x_reshape)
train_rmse = cal_rmse(y_train, y_pred)
print("Train RMSE : ", train_rmse)

plt.plot(x, y_pred_plot, label="Model")
plt.scatter(x_train, y_train, c='green', label='Samples')
true_function()
plt.title('M=1 Train')
plt.xlabel('x')
plt.ylim(-1.5 ,1.5)
plt.ylabel('y')
plt.legend()
plt.show()
```
Fit model for degree = 3

```
polynomial_features = PolynomialFeatures(3)
x_reshape= polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
model3 = LinearRegression()
model3.fit(x_reshape, y_train)

x = np.linspace(0, 1, 100)
x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
y_pred_plot = model3.predict(x_plot)

y_pred = model3.predict(x_reshape)
train_rmse = cal_rmse(y_train, y_pred)
print("Train RMSE : ", train_rmse)

plt.plot(x, y_pred_plot, label="Model")
plt.scatter(x_train, y_train, c='green', label='Samples')
true_function()
plt.title('M=3 Train')
plt.xlabel('x')
plt.ylim(-1.5 ,1.5)
plt.ylabel('y')
plt.legend()
plt.show()
```
Fit model for degree = 9

```
polynomial_features = PolynomialFeatures(9)
x_reshape= polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
model9 = LinearRegression()
model9.fit(x_reshape, y_train)

x = np.linspace(0, 1, 100)
x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
y_pred_plot = model9.predict(x_plot)

y_pred = model9.predict(x_reshape)
train_rmse = cal_rmse(y_train, y_pred)
print("Train RMSE : ", train_rmse)

plt.plot(x, y_pred_plot, label="Model")
plt.scatter(x_train, y_train, c='green', label='Samples')
true_function()
plt.title('M=9 Train')
plt.xlabel('x')
plt.ylim(-1.5 ,1.5)
plt.ylabel('y')
plt.legend()
plt.show()
```
Now let's plot the train error against the test error to identify which model is overfitting. Any guesses before we do?
If you said the degree-9 model, you are right.

```
def gen_train_test_loss():
    train_error = []
    test_error = []
    for i in range(0, 10):
        polynomial_features = PolynomialFeatures(degree=i)
        x_train_poly = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1, 1))
        x_test_poly = polynomial_features.fit_transform(np.asarray(x_test).reshape(-1, 1))

        poly_model = LinearRegression()
        poly_model.fit(x_train_poly, y_train)
        y_train_pred = poly_model.predict(x_train_poly)

        train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
        train_error.append(train_rmse)

        y_test_pred = poly_model.predict(x_test_poly)
        test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
        test_error.append(test_rmse)

    return train_error, test_error

train_error, test_error = gen_train_test_loss()
plt.plot(np.arange(0, 10, 1), train_error, color='r', label='Train Loss', marker='o')
plt.plot(np.arange(0, 10, 1), test_error, color='m', label='Test Loss', marker='o')
plt.title('Train vs Test Loss')
plt.xlabel('Degree')
plt.ylabel('Loss')
plt.ylim(0, 1)
plt.legend(loc='upper left')
plt.show()
```
Display the weights in a table

```
weights_frame = pd.DataFrame(np.ndarray((10, 5)))
weights_frame.columns = ['weights', 'M=0', 'M=1', 'M=3', 'M=9']
weights_frame['weights'] = ['W' + str(index) for index in range(10)]

# Pad the shorter coefficient lists with NaN so every column has 10 rows
weights_frame['M=0'] = model0.coef_.tolist() + [np.nan] * (10 - len(model0.coef_))
weights_frame['M=1'] = model1.coef_.tolist() + [np.nan] * (10 - len(model1.coef_))
weights_frame['M=3'] = model3.coef_.tolist() + [np.nan] * (10 - len(model3.coef_))
weights_frame['M=9'] = model9.coef_.tolist() + [np.nan] * (10 - len(model9.coef_))

with pd.option_context('display.float_format', '{:0.2f}'.format):
    print(weights_frame)
```
Now let's generate 100 more data points, fit a 9th-order model, and draw the fit.

```
x_data_100 = np.random.uniform(0,1, 100)
N_100 = np.random.normal(0, 1, 100)
y_data_100 = np.sin(2 * np.pi * x_data_100) + N_100 * 0.3
plt.scatter(x_data_100, y_data_100)
true_function()

x_data_100, y_data_100 = zip(*sorted(zip(x_data_100, y_data_100)))

polynomial_features = PolynomialFeatures(degree=9)
x_train_poly_100 = polynomial_features.fit_transform(np.asarray(x_data_100).reshape(-1, 1))

model9 = LinearRegression()
model9.fit(x_train_poly_100, y_data_100)

x = np.linspace(0, 1, 250)
x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
y_pred_plot = model9.predict(x_plot)

y_pred = model9.predict(x_train_poly_100)
train_rmse = cal_rmse(y_data_100, y_pred)
print("Train RMSE : ", train_rmse)

plt.scatter(x_data_100, y_data_100, s=25, label='Samples', color='orange')
plt.plot(x, y_pred_plot, color='red', label='Model')
true_function()
plt.title('M=9 N=100 Train')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='upper right')
plt.show()
```
Now we will regularize using the sum of squared weights (a Ridge model) and plot the fit for different lambda values.

```
def plot_lam_graphs(deg, lambda_value):
    polynomial_features = PolynomialFeatures(degree=deg)
    x_reshape = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1, 1))
    ridge_model = Ridge(alpha=lambda_value)
    ridge_model.fit(x_reshape, y_train)

    x = np.linspace(0, 1, 100)
    x_train_plot = polynomial_features.fit_transform(np.asarray(x).reshape(-1, 1))
    y_pred = ridge_model.predict(x_train_plot)

    plt.plot(x, y_pred, color='r', label='lambda = ' + str(lambda_value))
    plt.scatter(x_train, y_train, label='Samples')
    true_function()
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('M=9')
    plt.legend()
    plt.show()

lambda_values = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0.00001])
for i in lambda_values:
    plot_lam_graphs(9, i)
```
Train and test error for different lambda values

```
train_ridge_error = []
test_ridge_error = []

def calc_ridge_error(Lambda):
    polynomial_features = PolynomialFeatures(9)
    x_train_reshape = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1, 1))
    x_test_reshape = polynomial_features.fit_transform(np.asarray(x_test).reshape(-1, 1))
    ridge_model = Ridge(alpha=Lambda)
    ridge_model.fit(x_train_reshape, y_train)
    y_train_pred = ridge_model.predict(x_train_reshape)
    y_test_pred = ridge_model.predict(x_test_reshape)
    train_rmse = cal_rmse(y_train, y_train_pred)
    test_rmse = cal_rmse(y_test, y_test_pred)
    train_ridge_error.append(train_rmse)
    test_ridge_error.append(test_rmse)

lambda_param = [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
plot_lambda = np.log(lambda_param)
for i in lambda_param:
    calc_ridge_error(i)

plt.plot(plot_lambda, train_ridge_error, label='train error', marker='o', color='blue')
plt.plot(plot_lambda, test_ridge_error, label='test error', marker='o', color='red')
plt.ylim(0, 1)
plt.xlabel('Value of ln lambda')
plt.ylabel('RMS error')
plt.legend()
plt.show()
```
#### Best Model

The most suitable lambda value for the best fit is 0.001, which gives the lowest test error of 0.2686822.
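That selection can also be automated rather than read off the plot. The following is a sketch with its own toy data (variable names and the seed are illustrative, not from the notebook above): fit one Ridge model per candidate lambda and take the one with the lowest test RMSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 1, 20) * 0.2
x_train, x_test = x[:10].reshape(-1, 1), x[10:].reshape(-1, 1)
y_train, y_test = y[:10], y[10:]

poly = PolynomialFeatures(9)
x_train_poly = poly.fit_transform(x_train)
x_test_poly = poly.transform(x_test)

# One test RMSE per candidate lambda; pick the argmin
lambdas = [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
test_errors = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(x_train_poly, y_train)
    pred = model.predict(x_test_poly)
    test_errors.append(np.sqrt(mean_squared_error(y_test, pred)))

best_lambda = lambdas[int(np.argmin(test_errors))]
print(best_lambda)
```

With a different random sample the winning lambda can differ, which is exactly why the value is chosen on held-out data rather than fixed in advance.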

#### Challenges faced

• When the train and test errors were normalized, the graph was inverted and the test error appeared significantly lower than the train error. This turned out to be an incorrect implementation.
• I was unable to get a smooth prediction curve to identify whether the model was overfitting until I used np.linspace to generate evenly spaced plotting points.