Overfitting Using Higher-Order Linear Regression

8th April 2021
Poor predictions from a machine learning model usually come down to overfitting or underfitting. In this post, let's go over both scenarios.


Underfitting

A model is said to be underfitting when it cannot generalize on either the training data or unseen data. In this case the loss is high for both. An underfitting model is not a suitable model, and this is usually obvious from its poor performance. Often this scenario is easily overcome by tweaking the hyperparameters or trying different machine learning models.

Overfitting

This scenario refers to a model that performs well on the training dataset but not on unseen data. It happens when a model learns the noise and details in the training dataset to the extent that it predicts poorly on new data. A developer can easily run into overfitting when the model is too complex for the training data, or when there is very little training data.

How to solve the problem of overfitting?

  • Increase the size of the training dataset
  • Apply model regularization
  • Reduce model complexity (each remedy is sketched below)
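
A minimal sketch of what the last two remedies look like in scikit-learn terms (illustrative, untuned parameter values; the first remedy simply means collecting or generating more samples):

                            from sklearn.linear_model import Ridge
                            from sklearn.preprocessing import PolynomialFeatures

                            ridge = Ridge(alpha=0.001)               # L2 regularization; alpha sets the penalty strength
                            features = PolynomialFeatures(degree=3)  # cap the polynomial degree to limit complexity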
Experiment and Results

    Import Packages

                        
                            import numpy as np
                            import matplotlib.pyplot as plt
                            import math
                            import pandas as pd
                            
                            from sklearn.linear_model import LinearRegression
                            from sklearn.metrics import mean_squared_error
                            from sklearn.preprocessing import PolynomialFeatures
                            from sklearn.model_selection import train_test_split
                            from sklearn.linear_model import Ridge
                        
                    

    Define true function to plot on the graph

                        
                            def true_function():
                                # Plot the noise-free target y = sin(2*pi*x) for reference
                                t = np.linspace(0, 1, 100)
                                s = np.sin(2 * np.pi * t)
                                plt.plot(t, s, label='True function', color='black')
                        
                    

    Generate 20 data pairs (X, Y) using y = sin(2*pi*X) + 0.2 * N

    • Use a uniform distribution between 0 and 1 for X
    • Sample N from the standard normal (Gaussian) distribution
    • Use 10 points for train and 10 for test

                        
                            x_data = np.random.uniform(0, 1, 20)
                            y_data = np.sin(2 * np.pi * x_data) + np.random.normal(0, 1, 20) * 0.2
                            plt.scatter(x_data, y_data, label='samples')
                            true_function()
                            plt.legend()

                            # Split 10/10 into train and test, then sort each split by x for plotting
                            x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.5, random_state=42)

                            x_train, y_train = zip(*sorted(zip(x_train, y_train)))
                            x_test, y_test = zip(*sorted(zip(x_test, y_test)))
                        
                    
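    Note that the samples are drawn randomly, so the exact numbers in the rest of this post will vary between runs; calling np.random.seed before the generation cell above (an optional addition, seed value arbitrary) makes them repeatable:

                            np.random.seed(0)  # any fixed seed makes the random draws reproducible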

    Let's define a function to calculate root mean squared error, as we are going to use it often.

                        
                            def cal_rmse(true, pred):
                                return np.sqrt(mean_squared_error(true, pred))
                        
                    
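    As a quick sanity check with hypothetical values:

                            print(cal_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction -> 0.0
                            print(cal_rmse([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]))  # constant 0.5 offset -> 0.5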

    Fit model for degree = 0

                        
                            polynomial_features = PolynomialFeatures(0)
                            x_reshape = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
                            model0 = LinearRegression()
                            model0.fit(x_reshape, y_train)
    
                            x = np.linspace(0, 1, 100)
                            x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
                            y_pred_plot = model0.predict(x_plot)
    
                            y_pred = model0.predict(x_reshape)
                            train_rmse = cal_rmse(y_train, y_pred)
                            print("Train RMSE : ", train_rmse)
    
                            plt.plot(x, y_pred_plot, label="Model")
                            plt.scatter(x_train, y_train, c='green', label='Samples')
                            true_function()
                            plt.title('M=0 Train')
                            plt.xlabel('x')
                            plt.ylim(-1.5 ,1.5)
                            plt.ylabel('y')
                            plt.legend()
                            plt.show()
                        
                    
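    With degree 0 the design matrix contains only the bias column, so the fitted "curve" is just a horizontal line at the mean of the training targets, a textbook underfit.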

    Fit model for degree = 1

                        
                            polynomial_features = PolynomialFeatures(1)
                            x_reshape= polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
                            model1 = LinearRegression()
                            model1.fit(x_reshape, y_train)
    
                            x = np.linspace(0, 1, 100)
                            x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
                            y_pred_plot = model1.predict(x_plot)
    
                            y_pred = model1.predict(x_reshape)
                            train_rmse = cal_rmse(y_train, y_pred)
                            print("Train RMSE : ", train_rmse)
    
                            plt.plot(x, y_pred_plot, label="Model")
                            plt.scatter(x_train, y_train, c='green', label='Samples')
                            true_function()
                            plt.title('M=1 Train')
                            plt.xlabel('x')
                            plt.ylim(-1.5 ,1.5)
                            plt.ylabel('y')
                            plt.legend()
                            plt.show()
                        
                    

    Fit model for degree = 3

                        
                            polynomial_features = PolynomialFeatures(3)
                            x_reshape= polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
                            model3 = LinearRegression()
                            model3.fit(x_reshape, y_train)
    
                            x = np.linspace(0, 1, 100)
                            x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
                            y_pred_plot = model3.predict(x_plot)
    
                            y_pred = model3.predict(x_reshape)
                            train_rmse = cal_rmse(y_train, y_pred)
                            print("Train RMSE : ", train_rmse)
    
                            plt.plot(x, y_pred_plot, label="Model")
                            plt.scatter(x_train, y_train, c='green', label='Samples')
                            true_function()
                            plt.title('M=3 Train')
                            plt.xlabel('x')
                            plt.ylim(-1.5 ,1.5)
                            plt.ylabel('y')
                            plt.legend()
                            plt.show()
                        
                    

    Fit model for degree = 9

                        
                            polynomial_features = PolynomialFeatures(9)
                            x_reshape= polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
                            model9 = LinearRegression()
                            model9.fit(x_reshape, y_train)
    
                            x = np.linspace(0, 1, 100)
                            x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
                            y_pred_plot = model9.predict(x_plot)
    
                            y_pred = model9.predict(x_reshape)
                            train_rmse = cal_rmse(y_train, y_pred)
                            print("Train RMSE : ", train_rmse)
    
                            plt.plot(x, y_pred_plot, label="Model")
                            plt.scatter(x_train, y_train, c='green', label='Samples')
                            true_function()
                            plt.title('M=9 Train')
                            plt.xlabel('x')
                            plt.ylim(-1.5 ,1.5)
                            plt.ylabel('y')
                            plt.legend()
                            plt.show()
                        
                    
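    The degree-9 polynomial has ten parameters for ten training points, so it can pass through nearly every sample and oscillate wildly in between. Since the four blocks above differ only in the degree, they could also be folded into one helper; a minimal sketch (the name fit_and_plot is hypothetical, and it assumes the x_train, y_train, cal_rmse, and true_function defined above):

                            def fit_and_plot(degree):
                                # Expand x into polynomial features and fit ordinary least squares
                                poly = PolynomialFeatures(degree)
                                x_poly = poly.fit_transform(np.asarray(x_train).reshape(-1, 1))
                                model = LinearRegression().fit(x_poly, y_train)
                                print("Degree", degree, "Train RMSE:", cal_rmse(y_train, model.predict(x_poly)))

                                # Predict on a dense grid so the curve plots smoothly
                                x = np.linspace(0, 1, 100)
                                plt.plot(x, model.predict(poly.transform(x.reshape(-1, 1))), label='Model')
                                plt.scatter(x_train, y_train, c='green', label='Samples')
                                true_function()
                                plt.title('M=' + str(degree) + ' Train')
                                plt.xlabel('x')
                                plt.ylabel('y')
                                plt.ylim(-1.5, 1.5)
                                plt.legend()
                                plt.show()
                                return model

                            for degree in [0, 1, 3, 9]:
                                fit_and_plot(degree)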

    Now let's plot the training error against the test error to identify which model is overfitting. Any guesses before that?
    If you said the model of degree 9, you are right.

                        
                            def gen_train_test_loss():
                                train_error = []
                                test_error = []
                                for i in range(0, 10):
                                    polynomial_features = PolynomialFeatures(degree=i)
                                    x_train_poly = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1, 1))
                                    x_test_poly = polynomial_features.fit_transform(np.asarray(x_test).reshape(-1, 1))
    
                                    poly_model = LinearRegression()
                                    poly_model.fit(x_train_poly, y_train)
                                    y_train_pred = poly_model.predict(x_train_poly)
    
                                    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
                                    train_error.append(train_rmse)
    
                                    y_test_pred = poly_model.predict(x_test_poly)
                                    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
                                    test_error.append(test_rmse)
                                
                                return train_error, test_error
    
                            train_error, test_error = gen_train_test_loss()
                            plt.plot(np.arange(0, 10, 1), train_error, color='r', label='Train Loss', marker='o')
                            plt.plot(np.arange(0, 10, 1), test_error, color='m', label='Test Loss', marker='o')
                            plt.title('Train vs Test Loss')
                            plt.xlabel('Degree')
                            plt.ylabel('Loss')
                            plt.ylim(0, 1)
                            plt.legend(loc='upper left')
                            plt.show()
                        
                    
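    In a typical run, the two curves track each other up to about degree 3; beyond that, the train loss keeps falling while the test loss climbs. That widening gap is the signature of overfitting.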

    Display the learned weights in a table

                        
                            # PolynomialFeatures prepends a bias column, so LinearRegression leaves
                            # coef_[0] at 0 and stores the true W0 in intercept_; merge them for display
                            def weight_column(model):
                                w = [model.intercept_] + model.coef_[1:].tolist()
                                return w + [0] * (10 - len(w))

                            weights_frame = pd.DataFrame(np.ndarray((10,5)))
                            weights_frame.columns = ['weights','M=0','M=1','M=3','M=9']
                            weights_frame['weights'] = ['W'+str(index) for index in range(10)]

                            weights_frame['M=0'] = weight_column(model0)
                            weights_frame['M=1'] = weight_column(model1)
                            weights_frame['M=3'] = weight_column(model3)
                            weights_frame['M=9'] = weight_column(model9)

                            with pd.option_context('display.float_format', '{:0.2f}'.format):
                                print(weights_frame.replace(0,'-').head(10))
                        
                    
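    Notice how the weight magnitudes typically explode for M=9 while staying modest for the lower degrees; huge, alternating-sign weights are another fingerprint of overfitting.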

    Now let's generate 100 more data points, fit the 9th-order model again, and draw the fit

                        
                            x_data_100 = np.random.uniform(0,1, 100)
                            N_100 = np.random.normal(0, 1, 100)
                            y_data_100 = np.sin(2 * np.pi * x_data_100) + N_100 * 0.3
                            plt.scatter(x_data_100, y_data_100)
                            true_function()
    
                            x_data_100, y_data_100 = zip(*sorted(zip(x_data_100, y_data_100)))
    
                            polynomial_features = PolynomialFeatures(degree=9)
                            x_train_poly_100 = polynomial_features.fit_transform(np.asarray(x_data_100).reshape(-1, 1))
    
                            model9 = LinearRegression()
                            model9.fit(x_train_poly_100, y_data_100)
    
    
                            x = np.linspace(0, 1, 250)
                            x_plot = polynomial_features.fit_transform(x.reshape(-1, 1))
                            y_pred_plot = model9.predict(x_plot)
    
                            y_pred = model9.predict(x_train_poly_100)
                            train_rmse = cal_rmse(y_data_100, y_pred)
                            print("Train RMSE : ", train_rmse)
    
                            plt.scatter(x_data_100, y_data_100, s=25, label='Samples', color='orange')
                            plt.plot(x, y_pred_plot, color='red', label='Model')
                            true_function()
                            plt.title('M=9 N=100 Train')
                            plt.xlabel('x')
                            plt.ylabel('y')
                            plt.legend(loc='upper right')
                            plt.show()
                        
                    
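    With 100 samples, the same 9th-order model now follows the true sine curve closely instead of chasing individual points: increasing the training data tames the overfit.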

    Now we will regularize using the sum of squared weights (an L2 penalty) via a ridge model, and plot graphs for different lambda values.

                        
                            def plot_lam_graphs(deg, lambda_value):
                                polynomial_features= PolynomialFeatures(degree = deg)
                                x_reshape = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1, 1))
                                ridge_model = Ridge(alpha=lambda_value)
                                ridge_model.fit(x_reshape, y_train)
                                
                                x = np.linspace(0, 1, 100)
                                x_train_plot = polynomial_features.fit_transform(np.asarray(x).reshape(-1, 1))
                                Y_pred = ridge_model.predict(x_train_plot)
                                
                                plt.plot(x, Y_pred, color='r', label="lambda = "+str(lambda_value))
                                plt.scatter(x_train, y_train,label='Samples')
                                true_function()
                                plt.xlabel('x')
                                plt.ylabel('y')
                                plt.title("M=9")
                                plt.legend()
                                plt.show()
    
                            lambda_values = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0.00001])
                            for i in lambda_values:
                                plot_lam_graphs(9, i)
                        
                    
    [Six plots, one per regularization strength: lambda = 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]

    Train and Test error for different lambda values

                        
                            train_ridge_error = []
                            test_ridge_error  = []
    
                            def calc_ridge_error(Lambda):
                                polynomial_features = PolynomialFeatures(9)
                                x_train_reshape = polynomial_features.fit_transform(np.asarray(x_train).reshape(-1,1))
                                x_test_reshape = polynomial_features.fit_transform(np.asarray(x_test).reshape(-1,1))
                                ridge_model = Ridge(Lambda)
                                ridge_model.fit(x_train_reshape, y_train)
                                y_train_pred = ridge_model.predict(x_train_reshape)
                                y_test_pred = ridge_model.predict(x_test_reshape)
                                train_rmse = cal_rmse(y_train, y_train_pred)
                                test_rmse = cal_rmse(y_test, y_test_pred)
                                train_ridge_error.append(train_rmse)
                                test_ridge_error.append(test_rmse)
    
    
                            lambda_param = [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
                            plot_lambda = [np.log(lam) for lam in lambda_param]  # x-axis in ln(lambda)
                            for i in lambda_param:
                                calc_ridge_error(i)
    
                            plt.plot(plot_lambda, train_ridge_error, label = "train error", marker='o', color = 'blue')
                            plt.plot(plot_lambda, test_ridge_error, label = "test error", marker='o', color = 'red')
                            plt.ylim(0, 1)
                            plt.xlabel('Value of ln Lambda')
                            plt.ylabel('Error RMS')
                            plt.legend()
                        
                    
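    Rather than reading the best value off the plot, it can also be picked programmatically from the lists filled above (a small sketch):

                            best_idx = int(np.argmin(test_ridge_error))
                            print("Best lambda:", lambda_param[best_idx],
                                  "with test RMSE:", test_ridge_error[best_idx])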

    Best Model

    The most suitable lambda value for the best fit is 0.001, which yields the lowest test error of 0.2686822.

    Challenges faced

    • When the train and test errors were normalized, the graph came out inverted, with the test error significantly lower than the train error. This was a wrong implementation.
    • I was unable to get a smooth prediction curve to identify whether the model was overfitting until I used np.linspace.

    References

    1. https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py
    2. https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
    3. https://towardsdatascience.com/polynomial-regression-bbe8b9d97491
    4. https://utkuufuk.com/2018/05/04/learning-curves/
    5. https://www.kaggle.com/mtax687/l2-regularization-of-neural-network-using-numpy
    6. https://towardsdatascience.com/polynomial-regression-with-scikit-learn-what-you-should-know-bed9d3296f2