August 23, 2021, Learn eTutorial

We already discussed the overfitting problem of a machine-learning model which makes the model inaccurate predictions. It will affect the efficiency of the model. 

What is regularization?

Regularization is the answer to the overfitting problem. We can say that regularization prevents the model overfitting problem by adding some more information into it. 

What is the overfitting of a machine-learning model?


Before going forward we need to know what the overfitting problem is. It happens when a machine learning model performs well with the training dataset and it fails or causes errors when we use the test dataset. We cannot able to use that model to predict the output of the real-valued input variables. To deal with that problem we can use regularization.

How does the regularization method help?

In the regularization method we are using all the features in the model but what we reduce is the magnitude of the features so that it can be a more accurate and more generalized model to give good results. In other words, regularization reduces to the usage of a more complex model so as to reduce the chance for overfitting.

Regularization methods add additional constraints to do two things: 

  1.     Solve an ill-posed problem (a problem without a unique and stable solution)
  2.     Prevent model overfitting

In machine learning, regularization problems impose an additional penalty on the cost function. This penalty controls the model complexity - larger penalties equal simpler models. This allows the model to not overfit the data and follows Occam's razor: the simple model is usually the most correct.

The general form of a regularization problem is shown below:


Where f(x) is a loss function (like the residual sum of squares), λ is the regularization term that controls the model complexity, and g(x) is another function that is a function of the regression coefficients. 

Different g(x) functions are essentially different machine learning algorithms.


Regularization techniques

We already discussed the two main techniques used in regularization, which are

  1. L2 regularization or Ridge regression
  2. L1 regularization or Lasso regression
  3. Dropout regularization

Ridge Regression

Ridge regression is one of the flexible and powerful regression analysis which is used when there is a high correlation between the input variables. If the co-linearity is very high, we will add some bias into the ridge regression method. The amount we add in the bias is called a penalty in ridge regression. Ridge regression has less susceptible to the overfitting problem. 

Ridge will help to solve problems with a large number of parameters and have a high correlation between them. Ridge is also used to reduce the complexity of a model that we call L2 regularization.

Ridge Regression is one form of regularization that reduces the impact of correlated features in the dataset. It does that by penalizing larger coefficients, shrinking the coefficients for non-informative specific features. 

Fumandentals and Interpretation

The mathematical formulation for Ridge regression is shown below:


The left-hand term is the residual sum of squares - a measure of the model error against the actual value. The right-hand term is the penalty term. Note that the penalty is a function of β^2, which acts as a constraint for the linear model. 

The take-home for Ridge regression is that the model prefers to take smaller β estimates that are close to 0, which reduces the penalty term.

Note about tuning the regularization parameter λ

The regularization term λ has a profound impact on model performance and should be selected with care. One method to do that is to simply see how different values for λ model performance. 

We'll write up a method to do just that using cross-validation, an essential strategy for evaluating models beyond this tutorial's scope. 

Model Training

We’ll use the Boston data described in the linear regression tutorial to predict the median house prices.

First, load the Boston dataset.

# Import the dataset
from sklearn.datasets import load_boston

data = load_boston() 

# Import general-use data science libraries
import pandas as pd
import numpy as np

# Import features
df = pd.DataFrame(, 

# Import target (median housing prices)
target = pd.DataFrame(, 

Then split into training and test sets.

from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(df, target, test_size=0.8, random_state=1)

print("Dataset sizes after train_test_split()")
print("Xtrain size: ", str(np.shape(Xtrain)))
print("Xtest size: ", str(np.shape(Xtest)))
print("Ytrain size: ", str(np.shape(Ytrain)))
print("Ytest size: ", str(np.shape(Ytest)))

Fit the Ridge model with hyperparameter tuning

# Fit the model
from sklearn.linear_model import RidgeCV
import numpy as np
lambdas = np.array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
      1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06])
ridge_mdl = RidgeCV(alphas=lambdas).fit(Xtrain, Ytrain)

Lasso Regression

Like the Ridge regression, Lasso regression also used to reduce the complexity of the model by adding some penalty. The only difference is we add the actual amount to a penalty in Lasso, as in ridge we use a square of the amount.

The Least Absolute Shrinkage and Selection Operation (LASSO) is another form of regularization. it allows regression coefficients to shrink all the way to 0. This makes the model more sparse and computationally efficient.  . Lasso regression is also called L1 regularization

Additionally, LASSO performs feature selection: features with a coefficient of 0 are essentially ignored.

Fumandentals and Interpretation

The mathematical formulation for LASSO is shown below:


The only thing that has changed from the Ridge formulation is the absolute value in the penalty term.  

The intuition of Ridge and LASSO through a geometry perspective


The graph above is a visual representation of LASSO (left) and Ridge (right) regression. The blue cloud represents the least-squares operation, and the center value is the least-squares estimate for β

However, the penalty term forces us not to take that value and instead choose values in the turquoise boundaries. The square term in Ridge regression results in a circle if we consider a two-dimensional problem (β1222). In contrast, the absolute value term in LASSO creates a diamond constraint. 

Using the same least-squares model, the circle constraint in Ridge regression makes it likely to shrink the coefficient w2 to be a small value approaching 0 (shown in red), but not necessarily hitting 0. In contrast, LASSO is more likely to penalize the coefficient w2 more harshly and set it to 0 if it is not informative.

Model Training

As we did with the linear regression and Ridge regression models, we’ll train LASSO on the Boston dataset with hyperparameter tuning.

# Fit the model
from sklearn.linear_model import LassoCV
import numpy as np
lambdas = np.array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
      1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06])
lasso_mdl = LassoCV(alphas=lambdas).fit(Xtrain, Ytrain)