Regularization in Machine Learning

Edna Figueira Fernandes
2 min read · Jun 28, 2020

An important issue to watch for when building a machine learning model is overfitting. Overfitting happens when the model learns the noise in the training data rather than the underlying pattern, so it performs poorly when tested on new data. One technique used to reduce overfitting is regularization, and that is the focus of this blog post.

Regularization is a technique used to discourage the model from fitting the training data too closely and therefore reduce the risk of overfitting. Two common types of regularization are lasso regression and ridge regression.

Lasso Regression, also known as L1 regularization, penalizes the model for using “extra features”.

The goal of a linear regression model is to minimize the sum of the squared distances between the line and the data points. Lasso goes a step further and also penalizes the size of the model: it adds a penalty term equal to a tuning parameter (lambda) multiplied by the sum of the absolute values of the coefficients. The more features the model uses, and the larger their coefficients, the higher the penalty. As a result, if keeping a coefficient “does not improve the fit enough to outweigh the penalty”, that coefficient is shrunk all the way to zero and the corresponding feature is effectively dropped.
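
To make this concrete, here is a minimal sketch using scikit-learn’s `Lasso`, whose `alpha` argument plays the role of lambda. The synthetic data is made up purely for illustration: ten features, of which only three actually influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: 100 samples, 10 features,
# but only the first 3 features actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# alpha is the penalty strength (the lambda in the text above).
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Coefficients on the irrelevant features come out exactly zero:
# the lasso penalty has effectively removed them from the model.
print(lasso.coef_)
```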

The concept behind ridge regression (L2 regularization) is the same as lasso’s, but instead of adding the absolute values of the coefficients, it adds their squares, with the intent of keeping all the coefficients small. Because this penalty shrinks coefficients toward zero without setting them exactly to zero, it does not reduce the number of features used in the model.
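
For comparison, here is the same hypothetical setup with scikit-learn’s `Ridge` (again, the data and `alpha` value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same hypothetical data as in the lasso sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

# All ten coefficients remain nonzero: the squared penalty shrinks
# them toward zero but never eliminates a feature outright.
print(ridge.coef_)
```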

How to choose between the two?

Use L1 when your dataset has many features and you want to reduce their number. This type of regularization is well suited for feature selection, where you are trying to keep only the features most relevant to your model. If that is not the case, choose L2 regularization. It is the more commonly used of the two: it keeps the coefficients uniformly small and tends to give better results in predictive modeling. One way to decide empirically is to cross-validate both, as in the sketch below.
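
The sketch below compares the two penalties with 5-fold cross-validation on the same hypothetical data used earlier; in practice you would keep whichever scores better on your own held-out data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical data, as in the earlier sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Score each penalty with 5-fold cross-validation (R^2 by default)
# and compare the averages to pick between L1 and L2.
for model in (Lasso(alpha=0.1), Ridge(alpha=0.1)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```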
