Pros and cons of regularization techniques

N.N
5 min readMar 15, 2021

In my previous post, I talked about the mechanics, pros, and cons of some optimization techniques; today I want to discuss some regularization techniques. I will try to explain them in the simplest way possible.

Why regularize?

In order to understand why we should regularize our models, we must first understand what bias and variance are. Suppose, for example, that you are training a model that can determine whether a picture is of a cat or a dog, or one that estimates whether a person is prone to certain illnesses given their age and weight. You give it some inputs and it returns predictions, so you make a table comparing how many results are correct and incorrect.

You notice that your model is very good at recognizing cats, with an error rate of 8%, but terrible at recognizing dogs, with an error rate of 50%. And if you extend it to a multiclass model that also recognizes horses, it turns out the error rate on horses is above 60%. This doesn't mean your model is good at recognizing cats in general; it means there is a high bias toward cats in your training model.

That doesn’t sound good, so going back to the original model, you spend a considerable amount of time training it some more, you update your weights, and finally your model is able to recognize every image with an error rate near 0%. Sounds great; actually, it sounds too good to be true. Are we sure the model you just trained will perform well on a new data set, or has it just become good at recognizing the images of the training set? So you go ahead and plot your function expecting some kind of linear trend, but instead you get a weird line with no distinguishable trend. You want your model to be able to predict results with new data,

and when you feed your model a new dataset, you see that it does not perform well.

Source: Andrew Ng, Deeplearning.ai on YouTube

This is where regularization comes in: you use regularization to reduce overfitting (variance) in your model so that it generalizes better and makes better predictions on data it has not seen before.

Implementing regularization basically means adding a term to our loss function that penalizes the model for having large weights (loss + x).

L2 regularization

The term (x) we add in L2 regularization is the sum of the squared norms of the weight matrices, multiplied by a constant regularization parameter and divided by 2 times the number of inputs of the network:

x = (λ / 2m) Σ ‖w‖²

The norm, in general, is a function that assigns a positive size to a vector. Lambda is a parameter we give our regularization function and can be any number greater than or equal to 0.
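The formula above can be sketched in numpy. This is a minimal illustration: the function name and the assumption that the weights live in a dict of matrices are my own choices, not part of any particular library.

```python
import numpy as np

def l2_reg_cost(cost, lam, weights, m):
    """Return the L2-regularized cost: the unregularized cost plus
    (lambda / (2 * m)) times the sum of the squared norms of the
    weight matrices. `weights` is assumed to be a dict of numpy arrays,
    `lam` the regularization parameter, and `m` the number of inputs."""
    penalty = sum(np.sum(np.square(W)) for W in weights.values())
    return cost + (lam / (2 * m)) * penalty
```

Note that with lambda = 0 the penalty vanishes and the cost is unchanged, which is why lambda is allowed to start from 0.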

The intuition behind L2 (ridge) regression is that you trade a higher bias for less variance. This, ELI10, means: you have a linear trend in your training set that may fit all the points

Source: StatQuest on YouTube

but if you use the same line to try to fit all the points in another set, it will do much worse.

Source: StatQuest on YouTube

so we modify the line fit to the training set (increasing the bias) in order to make sure it performs better on other sets (reducing the variance).

L1 regularization

This one, also known as lasso, is similar to L2 regularization, but instead of using the squared norms of the weights, we use the absolute values of the weights. Still, it has the same objective as L2: increase the bias in order to reduce the variance.
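The same sketch can be adapted for L1. Scaling conventions vary between sources (some divide by m, some by 2m); the λ/m version below is one common choice, and, as before, the function name and dict-of-matrices layout are illustrative assumptions.

```python
import numpy as np

def l1_reg_cost(cost, lam, weights, m):
    """Return the L1-regularized cost: the unregularized cost plus
    (lambda / m) times the sum of the absolute values of the weights.
    `weights` is assumed to be a dict of numpy arrays."""
    penalty = sum(np.sum(np.abs(W)) for W in weights.values())
    return cost + (lam / m) * penalty
```

A practical consequence of using absolute values instead of squares is that L1 tends to drive some weights exactly to zero, effectively selecting features.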

Dropout

The idea behind dropout regularization is to make our model generalize better by randomly dropping out some of the nodes in a given layer during training, hence the name. It is done during the forward propagation phase by generating a random binomial mask using the keep probability, multiplying the activated outputs of that layer by the mask, and dividing by the keep probability.

( A x D ) / k
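This (A x D) / k step, often called inverted dropout, can be sketched in numpy as follows; the function name is an illustrative choice.

```python
import numpy as np

def dropout_forward(A, keep_prob):
    """Apply inverted dropout to the activations A of one layer:
    sample a binary mask D from a binomial distribution with
    probability keep_prob, zero out the dropped units, and divide
    by keep_prob so the expected activation stays the same."""
    D = np.random.binomial(1, keep_prob, size=A.shape)
    return (A * D) / keep_prob, D
```

Dividing by the keep probability is what keeps the expected scale of the activations constant, so the next layer does not need to know dropout was applied. At test time dropout is simply switched off.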

Data Augmentation

This refers to the process of creating additional modified data, such as cropped, rotated, or zoomed images, in order to help our model generalize.
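Two of those transformations, a random horizontal flip and a random crop, can be sketched with plain numpy. The helper name and the 10% crop margin are illustrative choices, not a standard API.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and randomly cropped copy of an
    H x W (x C) image array. The crop keeps 90% of each dimension,
    with the crop window's top-left corner chosen at random."""
    if rng.random() < 0.5:
        image = np.fliplr(image)            # horizontal flip
    h, w = image.shape[:2]
    ch, cw = h - h // 10, w - w // 10       # cropped size (90%)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw]
```

In practice, frameworks apply such transformations on the fly during training, so each epoch effectively sees a slightly different dataset.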

Early Stopping

Why would we want to stop our model early? If we train our model over and over, we will notice that at some point we have overtrained it. Early stopping halts the training process before our model starts overfitting. We could define a number of epochs (or passes over the training data), which solves the problem of training indefinitely,

for epoch in range(5000):
    train(X)

but we cannot be sure our model has reached the minimum of the cost function. An alternative is to stop when the updates to the cost become too small:

abs(cost[i + 1] - cost[i]) < 0.00001 ? stop : continue

This ensures the loss is minimized and that the model doesn't waste computational power iterating uselessly, but it doesn't prevent overfitting.

So we have a third option: stop training when we start to see a divergence between the training cost and the validation cost.
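One common way to detect that divergence is patience-based early stopping: keep training while the validation cost improves, and stop once it has failed to improve for a fixed number of epochs. The sketch below assumes two user-supplied callables, `train_step` (runs one epoch) and `val_cost` (returns the current validation cost); both names are placeholders.

```python
def train_with_early_stopping(train_step, val_cost,
                              max_epochs=5000, patience=10):
    """Train until the validation cost stops improving for `patience`
    consecutive epochs, then stop and return the best cost seen."""
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        cost = val_cost()
        if cost < best:
            best, wait = cost, 0          # improvement: reset patience
        else:
            wait += 1                     # no improvement this epoch
            if wait >= patience:
                break                     # validation cost is diverging
    return best
```

In practice you would also save the model weights at each new best cost, so that stopping restores the model from just before overfitting began.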

As with optimization techniques, regularization techniques can be used in conjunction to increase the effectiveness of your model.


N.N

Software engineering student at Holberton school