Regularization

To reduce variance or prevent overfitting, regularization is one of the tools to use.

Logistic Regression

In logistic regression, we want to minimize the cost function

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}),$$

where $w \in \mathbb{R}^{n_x}$ and $b \in \mathbb{R}$.

To add regularization to logistic regression, you add $\frac{\lambda}{2m}$ times the squared norm of $w$. $\lambda$ is called the regularization parameter.

L2 Regularization: $\frac{\lambda}{2m}\lVert w \rVert_2^2 = \frac{\lambda}{2m}\sum_{j=1}^{n_x} w_j^2 = \frac{\lambda}{2m} w^T w$. L2 regularization is much more common in deep neural networks.

L1 Regularization: $\frac{\lambda}{2m}\lVert w \rVert_1 = \frac{\lambda}{2m}\sum_{j=1}^{n_x} \lvert w_j \rvert$.

L1 regularization will make $w$ sparse (more zeros).
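As a rough numpy sketch of how these penalty terms attach to the logistic regression cost (the function name, the `lambd` variable, and the array shapes below are illustrative assumptions, not from the lecture):

```python
import numpy as np

def logistic_cost(w, b, X, Y, lambd, reg="l2"):
    """Cross-entropy cost for logistic regression with an optional penalty.

    Assumed shapes: X is (n_x, m), Y is (1, m), w is (n_x, 1);
    `lambd` stands for the regularization parameter lambda.
    """
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))           # sigmoid(w^T x + b)
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

    if reg == "l2":                                     # (lambda / 2m) * ||w||_2^2
        penalty = (lambd / (2 * m)) * np.sum(w ** 2)
    else:                                               # (lambda / 2m) * ||w||_1
        penalty = (lambd / (2 * m)) * np.sum(np.abs(w))

    return cross_entropy + penalty
```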

Neural Network

In a neural network, we have a cost function

$$J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}).$$

To add regularization to the neural network, you add $\frac{\lambda}{2m}$ times $\sum_{l=1}^{L} \lVert w^{[l]} \rVert_F^2$,

where $\lVert w^{[l]} \rVert_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left(w_{ij}^{[l]}\right)^2$ because $w^{[l]}$ is an $(n^{[l]}, n^{[l-1]})$ matrix. This is called the Frobenius norm.

With regularization, the gradient becomes $dw^{[l]} = (\text{from backpropagation}) + \frac{\lambda}{m} w^{[l]}$, and the parameter update is $w^{[l]} := w^{[l]} - \alpha \, dw^{[l]}$.
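A minimal sketch of the same idea for a network, assuming the weights live in a dict like `{"W1": ..., "W2": ...}` and the backprop gradients in a matching dict of `dW1`, `dW2`, ... arrays (both naming conventions are assumptions for illustration):

```python
import numpy as np

def l2_cost_term(weights, lambd, m):
    """Sum of squared Frobenius norms over all layers, scaled by lambda / 2m."""
    return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights.values())

def add_l2_to_grads(grads, weights, lambd, m):
    """Add the (lambda / m) * W term to each dW coming out of backpropagation."""
    for name, W in weights.items():
        grads["d" + name] = grads["d" + name] + (lambd / m) * W
    return grads
```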

L2 regularization is also sometimes called weight decay.

Mathematically,

$$w^{[l]} := w^{[l]} - \alpha \left[(\text{from backprop}) + \frac{\lambda}{m} w^{[l]}\right] = \left(1 - \frac{\alpha\lambda}{m}\right) w^{[l]} - \alpha \, (\text{from backprop}).$$

It is called 'weight decay' because the first term multiplies $w^{[l]}$ by $\left(1 - \frac{\alpha\lambda}{m}\right)$, which is less than 1.
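A tiny numeric check of this equivalence, with made-up values for `W`, the backprop gradient, `alpha`, `lambd`, and `m`:

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(3, 2)
dW_backprop = np.random.randn(3, 2)   # stands in for the gradient from backprop
alpha, lambd, m = 0.1, 0.7, 50

# Standard L2-regularized update:  W - alpha * (dW_backprop + (lambda/m) * W)
update_a = W - alpha * (dW_backprop + (lambd / m) * W)

# Weight-decay form: shrink W by (1 - alpha*lambda/m), then take the usual step.
update_b = (1 - alpha * lambd / m) * W - alpha * dW_backprop

print(np.allclose(update_a, update_b))   # True: the two forms are identical
```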

Why does regularization reduce overfitting?

So why is it that shrinking the L2 norm or the Frobenius norm of the parameters might cause less overfitting? If the regularization parameter $\lambda$ is set very large, it pushes the weights so close to zero for a lot of hidden units that it basically zeroes out much of the impact of those hidden units. If that is the case, this much simplified neural network becomes a much smaller neural network; in fact, it is almost like a logistic regression unit, but stacked multiple layers deep.

Another reason

Take $g(z) = \tanh(z)$. When $z$ is close to zero, we are using the linear regime of the tanh function. If $\lambda$ is large, then $w^{[l]}$ is small, and $z^{[l]}$ is also small since $z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$. If $z$ stays close to 0, then every layer is roughly linear, so the whole network computes something close to a linear function and cannot fit very complicated, highly non-linear decision boundaries.
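A quick numeric illustration of that linear regime:

```python
import numpy as np

# For small |z|, tanh(z) is approximately z (the linear regime mentioned above).
z_small = np.array([-0.10, -0.01, 0.01, 0.10])
z_large = np.array([-3.0, -1.5, 1.5, 3.0])

print(np.tanh(z_small))   # ~ [-0.0997, -0.0100, 0.0100, 0.0997]  close to z itself
print(np.tanh(z_large))   # ~ [-0.995, -0.905, 0.905, 0.995]      clearly non-linear
```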