Weight Initialization for Deep Networks

In order to avoid vanishing and exploding gradients:

For a single neuron with n inputs, z = w_1*x_1 + ... + w_n*x_n (ignoring b and assuming a linear activation), so the larger n is, the smaller we want each w_i to be to keep z at a reasonable scale.

One reasonable thing to do would be to set the variance of w_i to 1/n, where n is the number of input features going into the neuron.

wl = np.random.randn(shape) * np.sqrt(1/n_prev)  # n_prev = n^(l-1), number of units in layer l-1
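
As a quick sanity check, here is a minimal NumPy sketch (not from the course; the sizes and variable names are mine) showing that scaling the weights by sqrt(1/n) keeps the pre-activation z = w·x at roughly unit variance no matter how many inputs n a neuron has, whereas unscaled weights make its variance grow with n.

import numpy as np

# With Var(w_i) = 1/n the pre-activation z = w.x keeps roughly unit variance
# regardless of n; unscaled weights make its variance grow with n.
rng = np.random.default_rng(0)

for n in (10, 100, 1000):
    x = rng.standard_normal((n, 10000))                  # 10,000 sample inputs with n features each
    w_scaled = rng.standard_normal((1, n)) * np.sqrt(1 / n)
    w_raw = rng.standard_normal((1, n))                   # no variance scaling
    print(n, np.var(w_scaled @ x), np.var(w_raw @ x))     # ~1.0 vs ~n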

For ReLU activation, a variance of 2/n^(l-1) works better (often called He initialization).

wl = np.random.randn(shape) * np.sqrt(2/n_prev)  # n_prev = n^(l-1)
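
To see why the extra factor of 2 matters for ReLU, here is an illustrative toy forward pass (my own sketch, not the course's code; the depth and width are arbitrary). It pushes a random batch through 30 ReLU layers and compares the sqrt(1/n) and sqrt(2/n) scalings: with 1/n the activations shrink layer after layer, with 2/n their scale stays roughly constant.

import numpy as np

# Toy forward pass: a random batch through 30 fully connected ReLU layers of
# width 256, comparing Var(w) = 1/n against Var(w) = 2/n.
rng = np.random.default_rng(1)
n = 256

for scale in (1.0, 2.0):
    a = rng.standard_normal((n, 64))         # batch of 64 random inputs
    for _ in range(30):
        W = rng.standard_normal((n, n)) * np.sqrt(scale / n)
        a = np.maximum(0, W @ a)             # ReLU activation
    print(f"sqrt({scale}/n): std of final activations = {a.std():.6f}")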

For tanh activation, a variance of 1/n^(l-1) works better. This is called Xavier initialization.

wl = np.random.randn(shape) * np.sqrt(1/n_prev)  # n_prev = n^(l-1)

Another version, from Yoshua Bengio and colleagues, uses for tanh activation a variance of 2/(n^(l-1) + n^(l)).

wl = np.random.randn(shape) * np.sqrt(2/(n_prev + n_curr))  # n_prev = n^(l-1), n_curr = n^(l)
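
Putting the variants together, a small helper like the one below could wrap the three scalings; the function name initialize_weights and the mode keyword are hypothetical illustrations, not something defined in the course.

import numpy as np

# Hypothetical helper (names are mine) collecting the three scalings above:
# He (2/n_prev) for ReLU, Xavier (1/n_prev) for tanh, and the Bengio variant
# (2/(n_prev + n_curr)).
def initialize_weights(n_prev, n_curr, mode="he", rng=None):
    rng = rng or np.random.default_rng()
    if mode == "he":          # ReLU layers
        scale = np.sqrt(2 / n_prev)
    elif mode == "xavier":    # tanh layers
        scale = np.sqrt(1 / n_prev)
    elif mode == "bengio":    # tanh layers, alternative scaling
        scale = np.sqrt(2 / (n_prev + n_curr))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rng.standard_normal((n_curr, n_prev)) * scale

# Example: weights for a layer with 512 inputs and 256 units, ReLU activation.
W = initialize_weights(512, 256, mode="he")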

This doesn’t completely solve vanishing/exploding gradients, but it helps.

The variance scaling factor is one more hyperparameter you can tune, but its effect is usually smaller than that of other hyperparameters.