Fitting batch norm into a neural network

For a deep neural network, here is how batch normalization is implemented. In each layer $l$, the linear output $z^{[l]}$ is computed first, then batch norm (governed by $\beta^{[l]}$ and $\gamma^{[l]}$) produces $\tilde{z}^{[l]}$, and the activation function is applied to $\tilde{z}^{[l]}$:

$$X \xrightarrow{W^{[1]},\, b^{[1]}} z^{[1]} \xrightarrow{\beta^{[1]},\, \gamma^{[1]}} \tilde{z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{z}^{[1]}) \xrightarrow{W^{[2]},\, b^{[2]}} z^{[2]} \rightarrow \cdots$$

and the parameters are:

$$W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}, \qquad \beta^{[1]}, \gamma^{[1]}, \ldots, \beta^{[L]}, \gamma^{[L]}$$

You then compute the gradients ($dW^{[l]}$, $db^{[l]}$, $d\beta^{[l]}$, $d\gamma^{[l]}$) and update the parameters, for example $\beta^{[l]} := \beta^{[l]} - \alpha\, d\beta^{[l]}$.

Note that this $\beta^{[l]}$ has nothing to do with the hyperparameter $\beta$ used for momentum (or RMSprop/Adam).
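As a concrete sketch of the forward pass above (a minimal NumPy version with made-up layer sizes and ReLU activations, not the notes' exact implementation):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize z across the mini-batch, then rescale by gamma and shift by beta."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

def forward(X, params):
    """Per layer: linear step -> batch norm -> activation."""
    a = X
    for W, b, gamma, beta in params:
        z = W @ a + b                          # z^[l] = W^[l] a^[l-1] + b^[l]
        z_tilde = batch_norm(z, gamma, beta)   # governed by beta^[l], gamma^[l]
        a = np.maximum(0.0, z_tilde)           # a^[l] = g^[l](z_tilde^[l])
    return a

# Tiny example: 3 inputs -> 5 hidden -> 2 outputs, mini-batch of 8 examples.
rng = np.random.default_rng(0)
sizes = [3, 5, 2]
params = [(rng.standard_normal((n, n_prev)), np.zeros((n, 1)),
           np.ones((n, 1)), np.zeros((n, 1)))
          for n_prev, n in zip(sizes[:-1], sizes[1:])]
print(forward(rng.standard_normal((3, 8)), params).shape)   # (2, 8)
```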

This can be easily implemented with a neural network framework. For example, with TensorFlow you can use tf.nn.batch_normalization.
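For instance, a minimal sketch of that call (assuming TensorFlow 2 and a made-up activation tensor z; the per-batch statistics come from tf.nn.moments):

```python
import tensorflow as tf

# Hypothetical pre-activations: mini-batch of 64 examples, 10 hidden units.
z = tf.random.normal([64, 10])

# Learnable batch-norm parameters: gamma (scale) and beta (offset), one per unit.
gamma = tf.Variable(tf.ones([10]))
beta = tf.Variable(tf.zeros([10]))

# Mean and variance over the batch dimension.
mean, variance = tf.nn.moments(z, axes=[0])

# z_tilde = gamma * (z - mean) / sqrt(variance + eps) + beta
z_tilde = tf.nn.batch_normalization(z, mean, variance,
                                    offset=beta, scale=gamma,
                                    variance_epsilon=1e-8)
```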

Ioffe, Sergey and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ICML (2015).

Working with mini-batches

Similarly, batch normalization can be applied to each mini-batch $X^{\{t\}}$ as follows:

$$X^{\{1\}} \xrightarrow{W^{[1]},\, b^{[1]}} z^{[1]} \xrightarrow{\beta^{[1]},\, \gamma^{[1]}} \tilde{z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{z}^{[1]}) \rightarrow \cdots$$

and the parameters are

$$W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}, \qquad \beta^{[1]}, \gamma^{[1]}, \ldots, \beta^{[L]}, \gamma^{[L]}$$

Notice that $z^{[l]}$ is computed as $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, and batch norm will look at the mini-batch and first normalize $z^{[l]}$ to mean 0 and unit variance, and then rescale it by $\gamma^{[l]}$ and $\beta^{[l]}$. This means that whatever the value of $b^{[l]}$ is, it is going to get subtracted out, because during the batch norm step you compute the mean of $z^{[l]}$ and subtract it. Adding any constant to all of the examples in the mini-batch therefore changes nothing, so $b^{[l]}$ can be eliminated and the parameters are

$$W^{[1]}, \beta^{[1]}, \gamma^{[1]}, \ldots, W^{[L]}, \beta^{[L]}, \gamma^{[L]}$$

and you will compute

$$z^{[l]} = W^{[l]} a^{[l-1]}, \qquad z^{[l]}_{\text{norm}} = \frac{z^{[l]} - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \qquad \tilde{z}^{[l]} = \gamma^{[l]} z^{[l]}_{\text{norm}} + \beta^{[l]}$$

Since the shape of $z^{[l]}$ and $b^{[l]}$ is $(n^{[l]}, 1)$, the shape of $\beta^{[l]}$ and $\gamma^{[l]}$ is also $(n^{[l]}, 1)$.
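To see the bias cancellation numerically, here is a small made-up check (NumPy, with arbitrary values; not tied to any particular network):

```python
import numpy as np

def bn(z, gamma, beta, eps=1e-8):
    # Subtract the mini-batch mean, divide by the std, then rescale and shift.
    z_norm = (z - z.mean(axis=1, keepdims=True)) / np.sqrt(z.var(axis=1, keepdims=True) + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(1)
Wa = rng.standard_normal((4, 32))        # W^[l] a^[l-1]: 4 units, mini-batch of 32
b = rng.standard_normal((4, 1))          # bias b^[l], broadcast over the mini-batch
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))

# The mean subtraction removes the constant b, so both outputs are identical.
print(np.allclose(bn(Wa + b, gamma, beta), bn(Wa, gamma, beta)))   # True
```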

Implementing gradient descent with batch norm

Assuming we are using mini-batch gradient descent:

for $t = 1, \ldots,$ number of mini-batches:

- compute forward propagation on $X^{\{t\}}$, using batch norm in each hidden layer to replace $z^{[l]}$ with $\tilde{z}^{[l]}$
- use backpropagation to compute $dW^{[l]}$, $d\beta^{[l]}$, and $d\gamma^{[l]}$
- update the parameters: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $\beta^{[l]} := \beta^{[l]} - \alpha\, d\beta^{[l]}$, $\gamma^{[l]} := \gamma^{[l]} - \alpha\, d\gamma^{[l]}$

This also works with gradient descent with momentum, RMSprop, or Adam: instead of the plain gradient descent update on each mini-batch, you use the updates given by those algorithms.
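As an end-to-end sketch (assuming TensorFlow/Keras and made-up data; the layer sizes and dataset are illustrative only), the per-mini-batch updates of $W^{[l]}$, $\beta^{[l]}$, and $\gamma^{[l]}$ are handled by whichever optimizer you pick:

```python
import numpy as np
import tensorflow as tf

# Made-up dataset: 1000 examples, 20 features, 3 classes.
x = np.random.randn(1000, 20).astype("float32")
y = np.random.randint(0, 3, size=1000)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    # Linear step without a bias: batch norm's beta takes its place.
    tf.keras.layers.Dense(64, use_bias=False),
    # Batch norm learns gamma (scale) and beta (offset) per unit.
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Any update rule works here: plain SGD, momentum, RMSprop, or Adam.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, batch_size=64, epochs=2)
```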