Mini-Batch Gradient Descent

Batch vs. mini-batch gradient descent

Vectorization allows you to efficiently compute on all $m$ examples at once.

What if $m = 5{,}000{,}000$? You can get a faster algorithm if you let gradient descent start making progress even before you finish processing the entire training set of 5 million examples.
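As a reminder of the vectorized setup this builds on, here is a minimal sketch; the concrete sizes `n_x` and `m` are illustrative assumptions, not values from the notes:

```python
import numpy as np

# Illustrative shapes: n_x input features, m training examples (assumed values).
n_x, m = 20, 10_000

# Vectorization stacks the examples as columns, so one pass of forward/backward
# prop touches all m examples before a single gradient descent step is taken.
X = np.random.randn(n_x, m)               # X = [x(1) x(2) ... x(m)], shape (n_x, m)
Y = np.random.randint(0, 2, size=(1, m))  # Y = [y(1) y(2) ... y(m)], shape (1, m)
print(X.shape, Y.shape)                   # (20, 10000) (1, 10000)
```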

Mini-batches: split the full training set into smaller training sets (mini-batches) of 1,000 examples each.

$x^{(1)}, x^{(2)}, \ldots, x^{(1000)} \rightarrow X^{\{1\}}$; $x^{(1001)}, \ldots, x^{(2000)} \rightarrow X^{\{2\}}$, etc. You would get 5,000 mini-batches $X^{\{1\}}, \ldots, X^{\{5000\}}$.

We can repeat this for $Y$: $y^{(1)}, \ldots, y^{(1000)} \rightarrow Y^{\{1\}}$, etc. You would get $Y^{\{1\}}, \ldots, Y^{\{5000\}}$.

So, mini-batch $t$ is the pair $X^{\{t\}}, Y^{\{t\}}$.
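A minimal sketch of how this partition might look in code; the shuffling step and the helper name `partition_mini_batches` are assumptions for illustration, not from the notes:

```python
import numpy as np

def partition_mini_batches(X, Y, batch_size=1000, seed=0):
    """Split (X, Y) column-wise into mini-batches (X{t}, Y{t})."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)          # shuffle examples (common in practice)
    X, Y = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        mini_batches.append((X[:, start:end], Y[:, start:end]))
    return mini_batches

# Example: 5,000 examples split into mini-batches of 1,000 -> 5 mini-batches.
X = np.random.randn(10, 5000)
Y = np.random.randint(0, 2, size=(1, 5000))
batches = partition_mini_batches(X, Y, batch_size=1000)
print(len(batches), batches[0][0].shape)   # 5 (10, 1000)
```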

Mini-batch gradient descent

for $t = 1, \ldots, 5000$:

  Forward prop on $X^{\{t\}}$:

  $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
  $A^{[1]} = g^{[1]}(Z^{[1]})$
  $\vdots$
  $A^{[L]} = g^{[L]}(Z^{[L]})$

  Then compute the cost on this mini-batch (summing over the examples in $X^{\{t\}}, Y^{\{t\}}$):
  $J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \lVert W^{[l]} \rVert_F^2$

  Backprop to compute gradients with respect to $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$), then update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$ and $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$.

Epoch: a single pass through the training set. With mini-batch gradient descent, one epoch of the 5-million-example set gives you 5,000 gradient descent steps, instead of just one with batch gradient descent.
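Putting the loop above into code, here is a hedged sketch of mini-batch gradient descent for a single-layer (logistic-regression-style) model rather than the L-layer network in the notes; the function and parameter names are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mini_batch_gd(X, Y, batch_size=1000, alpha=0.1, epochs=3):
    """Sketch: forward prop, cost, backprop, and an update per mini-batch X{t}, Y{t}."""
    n_x, m = X.shape
    W = np.zeros((1, n_x))
    b = 0.0
    costs = []                                   # one J{t} per mini-batch
    for epoch in range(epochs):                  # one epoch = one pass over the data
        for start in range(0, m, batch_size):
            Xt = X[:, start:start + batch_size]  # X{t}
            Yt = Y[:, start:start + batch_size]  # Y{t}
            mt = Xt.shape[1]

            # Forward prop on X{t}
            A = sigmoid(W @ Xt + b)

            # Cost J{t} on this mini-batch (cross-entropy, no regularization here)
            eps = 1e-8
            J = -np.mean(Yt * np.log(A + eps) + (1 - Yt) * np.log(1 - A + eps))
            costs.append(J)

            # Backprop and gradient descent update
            dZ = A - Yt
            dW = (dZ @ Xt.T) / mt
            db = np.sum(dZ) / mt
            W -= alpha * dW
            b -= alpha * db
    return W, b, costs

# Example usage on synthetic data (shapes only, not the course dataset).
X = np.random.randn(20, 5000)
Y = (np.random.rand(1, 5000) > 0.5).astype(float)
W, b, costs = mini_batch_gd(X, Y, batch_size=1000, alpha=0.1, epochs=3)
print(len(costs))   # 5 mini-batches per epoch * 3 epochs = 15 cost values
```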

Understanding mini-batch gradient descent

With batch gradient descent, on every iteration you go through the entire training set, and you'd expect the cost to go down on every single iteration.

With mini-batch gradient descent, though, if you plot progress on your cost function, it may not decrease on every iteration. If you plot $J^{\{t\}}$ over multiple epochs as you train with mini-batch gradient descent, you should see a curve that trends downward but is noisy, since each $J^{\{t\}}$ is computed on a different mini-batch.
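To visualize the kind of curve described above, you could plot the per-mini-batch costs from a training run; the values below are a synthetic stand-in for $J^{\{t\}}$, used only to illustrate the noisy-but-downward shape:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for J{t}: a downward trend plus mini-batch-to-mini-batch noise.
t = np.arange(500)
costs = 0.2 + 0.8 * np.exp(-t / 150) + 0.03 * np.random.default_rng(0).standard_normal(500)

plt.plot(costs)
plt.xlabel("mini-batch number t")
plt.ylabel("cost J{t}")
plt.title("Mini-batch gradient descent: cost is noisy but trends downward")
plt.show()
```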

Choosing mini-batch size



  1. If you have a small training set (m < 2,000): use batch gradient descent.
  2. If you have a bigger training set, typical mini-batch sizes are powers of 2: 64, 128, 256, or 512.
  3. Make sure a single mini-batch fits in CPU/GPU memory.
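A small helper capturing the rule of thumb in the list above; the function name and the default choice of 256 are illustrative assumptions, not a hard rule:

```python
def choose_mini_batch_size(m):
    """Hedged rule of thumb for mini-batch size, following the guidelines above."""
    if m < 2000:
        return m        # small training set: plain batch gradient descent
    # Bigger training set: pick a power of two (64, 128, 256, or 512), then
    # check that one mini-batch of that size fits in CPU/GPU memory.
    return 256

print(choose_mini_batch_size(1500))       # 1500 -> batch gradient descent
print(choose_mini_batch_size(5_000_000))  # 256
```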