Training set with two input features
Normalizing involves two steps:
If unnormalized, cost function would look like below.
If normalized, cost function would look like more symetrical.
If you’re running gradient descent on the cost function on unnormalized, then you might have to use a very small learning rate because the gradient descent might need a lot of steps to oscillate back and forth before it finally finds its way to the minimum. Whereas if you have a more spherical contours in normalized, then wherever you start gradient descent can pretty much go straight to the minimum.