Training a softmax classifier

Understanding Softmax

Let

        t = e^{z^[L]}   (element-wise, where z^[L] is the (C, 1) output of the final linear layer)

then

        a^[L]_i = t_i / (Σ_{j=1}^{C} t_j)

so the output a^[L] is a (C, 1) vector of positive values that sum to 1.
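As a sketch, the softmax activation can be implemented in a few lines of numpy (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Softmax activation: exponentiate, then normalize to sum to 1.
    Subtracting max(z) first is a standard numerical-stability trick;
    it does not change the result."""
    t = np.exp(z - np.max(z))      # t = e^{z^[L]} (shifted for stability)
    return t / np.sum(t)           # a^[L]_i = t_i / sum_j t_j

z = np.array([5.0, 2.0, -1.0, 3.0])   # hypothetical logits for C = 4 classes
a = softmax(z)
print(np.round(a, 3))                 # → [0.842 0.042 0.002 0.114]
```

The largest logit (5) gets the largest probability, but unlike a "hard max" the other classes keep nonzero probability.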

Softmax regression generalizes logistic regression to C classes. If C = 2, softmax regression reduces to logistic regression.
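To see the C = 2 reduction concretely, here is a small numpy check that a two-class softmax gives the same probability as a sigmoid of the logit difference (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# softmax([z1, z2])_1 = e^z1 / (e^z1 + e^z2) = 1 / (1 + e^-(z1 - z2))
z1, z2 = 1.7, -0.4                         # arbitrary example logits
p_softmax = softmax(np.array([z1, z2]))[0]
p_sigmoid = sigmoid(z1 - z2)
assert abs(p_softmax - p_sigmoid) < 1e-12
```

Only the difference of the two logits matters, which is exactly the single-logit picture of logistic regression.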

How would you train a neural network with a softmax output layer?

Loss function

Let

        y = [0, 1, 0, 0]^T   and   ŷ = a^[L] = [0.3, 0.2, 0.1, 0.4]^T

The loss on a single example is

        L(ŷ, y) = -Σ_{j=1}^{4} y_j log ŷ_j

This is not a good prediction, since the probability assigned to the true class (y_2 = 1) is only 0.2.

With the given example, y_1 = y_3 = y_4 = 0 and y_2 = 1, so the loss reduces to L(ŷ, y) = -log ŷ_2, and ŷ_2 needs to be as big as possible to make the loss small. This is reasonable, as ŷ_2 has to be close to 1 for the true class. This is a form of maximum likelihood estimation.
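A minimal numpy sketch of this loss; the prediction vector is a hypothetical example in which the true class gets probability 0.2:

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j).
    With a one-hot y this picks out -log(y_hat) of the true class."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0, 0.0])         # true class is class 2 (one-hot)
y_hat = np.array([0.3, 0.2, 0.1, 0.4])     # hypothetical poor prediction
print(cross_entropy_loss(y_hat, y))        # -log(0.2) ≈ 1.609
```

Because y is one-hot, the sum collapses to a single term, matching the -log ŷ_2 simplification above.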

Cost function

The cost on the training set is the average loss:

        J = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i))

Stacking the labels column-wise, Y = [y^(1), ..., y^(m)] has dimension (4, m).

Ŷ = [ŷ^(1), ..., ŷ^(m)] also has dimension (4, m).
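A vectorized sketch of the cost over the (4, m) matrices, assuming Y stacks the one-hot labels column-wise (the numbers are made up):

```python
import numpy as np

def cost(Y_hat, Y):
    """J = (1/m) * sum_i L(y_hat^(i), y^(i)), vectorized over all m columns."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

# m = 2 examples, C = 4 classes; each column is one example.
Y = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])                  # shape (4, m)
Y_hat = np.array([[0.3, 0.7],
                  [0.2, 0.1],
                  [0.1, 0.1],
                  [0.4, 0.1]])              # shape (4, m), columns sum to 1
print(cost(Y_hat, Y))                       # (-log 0.2 - log 0.7) / 2 ≈ 0.983
```

The element-wise product Y * log(Ŷ) zeroes out every entry except the true class in each column, so one `np.sum` computes all m losses at once.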

How do you implement gradient descent with softmax?

Backprop: the key equation is

        dz^[L] = ŷ - y

which has dimension (4, 1); from there, backpropagation through the earlier layers proceeds as usual.
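The backprop equation dz^[L] = ŷ - y can be checked numerically against a finite-difference gradient of the cross-entropy loss (the logit and label values here are made up):

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def loss(z, y):
    """Cross-entropy loss as a function of the logits z."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([5.0, 2.0, -1.0, 3.0])    # hypothetical logits
y = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot label
analytic = softmax(z) - y              # claimed gradient: dz^[L] = y_hat - y

# Numerical gradient via central finite differences.
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (loss(zp, y) - loss(zm, y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```

In practice, deep-learning frameworks compute this derivative automatically, but the simple form ŷ - y is why softmax pairs so naturally with cross-entropy.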