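Let $t = e^{z}$ (the element-wise exponential of the final layer's input $z$),
then $a_i = \frac{t_i}{\sum_{j=1}^{C} t_j}$ for each class $i$.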
Softmax regression generalizes logistic regression to C classes. If C = 2, softmax regression reduces to logistic regression.
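A minimal sketch of the softmax activation, assuming the usual definition $a_i = e^{z_i} / \sum_j e^{z_j}$. The scores passed in are hypothetical, and the `max` shift is a common numerical-stability trick rather than something stated in these notes. The last lines illustrate the C = 2 case: the first softmax output equals the logistic sigmoid of the score difference.

```python
import math

def softmax(z):
    """Softmax of a list of scores z (one score per class)."""
    # Subtracting max(z) does not change the result but avoids overflow in exp.
    shift = max(z)
    t = [math.exp(v - shift) for v in z]
    total = sum(t)
    return [v / total for v in t]

def sigmoid(x):
    """Logistic function used by binary logistic regression."""
    return 1.0 / (1.0 + math.exp(-x))

# Four classes with hypothetical scores: the outputs form a probability vector.
a = softmax([5.0, 2.0, -1.0, 3.0])

# C = 2: softmax over (z1, z2) gives sigmoid(z1 - z2) for the first class.
two_class = softmax([1.5, -0.5])
logistic = sigmoid(1.5 - (-0.5))
```

Here `sum(a)` is 1 up to rounding, and `two_class[0]` agrees with `logistic`, which is the sense in which softmax regression with two classes is logistic regression.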
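Let $y$ be the one-hot label of an example with 4 classes and $\hat{y}$ the softmax output, where the true class $c$ (the entry with $y_c = 1$) receives $\hat{y}_c = 0.2$.
This is not a good prediction, since the probability assigned to the true class is only 0.2.
For this example, $y_c = 1$ and $y_j = 0$ for $j \neq c$, so the loss $L(\hat{y}, y) = -\sum_{j} y_j \log \hat{y}_j$ reduces to $-\log \hat{y}_c$, and $\hat{y}_c$ needs to be as big as possible to make $L$ small. This is reasonable, as $\hat{y}_c$ has to be close to 1. This is a form of maximum likelihood estimation.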
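The loss for a single example can be sketched as follows, assuming the cross-entropy form $L(\hat{y}, y) = -\sum_j y_j \log \hat{y}_j$. Only the value 0.2 for the true class comes from the example in the text; the position of the true class and the other entries of both prediction vectors are hypothetical, chosen only so that each sums to 1.

```python
import math

def cross_entropy(y_hat, y):
    """Loss -sum_j y_j * log(y_hat_j) for one example with one-hot label y."""
    # Terms with y_j == 0 contribute nothing, so they are skipped;
    # this also avoids log(0) for classes predicted with probability 0.
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj != 0)

y = [0, 1, 0, 0]                  # one-hot label (true class position is assumed)

poor_pred = [0.3, 0.2, 0.1, 0.4]  # true class gets 0.2 (other entries hypothetical)
good_pred = [0.05, 0.9, 0.03, 0.02]  # hypothetical prediction close to the label

poor = cross_entropy(poor_pred, y)  # equals -log(0.2), about 1.609
good = cross_entropy(good_pred, y)  # equals -log(0.9), about 0.105
```

The loss depends only on the probability assigned to the true class, and it shrinks toward 0 as that probability approaches 1.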
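$Y$, the matrix whose columns are the labels of the $m$ training examples, has dimension (4, m).
$\hat{Y}$, the matrix of the corresponding predictions, also has dimension (4, m).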
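To make the (4, m) layout concrete, here is a small sketch with m = 3 hypothetical examples stored one per column. It assumes the cost is the average of the per-example cross-entropy losses, which is the usual convention but is not spelled out above. The helper names `shape` and `cost` are illustrative only.

```python
import math

def shape(M):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(M), len(M[0]))

def cost(Y_hat, Y):
    """Average cross-entropy over the m columns (examples) of Y and Y_hat."""
    C, m = shape(Y)
    total = 0.0
    for i in range(m):          # loop over examples (columns)
        for j in range(C):      # loop over classes (rows)
            if Y[j][i] != 0:    # only the true class contributes
                total -= Y[j][i] * math.log(Y_hat[j][i])
    return total / m

# Labels: each column is a one-hot vector over 4 classes (hypothetical data).
Y = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
]
# Predictions: each column is a probability vector (hypothetical data).
Y_hat = [
    [0.7, 0.3, 0.1],
    [0.1, 0.2, 0.1],
    [0.1, 0.1, 0.2],
    [0.1, 0.4, 0.6],
]

J = cost(Y_hat, Y)
```

Both matrices have shape (4, 3) here, and `J` averages $-\log 0.7$, $-\log 0.2$ and $-\log 0.6$.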
Backprop: $dz = \frac{\partial L}{\partial z} = \hat{y} - y$
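The gradient used in backprop can be checked numerically. The sketch below assumes the standard result for softmax combined with cross-entropy loss, namely that the gradient with respect to the pre-softmax scores is $\hat{y} - y$, and compares it with a central finite difference. The scores `z`, the label `y`, and the step `eps` are hypothetical values chosen for illustration.

```python
import math

def softmax(z):
    """Softmax of a list of scores, shifted by max(z) for stability."""
    shift = max(z)
    t = [math.exp(v - shift) for v in z]
    total = sum(t)
    return [v / total for v in t]

def loss(z, y):
    """Cross-entropy of softmax(z) against the one-hot label y."""
    a = softmax(z)
    return -sum(yj * math.log(aj) for yj, aj in zip(y, a) if yj != 0)

z = [0.5, -1.0, 2.0, 0.1]   # hypothetical pre-softmax scores
y = [0, 1, 0, 0]            # hypothetical one-hot label

# Analytic gradient: dz = y_hat - y
analytic = [aj - yj for aj, yj in zip(softmax(z), y)]

# Numerical gradient: central difference in each coordinate of z
eps = 1e-6
numeric = []
for k in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[k] += eps
    z_minus[k] -= eps
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * eps))

max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the formula holds, `max_gap` is tiny compared with the gradient entries. The entries of `analytic` also sum to 0, because both $\hat{y}$ and $y$ sum to 1.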