Backpropagation through softmax layer
Have you ever wondered, how can we backpropagate the gradient through a softmax layer? If you were to google it, you would find lots of articles (such as this one, which helped me a lot), but most of them prove the formula of the softmax’s derivative and then jump straight to the backpropagation of cross-entropy loss through the softmax layer. And while normalizing the networks’ output before computing the classification loss is the most common use of softmax, those formulas have little to do with the actual backpropagation through the softmax layer itself, more like the backpropagation through the cross-entropy loss.
Read more →