Backpropagation through softmax layer

Have you ever wondered how we can backpropagate the gradient through a softmax layer? If you were to google it, you would find lots of articles (such as this one, which helped me a lot), but most of them prove the formula for the softmax’s derivative and then jump straight to backpropagating the cross-entropy loss through the softmax layer. And while normalizing the network’s output before computing the classification loss is the most common use of softmax, those formulas have little to do with backpropagation through the softmax layer itself; they really describe backpropagation through the cross-entropy loss.
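To make the distinction concrete (standard notation, not necessarily the article’s): for the softmax $s_i = e^{z_i} / \sum_k e^{z_k}$, backpropagating through the layer itself means applying the full Jacobian,

$$ \frac{\partial s_i}{\partial z_j} = s_i(\delta_{ij} - s_j), \qquad \frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial s_i}\, s_i(\delta_{ij} - s_j), $$

whereas only for the cross-entropy loss $L = -\sum_i y_i \log s_i$ with a one-hot target $y$ does this collapse to the familiar $\frac{\partial L}{\partial z} = s - y$.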

Read more →

Confusing detail about chain rule in linear layer backpropagation

Have you ever wondered how come the gradients for a linear layer $Y = XW$ have these weird formulas?

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T,\ \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y} $$

I mean, the first one is easy: we just apply the chain rule, et voilà:

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial X} = \frac{\partial L}{\partial Y} W^T $$

Read more →
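As a quick sanity check (a minimal NumPy sketch of mine, not code from the post), both formulas can be verified against central-difference numerical gradients for an arbitrary scalar loss of $Y = XW$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # batch of 4 inputs with 3 features
W = rng.standard_normal((3, 5))   # weights mapping 3 features to 5

def loss(X, W):
    # any scalar loss of Y = XW will do; here, half the sum of squares
    Y = X @ W
    return 0.5 * np.sum(Y ** 2)

# for this particular loss the upstream gradient dL/dY is just Y itself
dL_dY = X @ W

# the two formulas in question
dL_dX = dL_dY @ W.T
dL_dW = X.T @ dL_dY

def numerical_grad(f, A, eps=1e-5):
    # central differences, one entry of A at a time
    g = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        Ap, Am = A.copy(), A.copy()
        Ap[idx] += eps
        Am[idx] -= eps
        g[idx] = (f(Ap) - f(Am)) / (2 * eps)
    return g

print(np.allclose(dL_dX, numerical_grad(lambda A: loss(A, W), X), atol=1e-6))  # True
print(np.allclose(dL_dW, numerical_grad(lambda A: loss(X, A), W), atol=1e-6))  # True
```

The transposes are exactly what the shapes demand: $\frac{\partial L}{\partial Y}$ has the shape of $Y$ (here $4 \times 5$), so multiplying by $W^T$ ($5 \times 3$) recovers the $4 \times 3$ shape of $X$, and $X^T \frac{\partial L}{\partial Y}$ recovers the $3 \times 5$ shape of $W$.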