Backpropagation through softmax layer

Have you ever wondered how we can backpropagate the gradient through a softmax layer? If you were to google it, you would find lots of articles (such as this one, which helped me a lot), but most of them prove the formula for the softmax’s derivative and then jump straight to backpropagating the cross-entropy loss through the softmax layer. And while normalizing the network’s output before computing the classification loss is the most common use of softmax, those formulas have little to do with backpropagation through the softmax layer itself; they really describe backpropagation through the cross-entropy loss.
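To make the distinction concrete (standard notation, not necessarily the article’s): for the softmax $s_i = e^{z_i} / \sum_k e^{z_k}$, backpropagating through the layer itself means applying the full Jacobian,

$$ \frac{\partial s_i}{\partial z_j} = s_i(\delta_{ij} - s_j), \qquad \frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial s_i}\, s_i(\delta_{ij} - s_j), $$

whereas only for the cross-entropy loss $L = -\sum_i y_i \log s_i$ with a one-hot target $y$ does this collapse to the familiar $\frac{\partial L}{\partial z} = s - y$.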

Read more →

Confusing detail about chain rule in linear layer backpropagation

Have you ever wondered how come the gradients for a linear layer $Y = XW$ have these weird formulas?

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T,\ \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y} $$

I mean, the first one is easy: we just apply the chain rule, et voilà:

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial X} = \frac{\partial L}{\partial Y} W^T $$

Read more →
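As a quick sanity check (a minimal NumPy sketch of mine, not code from the post), both formulas can be verified against central-difference numerical gradients for an arbitrary scalar loss of $Y = XW$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # batch of 4 inputs with 3 features
W = rng.standard_normal((3, 5))   # weights mapping 3 features to 5

def loss(X, W):
    # any scalar loss of Y = XW will do; here, half the sum of squares
    Y = X @ W
    return 0.5 * np.sum(Y ** 2)

# for this particular loss the upstream gradient dL/dY is just Y itself
dL_dY = X @ W

# the two formulas in question
dL_dX = dL_dY @ W.T
dL_dW = X.T @ dL_dY

def numerical_grad(f, A, eps=1e-5):
    # central differences, one entry of A at a time
    g = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        Ap, Am = A.copy(), A.copy()
        Ap[idx] += eps
        Am[idx] -= eps
        g[idx] = (f(Ap) - f(Am)) / (2 * eps)
    return g

print(np.allclose(dL_dX, numerical_grad(lambda A: loss(A, W), X), atol=1e-6))  # True
print(np.allclose(dL_dW, numerical_grad(lambda A: loss(X, A), W), atol=1e-6))  # True
```

The transposes are exactly what the shapes demand: $\frac{\partial L}{\partial Y}$ has the shape of $Y$ (here $4 \times 5$), so multiplying by $W^T$ ($5 \times 3$) recovers the $4 \times 3$ shape of $X$, and $X^T \frac{\partial L}{\partial Y}$ recovers the $3 \times 5$ shape of $W$.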