Diffusion models

In this article I want to tell you about diffusion models, an actively developing approach to image generation. Recent research shows that this paradigm can generate images whose quality is on par with, or even exceeds, that of the best GANs. Moreover, the design of such models lets them overcome two major weaknesses of GANs: mode collapse and sensitivity to hyperparameter choice. However, the same design that makes diffusion models so powerful also makes them considerably slower at inference.


Table taken from Aran Komatsuzaki’s blog post.

Read more →

Understanding positional encoding

The transformer model introduced in Attention Is All You Need uses positional encoding to enrich token embeddings with positional information. The authors note that there are several possible implementations of positional encoding, one of the most obvious being a trainable embedding layer. However, this approach has drawbacks, such as the model’s inability to handle sequences longer than those seen during training. Hence, the authors search for alternative methods and settle on the following:

$$ PE_{(pos,2i)} = \sin \left( \frac{1}{10000^{2i/d_{\text{model}}}} pos \right) = \sin(\omega_i \cdot pos) $$

$$ PE_{(pos,2i+1)} = \cos \left( \frac{1}{10000^{2i/d_{\text{model}}}} pos \right) = \cos(\omega_i \cdot pos) $$

Read more →
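As a quick sanity check, the encoding above can be computed directly in NumPy (a sketch of my own, not code from the paper):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    pos = np.arange(max_len)[:, None]            # positions, shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # dimension index i, shape (1, d_model/2)
    omega = 1.0 / 10000 ** (2 * i / d_model)     # frequencies ω_i
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(omega * pos)            # even dimensions get sin
    pe[:, 1::2] = np.cos(omega * pos)            # odd dimensions get cos
    return pe

pe = positional_encoding(50, 16)
```

Note that at `pos = 0` every sin-dimension is 0 and every cos-dimension is 1, which is a handy way to verify the even/odd interleaving.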

Normalizing flows in simple words

Suppose we have a sample of objects $X = \{x_i\}_{i=1}^n$ that come from an unknown distribution $p_x(x)$ and we want our model to learn this distribution. What do I mean by learning a distribution? There are many ways to define such a task, but data scientists mostly settle on two things:

  1. learning to score the objects’ probability, i.e. learning the probability density function $p_x(x)$, and/or
  2. learning to sample from this unknown distribution, which implies the ability to sample new, unseen objects.

Does this description ring a bell? Yes, I’m talking precisely about generative models!
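These two tasks can be illustrated with a toy example of my own: fitting a plain Gaussian (not an actual normalizing flow) to a sample and then using it for both density evaluation and sampling:

```python
import numpy as np
from scipy import stats

# toy example: the "unknown" p_x is a Gaussian we pretend not to know
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=1000)

# fit the model (maximum-likelihood estimates for a Gaussian)
mu, sigma = X.mean(), X.std()
model = stats.norm(mu, sigma)

density = model.pdf(2.0)                      # task 1: score an object's density
new_samples = model.rvs(5, random_state=rng)  # task 2: sample new, unseen objects
```

A normalizing flow does exactly the same two things, but for far more complicated distributions than a Gaussian can represent.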

Read more →

Backpropagation through softmax layer

Have you ever wondered how we can backpropagate the gradient through a softmax layer? If you were to google it, you would find lots of articles (such as this one, which helped me a lot), but most of them prove the formula for the softmax’s derivative and then jump straight to the backpropagation of cross-entropy loss through the softmax layer. And while normalizing the network’s output before computing the classification loss is the most common use of softmax, those formulas describe backpropagation through the cross-entropy loss rather than through the softmax layer itself.
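To make the distinction concrete, here is a sketch of my own that backpropagates an arbitrary upstream gradient through softmax alone, using the Jacobian $J = \mathrm{diag}(s) - ss^T$ (which is symmetric, so $J^T g = Jg$):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def softmax_backward(s, grad_out):
    # J @ grad_out with J = diag(s) - s s^T simplifies to:
    return s * (grad_out - np.dot(grad_out, s))

x = np.array([1.0, 2.0, 3.0])
s = softmax(x)
g = np.array([1.0, 0.0, 0.0])   # some upstream gradient, not tied to any loss
dx = softmax_backward(s, g)
```

One useful property to check: `dx` always sums to zero, because softmax is invariant to adding a constant to its input.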

Read more →

Confusing detail about chain rule in linear layer backpropagation

Have you ever wondered how come the gradients for a linear layer $Y = XW$ have these weird-looking formulas?

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T,\ \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y} $$

I mean, the first one is easy: we just apply the chain rule, et voilà:

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial X} = \frac{\partial L}{\partial Y} W^T $$

Read more →
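Both formulas are easy to verify numerically. A minimal NumPy sketch of my own, using a toy sum loss so that $\partial L/\partial Y$ is a matrix of ones:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))
Y = X @ W
L = Y.sum()                      # toy scalar loss, so dL/dY is all ones
dY = np.ones_like(Y)

dX = dY @ W.T                    # dL/dX = dL/dY · W^T
dW = X.T @ dY                    # dL/dW = X^T · dL/dY

# finite-difference check on one entry of X
eps = 1e-6
X2 = X.copy()
X2[0, 0] += eps
num = ((X2 @ W).sum() - L) / eps
```

The numerical derivative `num` should match `dX[0, 0]`, and the gradient shapes match `X` and `W` respectively, which is a quick way to catch a misplaced transpose.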