Understanding positional encoding

The transformer model introduced in Attention Is All You Need uses positional encoding to enrich token embeddings with positional information. The authors note that there are several possible implementations of positional encoding, one of the most obvious being a trainable embedding layer. However, this approach has drawbacks, such as the model's inability to handle sequences longer than those seen during training. Hence, the authors look for alternative methods and settle on the following:

$$ PE_{(pos,2i)} = \sin \left( \frac{1}{10000^{2i/d_{\text{model}}}} pos \right) = \sin(\omega_i \cdot pos) $$

$$ PE_{(pos,2i+1)} = \cos \left( \frac{1}{10000^{2i/d_{\text{model}}}} pos \right) = \cos(\omega_i \cdot pos) $$
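To make the two formulas concrete, here is a minimal sketch in plain Python that fills a table of encodings row by row. The function name `positional_encoding` and the parameters `max_len` and `d_model` are chosen here for illustration and are not taken from the paper's code; the sketch also assumes an even `d_model` so that every sine has a matching cosine.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model list of lists of sinusoidal encodings."""
    if d_model % 2 != 0:
        # Assumption of this sketch: sine/cosine dimensions come in pairs.
        raise ValueError("this sketch assumes an even d_model")

    # omega_i = 1 / 10000^(2i / d_model), one frequency per sine/cosine pair.
    omegas = [1.0 / (10000.0 ** (2 * i / d_model)) for i in range(d_model // 2)]

    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i, omega in enumerate(omegas):
            row[2 * i] = math.sin(omega * pos)      # even dimension 2i
            row[2 * i + 1] = math.cos(omega * pos)  # odd dimension 2i + 1
        table.append(row)
    return table


pe = positional_encoding(max_len=50, d_model=8)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Because nothing in the table is learned, `max_len` can be set to any value at inference time, which is the property the trainable embedding layer lacks.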