Positional Encoding in Transformers
Deep Dive: Positional Encoding
The Problem: Transformers are Order-Blind (1/10)
Let's talk Transformers! At their core is the Self-Attention mechanism. It's powerful because it lets every word in a sentence look at every other word, all at once. 🚀
But there's a catch: this process is "permutation-invariant." It sees words as a "bag of words." To a Transformer, "The cat sat on the mat" and "The mat sat on the cat" look the same without extra help. We need a way to tell it the word order!
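To make this concrete, here's a minimal sketch (using `torch.nn.MultiheadAttention` as a stand-in for a Transformer's self-attention layer, with made-up sizes): if we reverse the input order, the outputs are just reversed copies of the originals, so no token's representation depends on where it sits.

```python
import torch

torch.manual_seed(0)
attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 5, 8)              # a "sentence" of 5 token embeddings
perm = torch.tensor([4, 3, 2, 1, 0])  # reverse the word order
x_perm = x[:, perm, :]

out, _ = attn(x, x, x)                      # plain self-attention, no positional info
out_perm, _ = attn(x_perm, x_perm, x_perm)  # same layer, reversed input

# Each token's output is identical either way; only the ordering moves with it.
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-6))  # True
```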
The Solution: Positional Encoding (PE) (2/10)
Enter Positional Encoding. It's a clever trick to inject information about a token's position into the model. We create a special "position vector" for each spot in the sequence (1st, 2nd, 3rd, etc.). This vector is then added directly to the token's Word Embedding.
Final Input = Word Embedding + Positional Encoding. Now, the vector for "cat" at position 2 is different from "cat" at position 5. Problem solved! ✅
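A tiny sketch of that sum (with made-up token ids and random stand-in position vectors, purely for illustration): the raw embedding for a repeated token is identical everywhere, but once positions are added, the inputs differ.

```python
import torch

torch.manual_seed(0)
vocab_size, d_model, seq_len = 100, 8, 6
embed = torch.nn.Embedding(vocab_size, d_model)
pos_enc = torch.randn(seq_len, d_model)    # stand-in position vectors (any PE scheme works here)

tokens = torch.tensor([7, 3, 7, 5, 7, 2])  # token id 7 (our "cat") appears at positions 0, 2, 4
x = embed(tokens) + pos_enc                # Final Input = Word Embedding + Positional Encoding

print(torch.equal(embed(tokens)[0], embed(tokens)[2]))  # True  -- same word, same embedding
print(torch.equal(x[0], x[2]))                          # False -- different positions, different inputs
```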
The Math: Sinusoidal Functions (3/10)
How do we create these position vectors? The "Attention is All You Need" paper uses a brilliant method with sine and cosine waves of different frequencies. For a token at position $pos$ and dimension $i$ in the embedding vector, the formulas are:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
Here, $d_{\text{model}}$ is the embedding dimension. Even dimensions get the sine wave, odd dimensions get the cosine wave.
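As a quick sanity check, here's what those formulas give for a toy case ($d_{\text{model}} = 4$, position 1; the sizes are made up just to keep the arithmetic readable):

```python
import math

d_model, pos = 4, 1
pe = []
for i in range(d_model // 2):                      # i indexes the sin/cos pair
    angle = pos / (10000 ** (2 * i / d_model))
    pe.extend([math.sin(angle), math.cos(angle)])  # even dim -> sin, odd dim -> cos

print([round(v, 4) for v in pe])  # ~[0.8415, 0.5403, 0.01, 1.0]
```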
Why This Formula? The Magic of Relative Position (4/10)
This isn't random! This sinusoidal choice has a killer feature: it makes it easy for the model to learn Relative Positions. Why? Because for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear transformation of $PE_{pos}$.
This means the relationship between positions 5 and 7 is the same as between positions 12 and 14. The attention mechanism can easily learn to focus on "the word 4 positions ahead." This is HUGE for understanding grammar and context.
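Here's the one-line reason, straight from the angle-addition identities. Write $\omega_i = 1/10000^{2i/d_{\text{model}}}$ for the frequency of dimension pair $i$; then

$$ \begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix} $$

The matrix depends only on the offset $k$, not on $pos$, which is exactly the linear relationship the attention layers can exploit.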
Visualizing the Encodings (5/10)
Imagine the positional encoding matrix. Each row is a position in the sequence, and each column is a dimension in the embedding. The sine/cosine functions create a unique, wave-like pattern. Early dimensions have high-frequency waves (changing rapidly from one position to the next), while later dimensions have low-frequency waves (changing slowly). This gives each position a unique fingerprint.
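If you want to see the fingerprint yourself, here's a small sketch using NumPy and matplotlib (the sizes are arbitrary, chosen just to make the pattern visible):

```python
import numpy as np
import matplotlib.pyplot as plt

max_len, d_model = 100, 64
pos = np.arange(max_len)[:, None]          # rows: positions
two_i = np.arange(0, d_model, 2)[None, :]  # columns: even dimension indices 2i
angles = pos / (10000 ** (two_i / d_model))

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

plt.imshow(pe, aspect='auto', cmap='viridis')
plt.xlabel('embedding dimension')
plt.ylabel('position in sequence')
plt.title('Sinusoidal positional encodings')
plt.colorbar()
plt.show()
```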
Implementation in Code (6/10)
Let's see what this looks like in Python using PyTorch. We pre-compute the entire PE matrix and just add it to our input embeddings. It's surprisingly straightforward!
```python
import torch
import math


class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        # Pre-compute the encoding matrix for every position up to max_len.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model) for each even dimension index 2i.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions get sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions get cosine
        pe = pe.unsqueeze(0).transpose(0, 1)  # shape: [max_len, 1, d_model]
        # Register as a buffer: saved with the model, but not a trainable parameter.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x is shape: [seq_len, batch_size, d_model]
        x = x + self.pe[:x.size(0), :]
        return x
```
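And a quick smoke test with made-up sizes (note the module above expects sequence-first input, [seq_len, batch_size, d_model]):

```python
d_model, seq_len, batch_size = 16, 10, 2
pos_encoder = PositionalEncoding(d_model)

embeddings = torch.randn(seq_len, batch_size, d_model)  # stand-in word embeddings
out = pos_encoder(embeddings)
print(out.shape)  # torch.Size([10, 2, 16])
```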
The Role of `d_model` (7/10)
The Embedding Dimension ($d_{\text{model}}$) is critical. It's the size of your word embedding vectors and, therefore, your positional encoding vectors. A larger $d_{\text{model}}$ (e.g., 512, 768) gives the model more "space" to encode both the word's meaning and its position without them interfering with each other.
Why Addition and Not Concatenation? (8/10)
A great question! We add the PE to the word embedding. Why not concatenate? Addition keeps the input at size $d_{\text{model}}$, so the layers downstream don't need to grow, which makes it more parameter-efficient. And because the word embeddings are learned alongside the positional signal, the model can flexibly decide how much of the vector to devote to meaning vs. position. Concatenation would create a larger vector and might force a more rigid separation between the two types of information early on.
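A two-line illustration of the size difference (shapes only, with arbitrary numbers):

```python
import torch

seq_len, d_model = 6, 8
tok = torch.randn(seq_len, d_model)  # token embeddings
pos = torch.randn(seq_len, d_model)  # positional vectors

print((tok + pos).shape)                    # torch.Size([6, 8])  -- stays d_model wide
print(torch.cat([tok, pos], dim=-1).shape)  # torch.Size([6, 16]) -- downstream layers must grow
```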
Fixed vs. Learned Positional Encoding (9/10)
The sinusoidal method is a type of Fixed Positional Encoding – the values are pre-calculated and don't change during training. An alternative is Learned Positional Encoding (used in models like BERT and GPT). Here, the position vectors are initialized randomly and updated during training just like any other model parameter.
- Fixed (Sinusoidal): Can generalize to longer sequences. No parameters needed.
- Learned: May adapt better to the specific task/data, but requires more parameters and might not generalize as well to unseen sequence lengths.
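For comparison, here's a minimal sketch of the learned variant, assuming a trainable lookup table of position vectors (the class name and sizes are made up for illustration):

```python
import torch

class LearnedPositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        # One trainable vector per position, updated by backprop like any other weight.
        self.pos_embed = torch.nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: [seq_len, batch_size, d_model], same convention as the sinusoidal module above.
        positions = torch.arange(x.size(0), device=x.device)
        return x + self.pos_embed(positions).unsqueeze(1)
```

Note that this version simply can't handle a sequence longer than the max_len it was built with, which is exactly the generalization trade-off listed above.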
Conclusion: The Unsung Hero (10/10)
And that's Positional Encoding! It's a simple but brilliant solution to a fundamental problem in the Transformer architecture. By cleverly using sine and cosine waves, it gives the model a sense of order, unlocking its ability to understand complex sequences. It's truly one of the unsung heroes of the NLP revolution! 🏆