A few informal notes about the importance of attention heads in the transformer model.
What is an Attention Head?
An attention head is an independent unit within the multi-head attention mechanism. It computes attention scores between tokens in the input sequence to determine how much focus (or "attention") each token should give to every other token.
Each head operates in parallel but learns different types of relationships. For example, one head might focus on syntactic dependencies (like subject-verb relationships), while another might capture long-range dependencies in the sentence.
The QKV Formula in Multi-Head Attention
In multi-head self-attention, each input token's embedding is projected into three vectors:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Where:
- $X$ is the input embedding matrix (shape: `[batch_size, seq_len, hidden_dim]`).
- $W_Q$, $W_K$, $W_V$ are learned weight matrices that transform $X$ into the Query ($Q$), Key ($K$), and Value ($V$) matrices.
- $Q$, $K$, $V$ all have shape `[batch_size, seq_len, head_dim]`.
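A minimal sketch of these projections (NumPy, with toy shapes and randomly initialized weights rather than GPT-2's learned parameters):

```python
import numpy as np

batch_size, seq_len, hidden_dim, head_dim = 2, 6, 768, 64

# Toy input embeddings and randomly initialized projection weights.
# Illustrative only -- a real model learns W_Q, W_K, W_V during training.
X = np.random.randn(batch_size, seq_len, hidden_dim)
W_Q = np.random.randn(hidden_dim, head_dim) * 0.02
W_K = np.random.randn(hidden_dim, head_dim) * 0.02
W_V = np.random.randn(hidden_dim, head_dim) * 0.02

Q = X @ W_Q   # (batch_size, seq_len, head_dim)
K = X @ W_K   # (batch_size, seq_len, head_dim)
V = X @ W_V   # (batch_size, seq_len, head_dim)

print(Q.shape, K.shape, V.shape)  # (2, 6, 64) each
```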
Defining $Q$ and $K$
- Each token’s query vector determines what it is looking for in other tokens.
- It is compared against key vectors (which describe how relevant other tokens are).
- The resulting attention scores tell the model where to focus.
If a token’s query $q_i$ is similar to another token’s key $k_j$, then that token gets more attention.
Computing Attention
Self-attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- $QK^T$ - computes similarity between queries and keys (how much token $i$ should attend to token $j$).
- $\sqrt{d_k}$ - scales the scores by the key dimension so the softmax stays well-behaved.
- Softmax - converts similarities into probabilities (higher values → more focus).
- Multiply by $V$ - the final attended representation is a weighted sum of values.
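A minimal NumPy sketch of this formula (the helper names here are illustrative, not a library API; shapes follow the definitions above):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity between every query and every key: (batch, seq_len, seq_len)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of values

Q = np.random.randn(1, 5, 64)
K = np.random.randn(1, 5, 64)
V = np.random.randn(1, 5, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (1, 5, 64)
```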
Example
Let’s say we have a sentence:
"The cat sat on the mat."
If "cat" has a strong query for subject-verb relations, it might attend to "sat", because "sat" has a corresponding key that aligns well.
The output for "cat" will therefore contain more information from "sat" than from unrelated words.
Softmax
Converts a series of raw numbers into a probability distribution that adds to 1. This is simply explained with an example. If we have raw scores:

$$[2.0, \; 1.0, \; 0.1]$$

Softmax converts them into probabilities by using the scores as exponents of $e$. This ensures that the distribution is positive and sums to 1:

- Compute exponentials: $e^{2.0} \approx 7.389$, $e^{1.0} \approx 2.718$, $e^{0.1} \approx 1.105$
- Compute sum of exponentials: $7.389 + 2.718 + 1.105 \approx 11.212$
- Compute softmax probabilities:
  - softmax(2.0) = $7.389 / 11.212 \approx 0.659$
  - softmax(1.0) = $2.718 / 11.212 \approx 0.242$
  - softmax(0.1) = $1.105 / 11.212 \approx 0.099$

Now, the values sum to 1.0:

$$0.659 + 0.242 + 0.099 = 1.0$$
This means:
- The highest score (2.0) gets the most weight (65.9%).
- The lowest score (0.1) gets the least weight (9.9%).
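The same arithmetic in a few lines of NumPy (the scores are the example values above):

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])
exp_scores = np.exp(scores)            # [7.389, 2.718, 1.105]
probs = exp_scores / exp_scores.sum()  # [0.659, 0.242, 0.099]
print(probs, probs.sum())              # probabilities sum to 1.0
```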
Why Euler's Constant?
- $e^x$ grows exponentially, which helps in probability weighting.
- Derivatives of $e^x$ are easy to compute, which makes training deep learning models efficient.
- $e^x$ ensures all outputs are positive, which is crucial for normalizing probabilities.
- In the attention formula, the term $QK^T$ computes the similarity between the query and key vectors.
- The softmax normalizes these scores into probabilities.
- The result is a weighted sum of the values ($V$) based on attention scores.
- The derivative of $e^x$ is itself: $\frac{d}{dx}e^x = e^x$.
- The natural logarithm $\ln(x)$ is the inverse of $e^x$.
- Compound interest calculations use $e$ (continuous compounding).
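A quick numerical check of two of these properties (NumPy, illustrative only):

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
print(np.exp(x))  # all outputs are positive, even for negative inputs

# Finite-difference check that d/dx e^x = e^x (approximately)
h = 1e-6
print((np.exp(1.0 + h) - np.exp(1.0)) / h, np.exp(1.0))  # both ~2.71828
```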
What is Multi-Headed Attention?
Instead of using a single attention head, GPT-2 uses multiple heads (12 per layer in GPT-2 Small) to capture different types of information.
- Each head has its own independent set of Q, K, and V weight matrices.
- They all process the input sequence separately.
- Their outputs are then concatenated and projected back into the model's representation space.
Outputs of Each Attention Head are Merged using Concatenation + Projection
After each head computes its weighted sum of values, their outputs are merged in two steps:
Step 1: Concatenation
Each attention head produces an output of shape (batch_size, seq_len, head_dim).
For multiple heads, we concatenate all head outputs along the hidden dimension:

$$\text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_{12})$$

If the model has 12 heads and a hidden size of 768, then:
- Each head operates on a subspace of size 64 (since $768 / 12 = 64$).
- After concatenation, we get back to (batch_size, seq_len, 768).
Step 2: Linear Projection
To mix the information from all heads, the concatenated output is projected through a learned weight matrix:

$$\text{Output} = \text{Concat}(\text{head}_1, \dots, \text{head}_{12}) \, W_O$$

Where $W_O$ (the output projection matrix) has shape (hidden_size, hidden_size).
This ensures that:
- The individual head outputs are blended together.
- The final output remains the same dimensionality as the input.
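A minimal sketch of both merge steps, assuming 12 heads whose outputs have already been computed (random placeholders here) and a randomly initialized output projection standing in for the learned $W_O$:

```python
import numpy as np

batch_size, seq_len, num_heads, head_dim = 2, 6, 12, 64
hidden_size = num_heads * head_dim  # 768

# Placeholder per-head outputs; in a real model each comes from its own attention computation.
head_outputs = [np.random.randn(batch_size, seq_len, head_dim) for _ in range(num_heads)]

# Step 1: concatenate along the hidden dimension -> (batch_size, seq_len, 768)
concat = np.concatenate(head_outputs, axis=-1)

# Step 2: project through the output matrix W_O (random here for illustration)
W_O = np.random.randn(hidden_size, hidden_size) * 0.02
output = concat @ W_O

print(concat.shape, output.shape)  # (2, 6, 768) (2, 6, 768)
```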
What is Batch Size?
- Definition: The number of independent input sequences (examples) processed simultaneously in one forward pass of the model.
- Notation: `batch_size` (the first dimension in `[batch_size, seq_len, hidden_dim]`).
Why it matters:
- A larger batch size means faster training (better parallelization) but higher memory usage.
- A smaller batch size can help with generalization but slows training.
- Example: If batch size = 4, it means we are processing 4 sentences at the same time.
What is Sequence Length (seq_len)?
- Definition: The number of tokens in each input sequence.
- Notation: `seq_len` (the second dimension in `[batch_size, seq_len, hidden_dim]`).
Why it matters:
- Shorter sequences are faster but may lose context.
- Longer sequences retain more context but increase memory usage.
- Example: for "The cat sat on the mat.", if tokenized as `["The", "cat", "sat", "on", "the", "mat", "."]`, then `seq_len` = 7 (since there are 7 tokens).
- For GPT-2 Small, the max sequence length is 1024 tokens.
What is Hidden Size?
- Definition: The number of dimensions in the vector representation of each token.
- Notation: `hidden_size` (the last dimension in `[batch_size, seq_len, hidden_dim]`, also written `hidden_dim`).
Why it matters:
- Larger hidden sizes can store more information but increase compute cost.
- Smaller hidden sizes make models more efficient but less expressive.
- Example (GPT-2 Small): hidden size = 768, meaning each token is represented by a 768-dimensional vector at every layer.
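A toy tensor tying the three quantities together (illustrative shapes only, matching GPT-2 Small's hidden size):

```python
import numpy as np

batch_size = 4     # 4 sentences processed at the same time
seq_len = 7        # "The cat sat on the mat." -> 7 tokens
hidden_size = 768  # GPT-2 Small's embedding width

activations = np.zeros((batch_size, seq_len, hidden_size))
print(activations.shape)  # (4, 7, 768)
```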