A few informal notes about the importance of attention heads in the transformer model.
What is an Attention Head?
An attention head is an independent unit within the multi-head attention mechanism. It computes attention scores between tokens in the input sequence to determine how much focus (or "attention") each token should give to every other token.
Each head operates in parallel but learns different types of relationships. For example, one head might focus on syntactic dependencies (like subject-verb relationships), while another might capture long-range dependencies in the sentence.
The QKV Formula in Multi-Head Attention
In multi-head self-attention, each input token's embedding is projected into three vectors:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Where:
- $X$ is the input embedding matrix (shape: `[batch_size, seq_len, hidden_dim]`).
- $W_Q$, $W_K$, $W_V$ are learned weight matrices that transform $X$ into the Query ($Q$), Key ($K$), and Value ($V$) matrices.
- $Q$, $K$, $V$ all have shape `[batch_size, seq_len, head_dim]`.
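A minimal sketch of these projections (NumPy, with toy shapes and randomly initialized weights rather than GPT-2's learned parameters):

```python
import numpy as np

batch_size, seq_len, hidden_dim, head_dim = 2, 6, 768, 64

# Toy input embeddings and randomly initialized projection weights.
# Illustrative only -- a real model learns W_Q, W_K, W_V during training.
X = np.random.randn(batch_size, seq_len, hidden_dim)
W_Q = np.random.randn(hidden_dim, head_dim) * 0.02
W_K = np.random.randn(hidden_dim, head_dim) * 0.02
W_V = np.random.randn(hidden_dim, head_dim) * 0.02

Q = X @ W_Q   # (batch_size, seq_len, head_dim)
K = X @ W_K   # (batch_size, seq_len, head_dim)
V = X @ W_V   # (batch_size, seq_len, head_dim)

print(Q.shape, K.shape, V.shape)  # (2, 6, 64) each
```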
Defining $Q$ and $K$
- Each token’s query vector determines what it is looking for in other tokens.
- It is compared against key vectors (which describe how relevant other tokens are).
- The resulting attention scores tell the model where to focus.
If a token’s query $q_i$ is similar to another token’s key $k_j$, then that token gets more attention.
Computing Attention
Self-attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- $QK^T$ - computes similarity between queries and keys (how much token $i$ should attend to token $j$).
- $\sqrt{d_k}$ - scales the scores by the key dimension so the softmax stays well-behaved.
- Softmax - converts similarities into probabilities (higher values → more focus).
- Multiply by $V$ - the final attended representation is a weighted sum of values.
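A minimal NumPy sketch of this formula (the helper names here are illustrative, not a library API; shapes follow the definitions above):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity between every query and every key: (batch, seq_len, seq_len)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of values

Q = np.random.randn(1, 5, 64)
K = np.random.randn(1, 5, 64)
V = np.random.randn(1, 5, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (1, 5, 64)
```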
Example
Let’s say we have a sentence:
"The cat sat on the mat."
If "cat" has a strong query for subject-verb relations, it might attend to "sat", because "sat" has a corresponding key that aligns well.
The output for "cat" will therefore contain more information from "sat" than from unrelated words.
Softmax
Converts a series of raw numbers into a probability distribution that adds to 1. This is simply explained with an example. If we have raw scores:

$$[2.0, \; 1.0, \; 0.1]$$

Softmax converts them into probabilities by using the scores as exponents of $e$. This ensures that the distribution is positive and sums to 1:

- Compute exponentials: $e^{2.0} \approx 7.389$, $e^{1.0} \approx 2.718$, $e^{0.1} \approx 1.105$
- Compute sum of exponentials: $7.389 + 2.718 + 1.105 \approx 11.212$
- Compute softmax probabilities:
  - softmax(2.0) = $7.389 / 11.212 \approx 0.659$
  - softmax(1.0) = $2.718 / 11.212 \approx 0.242$
  - softmax(0.1) = $1.105 / 11.212 \approx 0.099$

Now, the values sum to 1.0:

$$0.659 + 0.242 + 0.099 = 1.0$$
This means:
- The highest score (2.0) gets the most weight (65.9%).
- The lowest score (0.1) gets the least weight (9.9%).
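The same arithmetic in a few lines of NumPy (the scores are the example values above):

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])
exp_scores = np.exp(scores)            # [7.389, 2.718, 1.105]
probs = exp_scores / exp_scores.sum()  # [0.659, 0.242, 0.099]
print(probs, probs.sum())              # probabilities sum to 1.0
```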
Why Euler's Constant?
- $e^x$ grows exponentially, which helps in probability weighting.
- Derivatives of $e^x$ are easy to compute, which makes training deep learning models efficient.
- $e^x$ ensures all outputs are positive, which is crucial for normalizing probabilities.
- In the attention formula, the term $QK^T$ computes the similarity between the query and key vectors.
- The softmax normalizes these scores into probabilities.
- The result is a weighted sum of the values ($V$) based on attention scores.
- The derivative of $e^x$ is itself: $\frac{d}{dx}e^x = e^x$.
- The natural logarithm $\ln(x)$ is the inverse of $e^x$.
- Compound interest calculations use $e$ (continuous compounding).
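A quick numerical check of two of these properties (NumPy, illustrative only):

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
print(np.exp(x))  # all outputs are positive, even for negative inputs

# Finite-difference check that d/dx e^x = e^x (approximately)
h = 1e-6
print((np.exp(1.0 + h) - np.exp(1.0)) / h, np.exp(1.0))  # both ~2.71828
```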
What is Multi-Headed Attention?
Instead of using a single attention head, GPT-2 uses multiple heads (12 per layer in GPT-2 Small) to capture different types of information.
- Each head has its own independent set of Q, K, and V weight matrices.
- They all process the input sequence separately.
- Their outputs are then concatenated and projected back into the model's representation space.
Outputs of Each Attention Head are Merged using Concatenation + Projection
After each head computes its weighted sum of values, their outputs are merged in two steps:
Step 1: Concatenation
Each attention head produces an output of shape (batch_size, seq_len, head_dim).
For multiple heads, we concatenate all head outputs along the hidden dimension:

$$\text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_{12})$$

If the model has 12 heads and a hidden size of 768, then:
- Each head operates on a subspace of size 64 (since $768 / 12 = 64$).
- After concatenation, we get back to (batch_size, seq_len, 768).
Step 2: Linear Projection
To mix the information from all heads, the concatenated output is projected through a learned weight matrix:

$$\text{Output} = \text{Concat}(\text{head}_1, \dots, \text{head}_{12}) \, W_O$$

Where $W_O$ (the output projection matrix) has shape (hidden_size, hidden_size).
This ensures that:
- The individual head outputs are blended together.
- The final output remains the same dimensionality as the input.
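A minimal sketch of both merge steps, assuming 12 heads whose outputs have already been computed (random placeholders here) and a randomly initialized output projection standing in for the learned $W_O$:

```python
import numpy as np

batch_size, seq_len, num_heads, head_dim = 2, 6, 12, 64
hidden_size = num_heads * head_dim  # 768

# Placeholder per-head outputs; in a real model each comes from its own attention computation.
head_outputs = [np.random.randn(batch_size, seq_len, head_dim) for _ in range(num_heads)]

# Step 1: concatenate along the hidden dimension -> (batch_size, seq_len, 768)
concat = np.concatenate(head_outputs, axis=-1)

# Step 2: project through the output matrix W_O (random here for illustration)
W_O = np.random.randn(hidden_size, hidden_size) * 0.02
output = concat @ W_O

print(concat.shape, output.shape)  # (2, 6, 768) (2, 6, 768)
```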
What is Batch Size?
- Definition: The number of independent input sequences (examples) processed simultaneously in one forward pass of the model.
- Notation: `batch_size` (the first dimension in `[batch_size, seq_len, hidden_dim]`).
Why it matters:
- A larger batch size means faster training (better parallelization) but higher memory usage.
- A smaller batch size can help with generalization but slows training.
- Example: If batch size = 4, it means we are processing 4 sentences at the same time.
What is Sequence Length (seq_len)?
- Definition: The number of tokens in each input sequence.
- Notation: `seq_len` (the second dimension in `[batch_size, seq_len, hidden_dim]`).
Why it matters:
- Shorter sequences are faster but may lose context.
- Longer sequences retain more context but increase memory usage.
- Example: for "The cat sat on the mat.", if tokenized as `["The", "cat", "sat", "on", "the", "mat", "."]`, then `seq_len` = 7 (since there are 7 tokens).
- For GPT-2 Small, the max sequence length is 1024 tokens.
What is Hidden Size?
- Definition: The number of dimensions in the vector representation of each token.
- Notation: `hidden_size` (the last dimension in `[batch_size, seq_len, hidden_dim]`, also written `hidden_dim`).
Why it matters:
- Larger hidden sizes can store more information but increase compute cost.
- Smaller hidden sizes make models more efficient but less expressive.
- Example (GPT-2 Small): hidden size = 768, meaning each token is represented by a 768-dimensional vector at every layer.
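A toy tensor tying the three quantities together (illustrative shapes only, matching GPT-2 Small's hidden size):

```python
import numpy as np

batch_size = 4     # 4 sentences processed at the same time
seq_len = 7        # "The cat sat on the mat." -> 7 tokens
hidden_size = 768  # GPT-2 Small's embedding width

activations = np.zeros((batch_size, seq_len, hidden_size))
print(activations.shape)  # (4, 7, 768)
```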