Attention Heads

A few informal notes on attention heads in a transformer model: what they are, how they develop through training, how they are used, and how their outputs are combined into the model's final representations.

What is an Attention Head?

An attention head is an independent unit within the multi-head attention mechanism. It computes attention scores between tokens in the input sequence to determine how much focus (or "attention") each token should give to every other token.

Each head operates in parallel but learns different types of relationships. For example, one head might focus on syntactic dependencies (like subject-verb relationships), while another might capture long-range dependencies in the sentence.

The QKV Formula in Multi-Head Attention

In multi-head self-attention, each input token's embedding $X$ is projected into three vectors:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Where:

  • $X$ is the input embedding matrix (shape: [batch_size, seq_len, hidden_dim]).
  • $W_Q, W_K, W_V$ are learned weight matrices that transform $X$ into the Query ($Q$), Key ($K$), and Value ($V$) matrices.
  • $Q, K, V$ all have shape [batch_size, seq_len, head_dim].
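
As a concrete illustration, here is a minimal NumPy sketch of the per-head Q/K/V projections. The dimensions (seq_len 7, hidden_dim 768, head_dim 64) match the GPT-2 Small example used later in these notes, and all values are random placeholders rather than trained weights:

```python
import numpy as np

batch_size, seq_len, hidden_dim, head_dim = 1, 7, 768, 64

# Random stand-ins for the token embeddings and the learned projection weights.
X = np.random.randn(batch_size, seq_len, hidden_dim)
W_Q = np.random.randn(hidden_dim, head_dim)
W_K = np.random.randn(hidden_dim, head_dim)
W_V = np.random.randn(hidden_dim, head_dim)

# Project the embeddings into query, key and value vectors for one head.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q.shape, K.shape, V.shape)  # (1, 7, 64) (1, 7, 64) (1, 7, 64)
```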

Defining $Q$ and $K$

  • Each token's query vector determines what it is looking for in other tokens.
  • It is compared against the key vectors $K$ (which describe how relevant other tokens are).
  • The resulting attention scores tell the model where to focus.

If a token's query $Q_i$ is similar to another token's key $K_j$, then that token gets more attention.

Computing Attention

Self-attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

  • $QK^T$ - computes similarity between queries and keys (how much token $i$ should attend to token $j$).
  • Softmax - converts similarities into probabilities (higher values → more focus).
  • Multiply by $V$ - the final attended representation is a weighted sum of values.
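
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head, continuing with the same hypothetical shapes as the projection example above (random data, not a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ V                                 # weighted sum of values

# Toy shapes: batch of 1, 7 tokens, head_dim 64 (random placeholder data).
Q = np.random.randn(1, 7, 64)
K = np.random.randn(1, 7, 64)
V = np.random.randn(1, 7, 64)
print(attention(Q, K, V).shape)  # (1, 7, 64)
```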

Example

Let’s say we have a sentence:

"The cat sat on the mat."

If "cat" has a strong query for subject-verb relations, it might attend to "sat", because "sat" has a corresponding key that aligns well.

  • $Q(\text{cat}) \approx K(\text{sat}) \rightarrow \text{strong attention}$
  • $Q(\text{cat}) \neq K(\text{mat}) \rightarrow \text{weak attention}$

The output for "cat" will therefore contain more information from "sat" than from unrelated words.


Softmax

Softmax converts a series of raw numbers into a probability distribution that sums to 1. This is most simply explained with an example. Suppose we have raw scores:

$$z = [2.0, 1.0, 0.1]$$

Softmax converts them into probabilities by using each score as an exponent of $e$. This ensures that every output is positive and that the distribution sums to 1:

  1. Compute exponentials:
$$e^{2.0} = 7.389, \quad e^{1.0} = 2.718, \quad e^{0.1} = 1.105$$
  2. Compute the sum of exponentials:
$$7.389 + 2.718 + 1.105 = 11.212$$
  3. Compute the softmax probabilities:
  • softmax(2.0) = $\frac{7.389}{11.212} = 0.659$
  • softmax(1.0) = $\frac{2.718}{11.212} = 0.242$
  • softmax(0.1) = $\frac{1.105}{11.212} = 0.099$

Now, the values sum to 1.0:

$$0.659 + 0.242 + 0.099 = 1.000$$

This means:

  • The highest score (2.0) gets the most weight (65.9%).
  • The lowest score (0.1) gets the least weight (9.9%).
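
The same arithmetic can be checked with a short NumPy snippet (a sketch of the worked example above, nothing GPT-2 specific):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])

exps = np.exp(z)              # [7.389, 2.718, 1.105]
probs = exps / exps.sum()     # divide by 11.212

print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```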

Why $e$ (Euler's Number)?

  • $e^x$ grows exponentially, which helps in probability weighting.
  • Derivatives of $e^x$ are easy to compute, which makes training deep learning models efficient.
  • $e^x$ ensures all outputs are positive, which is crucial for normalizing probabilities.

$$e = \lim_{n \to \infty}\left( 1+\frac{1}{n} \right)^n$$
  • The derivative of $e^x$ is itself: $\frac{d}{dx}e^x = e^x$
  • The natural logarithm: $\ln e = 1$
  • Compound interest calculations use $e$: $\text{Amount} = \text{principal} \cdot e^{\text{rate} \cdot \text{time}}$
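
For a quick sanity check of these properties (a throwaway sketch, not part of any transformer code), both the limit definition and the derivative can be verified numerically:

```python
import numpy as np

# Limit definition: (1 + 1/n)^n approaches e as n grows.
n = 1_000_000
print((1 + 1/n) ** n)   # ~2.71828 (close to np.e)

# The derivative of e^x equals e^x: compare a finite-difference
# estimate at x = 1 with the function value itself.
x, h = 1.0, 1e-6
numeric_derivative = (np.exp(x + h) - np.exp(x)) / h
print(numeric_derivative, np.exp(x))   # both ~2.71828
```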

What is Multi-Headed Attention?

Instead of using a single attention head, GPT-2 uses multiple heads (12 per layer in GPT-2 Small) to capture different types of information.

  • Each head has its own independent set of Q, K, and V weight matrices.
  • They all process the input sequence separately.
  • Their outputs are then concatenated and projected back into the model's representation space.

Outputs of Each Attention Head are Merged using Concatenation + Projection

After each head computes its weighted sum of values, their outputs are merged in two steps:

Step 1: Concatenation

Each attention head produces an output of shape (batch_size, seq_len, head_dim).

For multiple heads, we concatenate all head outputs along the hidden dimension:

$$\text{MultiHeadOutput} = \text{Concat}(\text{Head}_1, \text{Head}_2, ..., \text{Head}_h)$$

If the model has 12 heads and a hidden size of 768, then:

  • Each head operates on a subspace of size 64 (since $768 / 12 = 64$).
  • After concatenation, we get back to (batch_size, seq_len, 768).

Step 2: Linear Projection

To mix the information from all heads, the concatenated output is projected through a learned weight matrix:

$$\text{FinalOutput} = \text{MultiHeadOutput} \cdot W_O$$

Where $W_O$ (the output projection matrix) has shape (hidden_size, hidden_size).

This ensures that:

  • The individual head outputs are blended together.
  • The final output remains the same dimensionality as the input.
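
Here is a minimal NumPy sketch of the concatenate-and-project step, using GPT-2 Small's dimensions (12 heads of size 64, hidden size 768); the per-head outputs and $W_O$ are random placeholders rather than real model weights:

```python
import numpy as np

batch_size, seq_len = 1, 7
num_heads, head_dim = 12, 64
hidden_size = num_heads * head_dim            # 768

# Random stand-ins for the per-head attention outputs.
head_outputs = [np.random.randn(batch_size, seq_len, head_dim)
                for _ in range(num_heads)]

# Step 1: concatenate along the hidden dimension -> (1, 7, 768).
multi_head_output = np.concatenate(head_outputs, axis=-1)

# Step 2: project through the output matrix W_O (random here, learned in practice).
W_O = np.random.randn(hidden_size, hidden_size)
final_output = multi_head_output @ W_O        # (1, 7, 768)

print(multi_head_output.shape, final_output.shape)
```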

What is Batch Size?

  • Definition: The number of independent input sequences (examples) processed simultaneously in one forward pass of the model.
  • Notation: batch size = BB

Why it matters:

  • A larger batch size generally means faster training (better hardware parallelization).
  • A smaller batch size can help with generalization but slows training.
  • Example: If batch size = 4, it means we are processing 4 sentences at the same time.

What is Sequence Length (seq_len)?

  • Definition: The number of tokens in each input sequence.
  • Notation: seq_len = NN

Why it matters:

  • Shorter sequences are faster but may lose context.
  • Longer sequences retain more context but increase memory usage.
  • Example: - for "The cat sat on the mat.", if tokenized as ["The", "cat", "sat", "on", "the", "mat", "."], then seq_len = 7 (since there are 7 tokens)
  • For GPT-2 Small, the max sequence length is 1024 tokens.

What is Hidden Size?

  • Definition: The number of dimensions in the vector representation of each token.
  • Notation: hidden size = dd

Why it matters:

  • Larger hidden sizes can store more information but increase compute cost.
  • Smaller hidden sizes make models more efficient but less expressive.
  • Example (GPT-2 Small): Hidden size = 768, meaning each token is represented by a 768-dimensional vector at every layer.
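
Putting these three quantities together, a hypothetical batch of GPT-2 Small activations would have the following shape (random data, just to illustrate the dimensions):

```python
import numpy as np

batch_size = 4      # 4 sentences processed at the same time
seq_len = 7         # "The cat sat on the mat." -> 7 tokens
hidden_size = 768   # GPT-2 Small representation size

# Every layer of the model passes around a tensor of this shape.
activations = np.random.randn(batch_size, seq_len, hidden_size)
print(activations.shape)   # (4, 7, 768)
```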