The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," revolutionized sequence-to-sequence modeling, particularly in Natural Language Processing (NLP). Prior to Transformers, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) were the dominant approaches for tasks like machine translation. Transformers, however, leveraged a novel mechanism called self-attention, enabling parallelization and significantly improving performance on long-range dependencies.
At its heart, the Transformer moves away from the sequential processing inherent in RNNs. Instead, it processes all input tokens simultaneously, using an "attention" mechanism to weigh the importance of different parts of the input sequence when encoding or decoding a specific token. This parallel processing capability is a game-changer, allowing for much faster training and the handling of longer sequences more effectively.
Imagine translating a sentence like "The quick brown fox jumps over the lazy dog." With RNNs, you'd process word by word. With Transformers, all words are considered at once, and the model learns how "quick" relates to "fox," and "jumps" relates to "dog," regardless of their distance in the sentence.
The fundamental idea is that for each word in the input, the model doesn't just look at that word in isolation. It "attends" to all other words in the sentence, assigning different levels of importance to them based on their relevance to the current word's meaning. This context-aware understanding is crucial for tasks like machine translation, where the meaning of a word can heavily depend on its surrounding words.
FIGURE 1.1: Parallel Processing vs. Sequential Processing
The Transformer adheres to an encoder-decoder structure, a common pattern in sequence-to-sequence models. This modular design allows it to handle both encoding the source sequence and decoding the target sequence effectively.
Both the encoder and decoder are composed of a stack of identical layers. The original Transformer used 6 layers for both the encoder and decoder, demonstrating the power of deep stacks in this architecture.
FIGURE 2.1: High-Level Transformer Architecture
Each encoder layer (and similarly, decoder layer, with some modifications) is a sophisticated block designed to process the sequence. It fundamentally consists of two main sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network, both described in detail below.
Crucially, each of these sub-layers is followed by a **residual connection** and **layer normalization**. This pattern is vital for training very deep networks like the Transformer.
Inspired by ResNets, these connections add the input of the sub-layer to its output. This creates a "shortcut" for gradients during backpropagation, helping to mitigate the vanishing gradient problem and allowing for training deeper networks more effectively. If X is the input to a sub-layer and Sublayer(X) is its output, the residual connection results in X + Sublayer(X). This ensures that the network can, at worst, learn an identity mapping, allowing subsequent layers to learn new transformations if beneficial.
Applied immediately after the residual connection, layer normalization normalizes the inputs across the features for each sample independently. This means that for each word vector, its elements are normalized. This stabilizes the hidden state activations, making training faster and more stable, especially in deep networks. The formula for layer normalization of a given input vector x (representing a single word's embedding) is:

LayerNorm(x) = γ ⋅ ( (x - μ) / σ ) + β

where μ is the mean of the elements in x, σ is the standard deviation of the elements in x, and γ (gain) and β (bias) are learnable scaling and shifting parameters, respectively. These parameters allow the network to re-learn optimal feature ranges if normalization proves too restrictive.
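As a rough illustration, here is a minimal NumPy sketch of the residual-plus-layer-norm pattern described above. The function names, the toy shapes, and the identity "sub-layer" are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each position's feature vector independently (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    # Post-norm arrangement used in the original paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x), gamma, beta)

# Toy usage: 3 tokens, model dimension 4, identity "sub-layer".
x = np.random.randn(3, 4)
gamma, beta = np.ones(4), np.zeros(4)
out = residual_block(x, lambda h: h, gamma, beta)
print(out.shape)  # (3, 4)
```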
FIGURE 3.1: Structure of an Encoder Layer
Self-attention is what truly sets the Transformer apart from its predecessors. For each token in the input sequence, it calculates a weighted sum of all tokens in the sequence. The weights are dynamically determined by how "relevant" each token is to the current token. This mechanism allows the model to capture long-range dependencies efficiently and directly, overcoming the limitations of RNNs with very long sequences.
The self-attention mechanism conceptually mirrors how search engines work. Each input vector xi (representing a word) is projected into three different vector spaces using three distinct learned linear transformations. These transformations are represented by the weight matrices WQ, WK, and WV, which are learned during training:

Q = X ⋅ WQ
K = X ⋅ WK
V = X ⋅ WV

Here, X is the matrix of input embeddings (or outputs from the previous layer) for the entire sequence.
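A minimal sketch of these projections in NumPy; the matrix sizes and random initialization are purely illustrative (in practice the weights are learned):

```python
import numpy as np

seq_len, d_model, d_k = 5, 8, 8                # illustrative sizes
X = np.random.randn(seq_len, d_model)          # input embeddings for the sequence

# Learned projection matrices (random here, trained in practice).
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each has shape (seq_len, d_k)
```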
The attention computation then proceeds in four steps:

1. **Compute relevance scores.** For each query vector (Qi, corresponding to word i), we calculate a dot product with all key vectors (Kj, for all words j in the sequence). This dot product measures the similarity or relevance between the current word i and every other word j; a higher dot product means higher relevance. The matrix product Q ⋅ KT efficiently computes these dot products for all queries against all keys simultaneously.

2. **Scale the scores.** The scores are divided by the square root of the key dimension (√dk). This scaling is crucial because large dot products can lead to extremely small gradients after the softmax function, especially when dk is large. Scaling helps to stabilize the training process.

   Scaled Scores = (Q ⋅ KT) / √(dk)

3. **Apply softmax.** The softmax function turns the scaled scores into attention weights that sum to 1 for each query.

   Attention Weights = Softmax(Scaled Scores)

4. **Compute the weighted sum.** We multiply each value vector Vj by its corresponding attention weight (the softmax output) and sum them up. This weighted sum is the output of the self-attention mechanism for the current word. It's a new, context-rich representation of the word that effectively incorporates information from the entire sequence, weighted by relevance.

   Output = Attention Weights ⋅ V
This entire process can be elegantly summarized by the single formula:
Attention(Q, K, V) = Softmax( (Q ⋅ KT) / √(dk) ) ⋅ V
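The formula translates almost line for line into code. The following NumPy sketch is an illustration under my own shape choices (the optional `mask` argument is an addition that will be useful for the decoder later), not the paper's reference implementation:

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise max for numerical stability.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) relevance scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    weights = softmax(scores)                  # each row sums to 1
    return weights @ V                         # weighted sum of value vectors

# Toy usage with any Q, K, V of shape (seq_len, d_k).
Q = K = V = np.random.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8)
```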
FIGURE 4.1: The Self-Attention Mechanism (Scaled Dot-Product Attention)
While a single self-attention function is powerful, the Transformer introduces Multi-Head Attention. Instead of performing a single attention function, the Query, Key, and Value are linearly projected h times with different, learned linear projections. These projected versions are then fed into parallel self-attention functions.
Why "multiple heads"? Each attention head allows the model to learn different types of relationships or focus on different aspects of the input sequence. For example, one head might learn to focus on syntactic dependencies (e.g., subject-verb agreement), while another might focus on semantic relationships (e.g., coreference resolution). This enriches the model's ability to capture diverse contextual information.
The computation proceeds in four steps:

1. **Linear projections.** For each of the h heads, the input Query, Key, and Value matrices (Q, K, V) are linearly projected into lower-dimensional spaces using different weight matrices (WQ,i, WK,i, WV,i for head i):

   Qi = Q ⋅ WQ,i
   Ki = K ⋅ WK,i
   Vi = V ⋅ WV,i

2. **Parallel attention.** Each projected triple (Qi, Ki, Vi) then goes through the standard Scaled Dot-Product Attention mechanism, yielding h different output matrices, let's call them headi:

   headi = Attention(Qi, Ki, Vi)

3. **Concatenation.** The outputs of the h attention heads are then concatenated together. This re-combines the diverse contextual information learned by each head:

   Concat Heads = [head1; head2; ...; headh]

4. **Final projection.** The concatenated output is passed through a final learned linear projection WO. This projection transforms the combined attention outputs into the desired output dimension, which is typically the same as the input dimension, so it can be added to the residual connection:

   MultiHead(Q, K, V) = Concat(head1, ..., headh) ⋅ WO
By allowing the model to attend to information from different representation subspaces at different positions, Multi-Head Attention significantly boosts the Transformer's capacity to learn complex relationships within the data.
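Putting the pieces together, a compact NumPy sketch of multi-head attention might look as follows. It reuses the `scaled_dot_product_attention` function sketched earlier; the per-head weight lists and the d_model = h ⋅ dk sizing are illustrative assumptions (real implementations usually fuse the per-head projections into single matrices and reshape for speed).

```python
import numpy as np

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of per-head projection matrices, each (d_model, d_k).
    # W_O: final output projection, (h * d_k, d_model).
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    return np.concatenate(heads, axis=-1) @ W_O   # concatenate, then project back

# Toy usage: 8 heads, d_model = 64, d_k = d_model // h = 8.
h, d_model = 8, 64
d_k = d_model // h
X = np.random.randn(10, d_model)
W_Q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_K = [np.random.randn(d_model, d_k) for _ in range(h)]
W_V = [np.random.randn(d_model, d_k) for _ in range(h)]
W_O = np.random.randn(h * d_k, d_model)
out = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)   # self-attention: Q = K = V = X
print(out.shape)  # (10, 64)
```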
FIGURE 5.1: Multi-Head Attention Mechanism
One of the key advantages of the Transformer is its parallel processing of input sequences. However, this parallelism means it inherently lacks a sense of token order or position. Unlike RNNs, which process tokens sequentially and thus implicitly encode position, the self-attention mechanism treats all tokens equally regardless of their position in the sequence.
To overcome this, the Transformer introduces Positional Encodings. These are vectors that are added to the input embeddings (and later to the decoder input embeddings) before they are fed into the first encoder (or decoder) layer. These positional encodings provide explicit information about the relative or absolute position of each token in the sequence.
For a token at position pos and dimension i within the positional encoding vector, the values are calculated as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

where dmodel is the dimensionality of the embedding space.
This design ensures that each position has a unique encoding, and the encodings for nearby positions are similar, making it easy for the network to learn relative positions. The positional encoding is simply added to the token embedding:

Inputfinal = Embedding(token) + PositionalEncoding(position)

Since the embeddings and positional encodings have the same dimension (dmodel), they can be summed directly. The intuition behind using sines and cosines is that they can represent relative positions: sin(a+b) = sin(a)cos(b) + cos(a)sin(b). This property allows the model to easily learn to attend to relative positions (e.g., "always look 3 words before the current word") by using linear transformations of the positional encodings.
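A minimal sketch of the sinusoidal encoding following the formulas above; the sequence length and (even) embedding dimension are arbitrary illustration choices:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # token positions
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimensions 2i = 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # PE(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angles)                       # PE(pos, 2i+1) = cos(...)
    return pe

# The encoding is simply added to the token embeddings.
embeddings = np.random.randn(50, 128)
inputs = embeddings + positional_encoding(50, 128)
```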
Without positional encodings, "The dog bites the man" would be indistinguishable from "The man bites the dog" in terms of word order for the attention mechanism, leading to a complete loss of meaning. Positional encodings restore this crucial sequential information.
FIGURE 6.1: Positional Encoding Process
The decoder layer shares many similarities with the encoder layer but has crucial additions to facilitate the generation of an output sequence token by token. Each decoder layer also consists of residual connections and layer normalization after its sub-layers.
A decoder layer is composed of three main sub-layers: a masked multi-head self-attention mechanism over the previously generated output tokens, an encoder-decoder ("cross") attention mechanism over the encoder's output, and a position-wise feed-forward network.
In the decoder, when predicting the next word in the output sequence, the model should only be allowed to attend to the words it has *already generated* (or the "start of sequence" token) and not the words it's yet to predict. To enforce this, a mask is applied to the scaled dot-product scores (before the softmax) during self-attention in the decoder.
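Concretely, the mask is just a lower-triangular boolean matrix over the score positions; a brief NumPy sketch (sizes and the -1e9 sentinel are illustrative choices, matching the optional `mask` argument in the attention sketch above):

```python
import numpy as np

seq_len = 5
# True where attention is allowed: position i may attend to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Masked-out scores are pushed to a very large negative value so that the
# softmax assigns them (effectively) zero weight.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(causal_mask, scores, -1e9)
```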
This masking ensures that the prediction for position i can only depend on the known outputs at positions less than i. This preserves the auto-regressive property required for sequence generation.

The second sub-layer, encoder-decoder attention, is a unique attention mechanism within the decoder. It's often called "cross-attention" because it attends across two different sequences: the Queries come from the decoder's previous sub-layer, while the Keys and Values come from the encoder's output.
This allows the decoder to "look at" and "attend to" the relevant parts of the entire input sequence provided by the encoder when generating each output token. For instance, when translating "The dog barks," and the decoder is generating "barks," it can attend to "dog" in the encoder's representation to ensure semantic consistency.
This cross-attention layer is crucial for linking the source and target language representations, enabling the translation process.
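In code, cross-attention is the same attention function with its inputs drawn from two sources. A hedged sketch reusing the `scaled_dot_product_attention` function from earlier (shapes are illustrative, and the learned projections are omitted for brevity):

```python
import numpy as np

encoder_output = np.random.randn(12, 64)   # one vector per source token
decoder_state = np.random.randn(7, 64)     # one vector per generated target token

# Queries come from the decoder; Keys and Values come from the encoder output.
cross = scaled_dot_product_attention(Q=decoder_state,
                                     K=encoder_output,
                                     V=encoder_output)
print(cross.shape)  # (7, 64): one source-aware context vector per target position
```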
FIGURE 7.1: Structure of a Decoder Layer
Both the encoder and decoder layers contain a Position-wise Feed-Forward Network after the attention sub-layers. This is a relatively simple but important component that processes each position (i.e., each word's vector representation) independently and identically.
FFN(x) = max(0, xW1 + b1)W2 + b2
Here, x is the input vector for a specific position (word), and W1, b1, W2, b2 are learnable parameters. Typically, the first linear layer expands the dimensionality (e.g., from dmodel to 4 * dmodel), and the second linear layer contracts it back to the original dmodel. This allows the network to apply more complex, non-linear transformations to each token's representation independently, after it has been "contextualized" by the attention mechanisms.

While attention allows words to interact with each other and gather global context, the FFN allows the model to process this gathered information locally at each position. It adds non-linearity and further transformations to the contextualized embeddings, potentially learning more complex feature combinations. It can be thought of as a way to enrich the representation of each token based on the information it has "absorbed" from other tokens via the attention mechanism.
The combination of global context modeling (attention) and local, independent processing (FFN) gives the Transformer its immense power. The residual connections and layer normalization around these sub-layers are critical for stable training, especially with deeper architectures.
Applying a single, large FFN across the entire sequence would lose the individual positional processing and significantly increase the number of parameters, making the model less efficient and potentially harder to train. The position-wise approach keeps the parameter count manageable while still providing powerful transformations.
FIGURE 8.1: Position-wise Feed-Forward Network
The Transformer architecture has dominated NLP and is increasingly used in computer vision due to its powerful capabilities. However, it's essential to understand its strengths and weaknesses.
The most significant limitation is the cost of self-attention, which scales as O(N² ⋅ d), where N is the sequence length and d is the embedding dimension. For very long sequences, this quadratic dependency can become prohibitive in terms of memory and computation. This has led to research into "linear attention" and sparse attention mechanisms.

FIGURE 9.1: Transformer Pros and Cons
The original Transformer architecture laid a robust foundation, but it has been continuously built upon and adapted. Its core ideas have spawned an entire family of models and influenced nearly every corner of deep learning.
Adaptations to other modalities, such as computer vision, must also contend with the O(N²) complexity for high-resolution inputs. Even so, the Transformer's ability to handle parallel computation and capture long-range dependencies effectively has made it the de facto standard for state-of-the-art results across NLP and, increasingly, computer vision and other domains.
The flexibility of the attention mechanism and the encoder-decoder paradigm means that Transformers are likely to continue to be at the forefront of AI research. Future work will likely focus on improving their efficiency, making them more adaptable to various data types, and integrating them into multimodal systems that understand and generate across text, images, audio, and more.
Understanding the core components – self-attention, multi-head attention, positional encodings, and the encoder-decoder structure – is essential for anyone delving into modern deep learning. The Transformer truly represents a paradigm shift in how we approach sequence modeling.
FIGURE 10.1: Transformer Family and Applications