This primer expands on the key deep learning concepts that bridge the gap between the foundational neural networks described in Michael Nielsen's book, "Neural Networks and Deep Learning," and the advanced, attention-based Transformer architecture that powers today's leading AI models.
Feedforward Neural Networks and Convolutional Neural Networks (CNNs) are powerful, but they process information in a static, one-shot manner. Each input (like an image or a row of data) is treated as a separate, independent event. This design is fundamentally unsuited for sequential data like text, speech, or time-series financial data, where order and context are everything.
Imagine trying to read a book by looking at each word in isolation. You would lose the plot, character development, and all narrative meaning. Basic networks face this same problem; they lack a mechanism for memory or understanding context over time.
To solve the memory problem, Recurrent Neural Networks (RNNs) were introduced. The core innovation is the hidden state, which is passed from one step of the sequence to the next. This hidden state acts as a compressed summary of all the information the network has seen so far.
At each step (e.g., for each word in a sentence), the RNN performs two key actions: it combines the current input with the hidden state carried over from the previous step, and it produces an updated hidden state (and, when needed, an output) that is passed on to the next step.
Analogy: Think of an RNN as a person reading a sentence. With each new word, they update their mental understanding (the hidden state) of the sentence's meaning so far. Their understanding of the word "it" in "The cat chased the mouse, and it was fast" depends entirely on the context built from previous words.
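To make the hidden-state idea concrete, here is a minimal NumPy sketch of a single "vanilla" RNN step. The weight names (W_xh, W_hh, b_h) and sizes are illustrative assumptions, not any particular library's API.

```python
# A minimal sketch of one vanilla RNN step, using NumPy.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the "memory" path)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state,
    then squash the result to produce the updated hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a toy "sentence" of 5 word vectors, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(x_t, h)   # h now summarizes everything seen so far
```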
While revolutionary, simple RNNs struggle with long sequences due to the vanishing gradient problem. During training, information from early steps has to travel through many steps of recurrence to influence the output at later steps. This often causes the gradient (the signal used for learning) to shrink exponentially, becoming so small that the network effectively stops learning from earlier parts of the sequence. This is like our reader forgetting the beginning of a long paragraph by the time they reach the end.
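A tiny numeric illustration of why this happens: if the gradient contribution per recurrence step is typically a factor smaller than one (0.9 here is an assumed, illustrative value), the product of those factors decays exponentially with distance.

```python
# Toy illustration of the vanishing gradient: in backpropagation through time,
# the gradient reaching an early step is (roughly) a product of per-step factors.
# If those factors are typically below 1, the product shrinks exponentially.
per_step_factor = 0.9   # assumed, illustrative per-step gradient factor
for steps_back in (1, 10, 50, 100):
    print(steps_back, per_step_factor ** steps_back)
# 1 -> 0.9, 10 -> ~0.35, 50 -> ~0.005, 100 -> ~0.00003:
# the learning signal from 100 steps ago is effectively gone.
```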
To overcome the vanishing gradient problem, more sophisticated RNN variants were developed: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). These architectures introduce "gates": special neural networks that regulate the flow of information through the sequence.
LSTMs maintain a separate cell state (the long-term memory) in addition to the hidden state. They use three gates to manage this memory: a forget gate that decides which parts of the cell state to discard, an input gate that decides which new information to write into it, and an output gate that decides which parts of the memory to expose as the hidden state.
GRUs are a simplified, more computationally efficient version of LSTMs. They combine the forget and input gates into a single update gate and have a reset gate to control how much past information is forgotten. They often perform just as well as LSTMs on many tasks.
Analogy: LSTMs and GRUs are like having a sophisticated note-taking system. The gates act as a smart assistant that decides when to cross out irrelevant old notes (forget gate), when to add new, important information (input gate), and which notes are relevant for the current task (output gate). This allows for much better management of long-term dependencies.
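Here is a minimal NumPy sketch of one LSTM step showing the three gates in action. Weight names and sizes are illustrative assumptions, and biases are omitted for brevity.

```python
# A minimal sketch of one LSTM step with its forget, input, and output gates.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, plus one for the candidate cell update.
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))
                      for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W_f @ z)            # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z)            # input gate: how much new information to write
    o = sigmoid(W_o @ z)            # output gate: what part of memory to expose
    c_tilde = np.tanh(W_c @ z)      # candidate new content
    c = f * c_prev + i * c_tilde    # updated long-term memory (cell state)
    h = o * np.tanh(c)              # updated hidden state
    return h, c

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c)
```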
For tasks like machine translation, where an input sequence must be mapped to a different output sequence, the Sequence-to-Sequence (Seq2Seq) framework became dominant. It consists of two RNNs (usually LSTMs or GRUs): an encoder, which reads the input sequence and compresses it into a single fixed-size context vector, and a decoder, which generates the output sequence using that context vector as its starting point.
The main weakness of this architecture is that the single context vector becomes an informational bottleneck. Forcing the encoder to cram the meaning of a long, complex sentence into one fixed-size vector is extremely difficult. Inevitably, information is lost, particularly from the beginning of the sequence.
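The sketch below illustrates the pattern and the bottleneck under the same simplified assumptions as the earlier RNN example: the encoder's final hidden state is the only thing handed to the decoder.

```python
# A minimal sketch of the Seq2Seq pattern: the encoder compresses the whole
# input into one fixed-size context vector, and the decoder generates from it.
import numpy as np

rng = np.random.default_rng(1)
emb, hidden = 8, 16

W_enc_x = rng.normal(scale=0.1, size=(hidden, emb))
W_enc_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_dec_x = rng.normal(scale=0.1, size=(hidden, emb))
W_dec_h = rng.normal(scale=0.1, size=(hidden, hidden))

def step(W_x, W_h, x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev)

# Encoder: read the source sentence, keeping only the final hidden state.
source = rng.normal(size=(12, emb))        # 12 "source word" embeddings
h = np.zeros(hidden)
for x_t in source:
    h = step(W_enc_x, W_enc_h, x_t, h)
context = h                                 # the single, fixed-size bottleneck

# Decoder: generate, conditioned only on that one vector (plus its own outputs).
h_dec = context
prev_token = np.zeros(emb)                  # stand-in for a <start> token embedding
for _ in range(5):
    h_dec = step(W_dec_x, W_dec_h, prev_token, h_dec)
    prev_token = rng.normal(size=emb)       # placeholder for the embedded predicted word
```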
The attention mechanism was created to solve the Seq2Seq bottleneck. Instead of a single context vector, attention allows the decoder to look back at the entire set of encoder hidden states from the input sequence at every step of its generation process.
It works by assigning an attention score (a weight) to each input word's hidden state when generating each output word. This means the model can dynamically focus on the most relevant parts of the input sequence for the specific word it's trying to produce.
Analogy: When a human translator translates a sentence, they don't just read the source sentence once and then translate from memory. For each word they write, they might glance back and focus (pay attention to) a specific word or phrase in the original text. The attention mechanism mimics this behavior.
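Here is a minimal sketch of dot-product attention over a set of encoder hidden states, with illustrative shapes: the decoder state scores every encoder state, the scores are normalized with softmax, and the context vector is rebuilt as a weighted sum at every decoding step.

```python
# A minimal sketch of dot-product attention over encoder hidden states.
import numpy as np

rng = np.random.default_rng(2)
hidden, src_len = 16, 12

encoder_states = rng.normal(size=(src_len, hidden))   # one vector per input word
decoder_state = rng.normal(size=hidden)                # current decoder hidden state

scores = encoder_states @ decoder_state                # relevance of each input word
weights = np.exp(scores - scores.max())
weights /= weights.sum()                               # softmax: attention weights sum to 1
context = weights @ encoder_states                     # weighted summary, rebuilt every step

print(weights.round(3))   # where the decoder is "looking" right now
```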
The 2017 paper "Attention Is All You Need" introduced the Transformer, an architecture that demonstrated that recurrence (the sequential processing of RNNs) was not necessary. It relies entirely on attention mechanisms to draw global dependencies between input and output.
Key innovations of the Transformer include self-attention, which lets every position in a sequence attend directly to every other position; multi-head attention, which runs several attention operations in parallel so the model can capture different kinds of relationships at once; positional encodings, which inject word-order information that attention alone does not provide; and the removal of recurrence, which allows the entire sequence to be processed in parallel.
The Transformer's parallelizable design and its powerful ability to model complex relationships using self-attention made it the foundation for virtually all modern large language models (LLMs), including models like GPT-4 and BERT.
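To ground the idea, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (no masking, no multi-head split, no feed-forward layers); the matrix names and sizes are illustrative assumptions.

```python
# A minimal sketch of scaled dot-product self-attention (single head).
import numpy as np

rng = np.random.default_rng(3)
seq_len, d_model, d_k = 6, 32, 32

X = rng.normal(size=(seq_len, d_model))                 # one row per token
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                     # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                         # every token scores every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
output = weights @ V                                    # contextualized token representations

print(output.shape)   # (6, 32): same length as the input, but each token now "sees" all the others
```

Because every token's output is computed from the whole sequence in a few matrix multiplications, there is no step-by-step recurrence to wait for, which is what makes the architecture so parallelizable.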