This primer expands on the key deep learning concepts that bridge the gap between the foundational neural networks described in Michael Nielsen's book, "Neural Networks and Deep Learning," and the advanced, attention-based Transformer architecture that powers today's leading AI models.
Feedforward Neural Networks and Convolutional Neural Networks (CNNs) are powerful, but they process information in a static, one-shot manner. Each input (like an image or a row of data) is treated as a separate, independent event. This design is fundamentally unsuited for sequential data like text, speech, or time-series financial data, where order and context are everything.
Imagine trying to read a book by looking at each word in isolation. You would lose the plot, character development, and all narrative meaning. Basic networks face this same problem; they lack a mechanism for memory or understanding context over time.
To solve the memory problem, Recurrent Neural Networks (RNNs) were introduced. The core innovation is the hidden state, which is passed from one step of the sequence to the next. This hidden state acts as a compressed summary of all the information the network has seen so far.
At each step (e.g., for each word in a sentence), the RNN performs two key actions: it combines the current input with the hidden state carried over from the previous step, and it produces an updated hidden state (and, when needed, an output) that is passed on to the next step.
Analogy: Think of an RNN as a person reading a sentence. With each new word, they update their mental understanding (the hidden state) of the sentence's meaning so far. Their understanding of the word "it" in "The cat chased the mouse, and it was fast" depends entirely on the context built from previous words.
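To make the hidden-state idea concrete, here is a minimal NumPy sketch of a single "vanilla" RNN step. The weight names (W_xh, W_hh, b_h) and sizes are illustrative assumptions, not any particular library's API.

```python
# A minimal sketch of one vanilla RNN step, using NumPy.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the "memory" path)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state,
    then squash the result to produce the updated hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a toy "sentence" of 5 word vectors, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(x_t, h)   # h now summarizes everything seen so far
```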
While revolutionary, simple RNNs struggle with long sequences due to the vanishing gradient problem. During training, information from early steps has to travel through many steps of recurrence to influence the output at later steps. This often causes the gradient (the signal used for learning) to shrink exponentially, becoming so small that the network effectively stops learning from earlier parts of the sequence. This is like our reader forgetting the beginning of a long paragraph by the time they reach the end.
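A tiny numeric illustration of why this happens: if the gradient contribution per recurrence step is typically a factor smaller than one (0.9 here is an assumed, illustrative value), the product of those factors decays exponentially with distance.

```python
# Toy illustration of the vanishing gradient: in backpropagation through time,
# the gradient reaching an early step is (roughly) a product of per-step factors.
# If those factors are typically below 1, the product shrinks exponentially.
per_step_factor = 0.9   # assumed, illustrative per-step gradient factor
for steps_back in (1, 10, 50, 100):
    print(steps_back, per_step_factor ** steps_back)
# 1 -> 0.9, 10 -> ~0.35, 50 -> ~0.005, 100 -> ~0.00003:
# the learning signal from 100 steps ago is effectively gone.
```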
To overcome the vanishing gradient problem, more sophisticated RNN variants were developed: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). These architectures introduce "gates": special neural networks that regulate the flow of information through the sequence.
LSTMs maintain a separate cell state (the long-term memory) in addition to the hidden state. They use three gates to manage this memory: a forget gate that decides which parts of the cell state to discard, an input gate that decides which new information to write into it, and an output gate that decides which parts of the memory to expose as the hidden state.
GRUs are a simplified, more computationally efficient version of LSTMs. They combine the forget and input gates into a single update gate and have a reset gate to control how much past information is forgotten. They often perform just as well as LSTMs on many tasks.
Analogy: LSTMs and GRUs are like having a sophisticated note-taking system. The gates act as a smart assistant that decides when to cross out irrelevant old notes (forget gate), when to add new, important information (input gate), and which notes are relevant for the current task (output gate). This allows for much better management of long-term dependencies.
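Here is a minimal NumPy sketch of one LSTM step showing the three gates in action. Weight names and sizes are illustrative assumptions, and biases are omitted for brevity.

```python
# A minimal sketch of one LSTM step with its forget, input, and output gates.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, plus one for the candidate cell update.
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))
                      for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W_f @ z)            # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z)            # input gate: how much new information to write
    o = sigmoid(W_o @ z)            # output gate: what part of memory to expose
    c_tilde = np.tanh(W_c @ z)      # candidate new content
    c = f * c_prev + i * c_tilde    # updated long-term memory (cell state)
    h = o * np.tanh(c)              # updated hidden state
    return h, c

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c)
```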
For tasks like machine translation, where an input sequence must be mapped to a different output sequence, the Sequence-to-Sequence (Seq2Seq) framework became dominant. It consists of two RNNs (usually LSTMs or GRUs): an encoder, which reads the input sequence and compresses it into a single fixed-size context vector, and a decoder, which generates the output sequence using that context vector as its starting point.
The main weakness of this architecture is that the single context vector becomes an informational bottleneck. Forcing the encoder to cram the meaning of a long, complex sentence into one fixed-size vector is extremely difficult. Inevitably, information is lost, particularly from the beginning of the sequence.
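The sketch below illustrates the pattern and the bottleneck under the same simplified assumptions as the earlier RNN example: the encoder's final hidden state is the only thing handed to the decoder.

```python
# A minimal sketch of the Seq2Seq pattern: the encoder compresses the whole
# input into one fixed-size context vector, and the decoder generates from it.
import numpy as np

rng = np.random.default_rng(1)
emb, hidden = 8, 16

W_enc_x = rng.normal(scale=0.1, size=(hidden, emb))
W_enc_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_dec_x = rng.normal(scale=0.1, size=(hidden, emb))
W_dec_h = rng.normal(scale=0.1, size=(hidden, hidden))

def step(W_x, W_h, x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev)

# Encoder: read the source sentence, keeping only the final hidden state.
source = rng.normal(size=(12, emb))        # 12 "source word" embeddings
h = np.zeros(hidden)
for x_t in source:
    h = step(W_enc_x, W_enc_h, x_t, h)
context = h                                 # the single, fixed-size bottleneck

# Decoder: generate, conditioned only on that one vector (plus its own outputs).
h_dec = context
prev_token = np.zeros(emb)                  # stand-in for a <start> token embedding
for _ in range(5):
    h_dec = step(W_dec_x, W_dec_h, prev_token, h_dec)
    prev_token = rng.normal(size=emb)       # placeholder for the embedded predicted word
```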
The attention mechanism was created to solve the Seq2Seq bottleneck. Instead of a single context vector, attention allows the decoder to look back at the entire set of encoder hidden states from the input sequence at every step of its generation process.
It works by assigning an attention score (a weight) to each input word's hidden state when generating each output word. This means the model can dynamically focus on the most relevant parts of the input sequence for the specific word it's trying to produce.
Analogy: When a human translator translates a sentence, they don't just read the source sentence once and then translate from memory. For each word they write, they might glance back and focus (pay attention to) a specific word or phrase in the original text. The attention mechanism mimics this behavior.
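Here is a minimal sketch of dot-product attention over a set of encoder hidden states, with illustrative shapes: the decoder state scores every encoder state, the scores are normalized with softmax, and the context vector is rebuilt as a weighted sum at every decoding step.

```python
# A minimal sketch of dot-product attention over encoder hidden states.
import numpy as np

rng = np.random.default_rng(2)
hidden, src_len = 16, 12

encoder_states = rng.normal(size=(src_len, hidden))   # one vector per input word
decoder_state = rng.normal(size=hidden)                # current decoder hidden state

scores = encoder_states @ decoder_state                # relevance of each input word
weights = np.exp(scores - scores.max())
weights /= weights.sum()                               # softmax: attention weights sum to 1
context = weights @ encoder_states                     # weighted summary, rebuilt every step

print(weights.round(3))   # where the decoder is "looking" right now
```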
The 2017 paper "Attention Is All You Need" introduced the Transformer, an architecture that demonstrated that recurrence (the sequential processing of RNNs) was not necessary. It relies entirely on attention mechanisms to draw global dependencies between input and output.
Key innovations of the Transformer include self-attention, which lets every position in a sequence attend directly to every other position; multi-head attention, which runs several attention operations in parallel so the model can capture different kinds of relationships at once; positional encodings, which inject word-order information that attention alone does not provide; and the removal of recurrence, which allows the entire sequence to be processed in parallel.
The Transformer's parallelizable design and its powerful ability to model complex relationships using self-attention made it the foundation for virtually all modern large language models (LLMs), including models like GPT-4 and BERT.
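To ground the idea, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (no masking, no multi-head split, no feed-forward layers); the matrix names and sizes are illustrative assumptions.

```python
# A minimal sketch of scaled dot-product self-attention (single head).
import numpy as np

rng = np.random.default_rng(3)
seq_len, d_model, d_k = 6, 32, 32

X = rng.normal(size=(seq_len, d_model))                 # one row per token
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                     # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                         # every token scores every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
output = weights @ V                                    # contextualized token representations

print(output.shape)   # (6, 32): same length as the input, but each token now "sees" all the others
```

Because every token's output is computed from the whole sequence in a few matrix multiplications, there is no step-by-step recurrence to wait for, which is what makes the architecture so parallelizable.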