Attention Is All You Need
Vaswani et al.
“The paper that sparked the LLM revolution. Essential reading for understanding how modern language models are built.”
“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
Self-attention lets every position in a sequence attend to every other position in a single step — unlike RNNs which must pass information through a chain of steps. This is the key insight that makes Transformers both faster to train and better at capturing long-range dependencies.
Multi-head attention runs h parallel attention functions on projected subspaces, then concatenates the results. Different heads learn to attend to different types of relationships simultaneously — one head might track syntactic structure while another tracks semantic similarity.