Paper

Attention Is All You Need

Vaswani et al.

Machine Learning · NLP · Transformers

Feb 2025

“The paper that sparked the LLM revolution. Essential reading for understanding how modern language models are built.”

Quote·Abstract

“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”

Takeaway

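Self-attention lets every position in a sequence attend to every other position in a single step, whereas an RNN must pass information through a chain of sequential steps. Because the attention computation has no sequential dependency across positions, it parallelizes well during training, and the path between any two positions has constant length. This is the key insight that makes Transformers both faster to train and better at capturing long-range dependencies.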
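To make the "single step" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the queries, keys, and values are all the raw input vectors (the paper applies learned projections first), and the names `softmax`, `self_attention`, and `tokens` are illustrative rather than taken from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x is a list of n vectors (lists of floats), each of dimension d.
    Assumption: learned projections are omitted to keep the sketch short.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    # weights[i][j] is how strongly position i attends to position j.
    weights = [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in x])
        for q in x
    ]
    # Each output vector is a weighted average of all input vectors.
    out = [
        [sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return out, weights

# Toy input: three positions, two features each (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens)
```

Every row of `weights` has a nonzero entry for every position, so each output mixes information from the whole sequence at once, with no recurrence over positions.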

Note·§3.2 — Multi-Head Attention

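Multi-head attention runs h parallel attention functions on learned linear projections of the queries, keys, and values, then concatenates the results and applies a final linear projection. The paper uses h = 8 heads, each with dimension d_k = d_v = d_model / h = 64. Different heads can learn to attend to different types of relationships simultaneously: one head might track syntactic structure while another tracks semantic similarity.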
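The project, attend, concatenate pattern can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the projection matrices are random rather than trained, the sizes are toy values, and helper names such as `matmul`, `random_matrix`, and `multi_head_attention` are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection (assumption: random, untrained)."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]

def multi_head_attention(x, h, rng):
    """Run h attention heads on projected subspaces, concatenate, project."""
    d_model = len(x[0])
    assert d_model % h == 0, "d_model must be divisible by h"
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Each head has its own query, key, and value projections.
        w_q = random_matrix(d_model, d_k, rng)
        w_k = random_matrix(d_model, d_k, rng)
        w_v = random_matrix(d_model, d_k, rng)
        heads.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the head outputs along the feature dimension.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    # Final output projection back to d_model.
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)

rng = random.Random(0)
# Toy sizes: 4 positions, d_model = 8, h = 2 (the paper uses 512 and 8).
x = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
y = multi_head_attention(x, h=2, rng=rng)
```

Because each head works in a subspace of size d_model / h, the total cost stays close to that of a single full-width head, which is the trade-off the paper describes.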
