Visualizing Transformers and Attention#
This is a summary note from Grant Sanderson’s talk at TNG Big Tech 2024. My earlier article on transformers can be found here.
Transformers and Their Flexibility#
- Origin: Introduced in 2017 in the “Attention is All You Need” paper, originally for machine translation.
- Applications Beyond Translation: Used in transcription (e.g., Whisper), text-to-speech, and even image classification.
- Chatbot Models: Focused on models trained to predict the next token in a sequence, generating text iteratively one token at a time.
Next Token Prediction and Creativity#
- Prediction Process: Predicts probabilities for possible next tokens, selects one, and repeats the process.
- Temperature Control: Adjusting randomness in token selection trades off creativity against predictability in outputs (see the sketch below).
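A minimal sketch of temperature-scaled sampling; the logits and temperature value are made up for illustration:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Turn raw scores into probabilities with a temperature, then sample one token.

    Lower temperature sharpens the distribution (more predictable);
    higher temperature flattens it (more varied / "creative").
    """
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs), probs

# Toy example: three candidate next tokens with raw scores.
token, probs = sample_next_token([2.0, 1.0, 0.1], temperature=0.5)
print(token, probs.round(3))
```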
Tokens and Tokenization#
- What are Tokens? Subdivisions of input data (words, subwords, punctuation, or image patches).
- Why Not Characters? Using characters increases context size and computational complexity; tokens balance meaning and computational efficiency.
- Byte Pair Encoding (BPE): A common method for tokenization.
Embedding Tokens into Vectors#
- Embedding: Tokens are mapped to high-dimensional vectors representing their meaning (see the lookup sketch below).
- Contextual Meaning: Vectors evolve through the network to capture context, disambiguate meaning, and encode relationships.
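A minimal sketch of the embedding lookup, assuming a toy three-word vocabulary and randomly initialized vectors (real models learn these weights and use thousands of dimensions):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}            # toy vocabulary
d_model = 8                                        # toy embedding dimension
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

# Each token id simply selects a row of the embedding matrix.
token_ids = [vocab[t] for t in ["the", "cat", "sat"]]
vectors = embedding_matrix[token_ids]
print(vectors.shape)                               # (3, 8): one vector per token
```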
The Attention Mechanism#
- Purpose: Enables tokens to “attend” to others, updating their vectors based on relevance.
- Key Components:
- Query Matrix: Encodes what a token is “looking for.”
- Key Matrix: Encodes how a token responds to queries.
- Value Matrix: Encodes information passed between tokens.
- Calculations:
- Dot Product: Measures alignment between keys and queries.
- Softmax: Converts dot products into normalized weights for updates.
- Masked Attention: Ensures causality by blocking future tokens from influencing past ones (see the sketch below).
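A minimal numpy sketch of a single attention head with a causal mask; matrix sizes are toy values, and the scaling by the square root of the key dimension follows the original paper:

```python
import numpy as np

def causal_attention(X, W_q, W_k, W_v):
    """One attention head over token vectors X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # dot products: query/key alignment
    future = np.triu(np.ones_like(scores), k=1)   # 1s above the diagonal mark future tokens
    scores = np.where(future == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(causal_attention(X, W_q, W_k, W_v).shape)   # (4, 4)
```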
Multi-Headed Attention#
- Parallel Heads: Multiple attention heads allow different types of relationships (e.g., grammar, semantic context) to be processed simultaneously (see the head-splitting sketch below).
- Efficiency on GPUs: Designed to maximize parallelization for faster computation.
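A small sketch of the reshape that splits the model dimension into parallel heads; the dimensions are toy values:

```python
import numpy as np

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = np.random.default_rng(0).normal(size=(seq_len, d_model))

# Split each token vector into n_heads slices of size d_head and move the head
# axis to the front, so every head can attend over the sequence independently.
heads = X.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(heads.shape)   # (2, 4, 4): (head, position, per-head features)
```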
Multi-Layer Perceptrons (MLPs)#
- Role in Transformers:
- Add capacity for general knowledge and non-contextual reasoning.
- Store facts learned during training, e.g., associations like “Michael Jordan plays basketball.”
- Parameters: MLPs hold the majority of the model’s parameters (see the sketch below).
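A minimal sketch of the position-wise MLP block, assuming a GELU non-linearity and the common 4x hidden-layer expansion; the weights here are random placeholders:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, a common choice in GPT-style models
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(X, W_in, b_in, W_out, b_out):
    """Two-layer MLP applied to every token vector independently."""
    return gelu(X @ W_in + b_in) @ W_out + b_out

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_hidden = 4 * d_model                   # typical expansion factor
out = mlp_block(
    rng.normal(size=(seq_len, d_model)),
    rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden),
    rng.normal(size=(d_hidden, d_model)), np.zeros(d_model),
)
print(out.shape)                         # (4, 8): same shape in and out
```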
Training Transformers#
- Learning Framework:
- Models are trained on vast datasets using next-token prediction, requiring no manual labels.
- Cost Function: Measures prediction accuracy using negative log probabilities, guiding parameter updates (sketched after this list).
- Optimization: Gradient descent navigates a high-dimensional cost surface to minimize error.
- Pretraining: Allows large-scale unsupervised learning before fine-tuning with human feedback.
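A minimal sketch of the cost function mentioned above: the negative log probability assigned to the true next token (toy logits):

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy for one prediction: -log p(true next token)."""
    logits = np.asarray(logits, dtype=float)
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_id])

# The true next token has id 2; a confident wrong vs. a confident right prediction.
print(next_token_loss([3.0, 1.0, 0.2], target_id=2))   # large loss
print(next_token_loss([0.2, 1.0, 3.0], target_id=2))   # small loss
```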
Embedding Space and High Dimensions#
- Semantic Clusters: Similar words cluster together; directions in the space encode relationships (e.g., a gender direction: King - Man + Woman ≈ Queen).
- High Dimensionality: Embedding spaces have thousands of dimensions, enabling distinct representations of complex concepts.
- Scaling Efficiency: High-dimensional spaces allow exponentially more “almost orthogonal” directions for encoding meanings (see the quick check below).
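A quick numerical check of the “almost orthogonal” claim: as the dimension grows, the cosine similarity of random directions concentrates near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (3, 100, 10_000):
    a, b = rng.normal(size=(2, dim))
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"dim={dim:>6}  cosine similarity = {cosine:+.3f}")
```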
Practical Applications#
- Language Models: Effective for chatbots, summarization, and more due to their generality and parallel processing.
- Multimodal Models: Transformers can integrate text, images, and sound by treating all as tokens in a unified framework.
Challenges and Limitations#
- Context Size Limitations: The cost of attention grows quadratically with context size, requiring optimization for large contexts.
- Inference Redundancy: Token-by-token generation can involve redundant computations; caching keys and values mitigates this at inference time (see the sketch below).
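A minimal sketch of key/value caching for a single head: at each generation step only the newest token is projected, and its query attends over the cached keys and values (all matrices and shapes are toy placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

def attend_new_token(x, cached_K, cached_V):
    """Process one new token vector x, reusing cached keys/values."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    cached_K = np.vstack([cached_K, k])           # append this token's key
    cached_V = np.vstack([cached_V, v])           # append this token's value
    scores = q @ cached_K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over past + current tokens
    return weights @ cached_V, cached_K, cached_V

cached_K = np.empty((0, d_head))
cached_V = np.empty((0, d_head))
for _ in range(3):                                # generate three tokens, one at a time
    out, cached_K, cached_V = attend_new_token(rng.normal(size=d_model), cached_K, cached_V)
print(cached_K.shape)                             # (3, 4): one cached key per token so far
```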
Engineering and Design#
- Hardware Optimization: Transformers are designed to exploit GPUs’ parallelism for efficient matrix multiplication.
- Residual Connections: Baked into the architecture to enhance stability and ease of training (see the one-line sketch below).
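A one-line sketch of a residual connection: the sublayer’s output is added to its input instead of replacing it, which keeps a direct path for the signal and its gradients:

```python
import numpy as np

def residual(x, sublayer):
    # Output = input + transformation of the input.
    return x + sublayer(x)

x = np.ones(4)
print(residual(x, lambda v: 0.1 * v))   # [1.1 1.1 1.1 1.1]
```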
The Power of Scale#
- Scaling Laws: Larger models and more data improve performance, often qualitatively.
- Self-Supervised Pretraining: Enables training on vast unlabeled datasets before fine-tuning.
BPE (Byte Pair Encoding)#
BPE is a widely used tokenization method in natural language processing (NLP) and machine learning. It strikes a balance between character-level and whole-word tokenization by representing text as a sequence of subword units. This approach helps models handle rare and unseen words effectively while keeping the vocabulary size manageable.
How BPE Works:#
Start with Characters:
- Initially, every character in the text is treated as a separate token.
Merge Frequent Pairs:
- BPE repeatedly identifies the most frequent pair of adjacent tokens in the training corpus and merges them into a single token. This process is iteratively applied.
- For example:
- Input: low, lower, lowest
- Output Vocabulary: {low_, e, r, s, t}
Build Vocabulary:
- The merging process stops after a predefined number of merges, resulting in a vocabulary of subwords, characters, and some common full words.
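A minimal implementation sketch of the merge loop on the toy corpus above; real tokenizers work on bytes, track merge ranks, and train on far larger corpora:

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dict; return merges and the re-segmented corpus."""
    corpus = {tuple(word): count for word, count in word_counts.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = count
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
print(merges)        # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(list(corpus))  # words re-segmented into the learned subwords
```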