<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Winterstar</title><link>https://winterstar67.github.io/</link><description>Recent content on Winterstar</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 07 Apr 2026 21:08:18 +0900</lastBuildDate><atom:link href="https://winterstar67.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>GQA</title><link>https://winterstar67.github.io/posts/gqa/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://winterstar67.github.io/posts/gqa/</guid><description>&lt;h2 id="the-point-that-nanochat-use-gqa"&gt;The point that nanochat uses GQA&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;GQA is implemented in &lt;code&gt;CausalSelfAttention.forward()&lt;/code&gt; in the &lt;code&gt;gpt.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;The reason for applying GQA is to make training and inference faster than with MHA&lt;/li&gt;
&lt;/ul&gt;
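&lt;p&gt;The idea above can be sketched as a minimal grouped-query attention forward pass in PyTorch. This is a hypothetical illustration, not nanochat&amp;rsquo;s exact code: the tensor sizes are made up, and the combination of &lt;code&gt;repeat_interleave&lt;/code&gt; with &lt;code&gt;F.scaled_dot_product_attention&lt;/code&gt; is one common way to implement the KV-head sharing.&lt;/p&gt;

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes, not nanochat's actual config.
B, T = 2, 8                          # batch, sequence length
n_head, n_kv_head, head_dim = 8, 2, 16

q = torch.randn(B, n_head, T, head_dim)     # one query projection per head
k = torch.randn(B, n_kv_head, T, head_dim)  # fewer K/V heads than query heads
v = torch.randn(B, n_kv_head, T, head_dim)

# Each KV head serves a group of n_head // n_kv_head query heads,
# so repeat the K/V heads along the head dimension to match.
rep = n_head // n_kv_head
k = k.repeat_interleave(rep, dim=1)
v = v.repeat_interleave(rep, dim=1)

y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([2, 8, 8, 16])
```

&lt;p&gt;The speedup comes from projecting and caching only &lt;code&gt;n_kv_head&lt;/code&gt; K/V heads instead of &lt;code&gt;n_head&lt;/code&gt;, while attention quality stays close to MHA.&lt;/p&gt;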
&lt;h2 id="paper-info"&gt;Paper info&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Title&lt;/strong&gt;: &lt;em&gt;GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authors&lt;/strong&gt;: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Venue&lt;/strong&gt;: EMNLP 2023&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;URL&lt;/strong&gt;: &lt;a href="https://arxiv.org/pdf/2305.13245"&gt;https://arxiv.org/pdf/2305.13245&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Length&lt;/strong&gt;: 7 pages&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="0-background-knowledge-to-know"&gt;0. Background Knowledge to know&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Concept of GPU memory bandwidth.
&lt;ul&gt;
&lt;li&gt;What is the effect of a large amount of memory traffic combined with small memory bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Computational complexity of the attention calculation&lt;/li&gt;
&lt;li&gt;KV-cache&lt;/li&gt;
&lt;li&gt;Matrix multiplication with broadcasting (e.g. &lt;code&gt;expand&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Computational efficiency&lt;/li&gt;
&lt;/ol&gt;
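&lt;p&gt;The KV-cache point above can be made concrete with back-of-envelope arithmetic: during decoding, every generated token must read the whole cache, so shrinking the number of KV heads directly cuts memory traffic. The sizes below are hypothetical, chosen only to show the ratio.&lt;/p&gt;

```python
def kv_cache_bytes_per_token(n_layer, n_kv_head, head_dim, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each n_kv_head * head_dim elements,
    # stored at bytes_per_elem bytes (2 for fp16/bf16).
    return 2 * n_layer * n_kv_head * head_dim * bytes_per_elem

# MHA: every query head has its own KV head (8 heads here, hypothetical).
mha = kv_cache_bytes_per_token(n_layer=12, n_kv_head=8, head_dim=64)
# GQA: query heads share 2 KV heads (hypothetical grouping).
gqa = kv_cache_bytes_per_token(n_layer=12, n_kv_head=2, head_dim=64)

print(mha, gqa, mha // gqa)  # 24576 6144 4
```

&lt;p&gt;The cache (and the per-token memory traffic) shrinks by exactly &lt;code&gt;n_head / n_kv_head&lt;/code&gt;, which is why GQA helps most where decoding is bandwidth-bound.&lt;/p&gt;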
&lt;h2 id="1-motivation"&gt;1. Motivation&lt;/h2&gt;
&lt;p&gt;The memory-bandwidth bottleneck has a far larger adverse effect on autoregressive decoders such as GPT than on encoder-only models such as BERT.&lt;/p&gt;</description></item></channel></rss>