<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Winterstar</title><link>https://winterstar67.github.io/</link><description>Recent content on Winterstar</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 07 Apr 2026 21:08:18 +0900</lastBuildDate><atom:link href="https://winterstar67.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>GQA</title><link>https://winterstar67.github.io/posts/gqa/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://winterstar67.github.io/posts/gqa/</guid><description>&lt;h2 id="the-point-that-nanochat-use-gqa"&gt;The point that nanochat uses GQA&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;GQA is implemented in &lt;code&gt;CausalSelfAttention.forward()&lt;/code&gt; in the &lt;code&gt;gpt.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;The reason for applying GQA is to make training and inference faster than with MHA&lt;/li&gt;
&lt;/ul&gt;
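&lt;p&gt;The idea above can be sketched as a minimal grouped-query attention forward pass in PyTorch. This is a hypothetical illustration, not nanochat&amp;rsquo;s exact code: the tensor sizes are made up, and the combination of &lt;code&gt;repeat_interleave&lt;/code&gt; with &lt;code&gt;F.scaled_dot_product_attention&lt;/code&gt; is one common way to implement the KV-head sharing.&lt;/p&gt;

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes, not nanochat's actual config.
B, T = 2, 8                          # batch, sequence length
n_head, n_kv_head, head_dim = 8, 2, 16

q = torch.randn(B, n_head, T, head_dim)     # one query projection per head
k = torch.randn(B, n_kv_head, T, head_dim)  # fewer K/V heads than query heads
v = torch.randn(B, n_kv_head, T, head_dim)

# Each KV head serves a group of n_head // n_kv_head query heads,
# so repeat the K/V heads along the head dimension to match.
rep = n_head // n_kv_head
k = k.repeat_interleave(rep, dim=1)
v = v.repeat_interleave(rep, dim=1)

y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([2, 8, 8, 16])
```

&lt;p&gt;The speedup comes from projecting and caching only &lt;code&gt;n_kv_head&lt;/code&gt; K/V heads instead of &lt;code&gt;n_head&lt;/code&gt;, while attention quality stays close to MHA.&lt;/p&gt;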
&lt;h2 id="paper-info"&gt;Paper info&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Title&lt;/strong&gt;: &lt;em&gt;GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authors&lt;/strong&gt;: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Venue&lt;/strong&gt;: EMNLP 2023&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;URL&lt;/strong&gt;: &lt;a href="https://arxiv.org/pdf/2305.13245"&gt;https://arxiv.org/pdf/2305.13245&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Length&lt;/strong&gt;: 7 pages&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="0-background-knowledge-to-know"&gt;0. Background Knowledge to know&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Concept of GPU memory bandwidth.
&lt;ul&gt;
&lt;li&gt;What is the effect of a large amount of memory traffic combined with small memory bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Computational complexity of the attention calculation&lt;/li&gt;
&lt;li&gt;KV-cache&lt;/li&gt;
&lt;li&gt;Matrix multiplication with broadcasting (e.g. &lt;code&gt;expand&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Computational efficiency&lt;/li&gt;
&lt;/ol&gt;
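&lt;p&gt;The KV-cache point above can be made concrete with back-of-envelope arithmetic: during decoding, every generated token must read the whole cache, so shrinking the number of KV heads directly cuts memory traffic. The sizes below are hypothetical, chosen only to show the ratio.&lt;/p&gt;

```python
def kv_cache_bytes_per_token(n_layer, n_kv_head, head_dim, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each n_kv_head * head_dim elements,
    # stored at bytes_per_elem bytes (2 for fp16/bf16).
    return 2 * n_layer * n_kv_head * head_dim * bytes_per_elem

# MHA: every query head has its own KV head (8 heads here, hypothetical).
mha = kv_cache_bytes_per_token(n_layer=12, n_kv_head=8, head_dim=64)
# GQA: query heads share 2 KV heads (hypothetical grouping).
gqa = kv_cache_bytes_per_token(n_layer=12, n_kv_head=2, head_dim=64)

print(mha, gqa, mha // gqa)  # 24576 6144 4
```

&lt;p&gt;The cache (and the per-token memory traffic) shrinks by exactly &lt;code&gt;n_head / n_kv_head&lt;/code&gt;, which is why GQA helps most where decoding is bandwidth-bound.&lt;/p&gt;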
&lt;h2 id="1-motivation"&gt;1. Motivation&lt;/h2&gt;
&lt;p&gt;The memory-bandwidth bottleneck has a far larger adverse effect on autoregressive decoders such as GPT than on encoder-only models such as BERT.&lt;/p&gt;</description></item></channel></rss>