<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Engineer's Digest: LLM Inference Engineering]]></title><description><![CDATA[The handbook chapters, explained in plain English with real numbers.

Each post covers one module — from why LLM inference costs 100x more than traditional ML, to PagedAttention, speculative decoding, disaggregated serving, and production war stories. Written for engineers who ship models to production, not researchers who study them.

Companion repo: github.com/harshuljain13/llm-inference-at-scale]]></description><link>https://harshuljain.substack.com/s/llm-inference-engineering</link><image><url>https://substackcdn.com/image/fetch/$s_!Tssn!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a21435-8c5b-4975-9db7-20292a727543_1280x1280.png</url><title>The Engineer&apos;s Digest: LLM Inference Engineering</title><link>https://harshuljain.substack.com/s/llm-inference-engineering</link></image><generator>Substack</generator><lastBuildDate>Mon, 22 Jun 2026 17:25:46 GMT</lastBuildDate><atom:link href="https://harshuljain.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Harshul Jain & Tanya Sah]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[harshuljain@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[harshuljain@substack.com]]></itunes:email><itunes:name><![CDATA[Harshul Jain]]></itunes:name></itunes:owner><itunes:author><![CDATA[Harshul Jain]]></itunes:author><googleplay:owner><![CDATA[harshuljain@substack.com]]></googleplay:owner><googleplay:email><![CDATA[harshuljain@substack.com]]></googleplay:email><googleplay:author><![CDATA[Harshul Jain]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Issue # 1 - Why LLM Inference is Different ?]]></title><description><![CDATA[LLM Inference Engineering - The costs, bottlenecks, and physics that make LLM serving a fundamentally different problem.]]></description><link>https://harshuljain.substack.com/p/issue-1-why-llm-inference-is-different</link><guid isPermaLink="false">https://harshuljain.substack.com/p/issue-1-why-llm-inference-is-different</guid><dc:creator><![CDATA[Harshul Jain]]></dc:creator><pubDate>Thu, 28 May 2026 11:45:59 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!ncHm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Chapter 1 of an open-source handbook I'm writing on LLM inference at scale (<a href="http://github.com/harshuljain13/llm-inference-at-scale">GitHub</a>). If you're building serving infrastructure, this is for you.</em></p><div><hr></div><h2><strong>Introduction</strong></h2><p>When I started working on LLM inference, I assumed it would be like regular ML inference, just with bigger models. I was wrong! Everything I knew about ML inference either didn&#8217;t apply or actively misled me. The cost structure is different. The bottlenecks are different. The scaling behaviour is different. And the optimisation levers are completely different.</p><h2>The 100x Problem</h2><p>Let&#8217;s start with the number that should bother you.</p><p>Traditional ML inference, for models like ResNet, BERT, or a recommendation model, costs roughly <strong>$0.001 per request</strong>. Latency is 5&#8211;20ms. Memory usage is fixed. Batching is trivial. Scaling is linear. It&#8217;s a solved problem.</p><p>LLM inference costs <strong>$0.01&#8211;0.10 per request</strong>. Latency swings between 100ms and 10 seconds depending on output length. Memory grows during the request. Batching requires a whole new paradigm. 
Scaling is sub-linear and communication-bound.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ncHm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ncHm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!ncHm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!ncHm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!ncHm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ncHm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:747797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://harshuljain.substack.com/i/199584795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ncHm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!ncHm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!ncHm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!ncHm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83f148b-9b6e-4483-b157-de8dbd70ae8e_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;">The gap isn&#8217;t 2x or 5x. It&#8217;s <strong>10x to 100x</strong>. And the reasons are fundamental, not something better software alone will fix.</p><h2>Why LLMs Are Built Differently</h2><p>The core difference comes down to one word: <strong>autoregressive</strong>.</p><p>Traditional ML inference is a single forward pass. You feed an image into ResNet, data flows through the network once, and you get a classification. The time is fixed. The memory is constant. You can batch 100 requests and the cost per request drops.</p><p>LLMs work completely differently.</p><p>When you ask &#8220;What is the capital of France?&#8221;, the model doesn&#8217;t produce the full answer at once. 
It generates one token at a time:</p><p><strong>&#8220;The&#8221; &#8594; &#8220;capital&#8221; &#8594; &#8220;of&#8221; &#8594; &#8220;France&#8221; &#8594; &#8220;is&#8221; &#8594; &#8220;Paris&#8221;</strong></p><p><strong>Each token is a separate forward pass through the entire model</strong>. Token 5 cannot be generated until tokens 1&#8211;4 exist. That&#8217;s not an engineering limitation to be solved; it&#8217;s how autoregressive language models work by design. The probability distribution for each token depends on every token that came before it.</p><h2>The Weight Reading Problem</h2><p>Here&#8217;s the number that made everything click for me.</p><p>Every time the model generates a token, it needs to run a full forward pass. A full forward pass means reading <strong>all</strong> of the model&#8217;s parameters from GPU memory into the compute units. The <strong>GPU doesn&#8217;t &#8220;remember&#8221; its weights between operations</strong>: its on-chip caches are far too small to hold them, so every matrix multiply loads its weight matrix fresh from memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bjk7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bjk7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Bjk7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!Bjk7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Bjk7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bjk7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:806461,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://harshuljain.substack.com/i/199584795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bjk7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 424w, 
https://substackcdn.com/image/fetch/$s_!Bjk7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Bjk7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Bjk7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90070243-43f2-486a-b247-9dc3c10a376d_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For Llama 3.1 8B in FP16 generating 100 tokens, that looks like this:</p><ul><li><p>Token 1 &#8594; read 16 GB of weights</p></li><li><p>Token 2 &#8594; read 16 GB of weights again</p></li><li><p>Token 3 &#8594; read 16 GB of weights again</p></li><li><p>&#8230; and so on for every token</p></li><li><p><strong>Total memory reads: 16 GB &#215; 100 = 1.6 TB</strong></p></li></ul><p>And this leads directly to a hard physical ceiling that no software can break.</p><h2>The Memory Bandwidth Wall</h2><p>An A100 80GB GPU has roughly 2 TB/s of memory bandwidth. Llama 3.1 8B in FP16 weighs about 16 GB (8 billion parameters &#215; 2 bytes each). So the minimum time to generate one token for a single request is:</p><p><strong>16 GB &#247; 2 TB/s = 8ms &#8594; maximum 125 tokens/second</strong></p><p>That&#8217;s a hard ceiling for a single request. Not a guideline. Not something you can optimise past with clever code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eXrp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eXrp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!eXrp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!eXrp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!eXrp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eXrp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:677504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://harshuljain.substack.com/i/199584795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eXrp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 424w, 
https://substackcdn.com/image/fetch/$s_!eXrp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!eXrp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!eXrp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c918d99-ffbb-40d9-b965-3dbf49977dce_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The only ways around this wall are:</p><ul><li><p><strong>Quantization</strong>: smaller weights mean fewer bytes to read</p></li><li><p><strong>Better hardware</strong>: more memory bandwidth, not more compute</p></li><li><p><strong>Speculative decoding</strong>: generate multiple tokens per weight read, breaking the sequential bottleneck</p></li></ul><p>Notice what&#8217;s not on that list: faster software, smarter scheduling, more efficient attention. Those help with other things, but they don&#8217;t move this ceiling.</p><h2>Two Phases, Two Completely Different Problems</h2><p>Every LLM request goes through two distinct stages, and they could not be more different.</p><p><strong>Prefill</strong> is when the model processes your entire prompt. All tokens are processed in parallel in a single forward pass. This phase is <strong>compute-bound</strong>: the GPU is doing massive parallel matrix multiplications and its cores are fully utilised. A 1000-token prompt takes roughly 50ms.</p><p><strong>Decode</strong> is when the model generates your response, one token at a time. Each token is its own forward pass. 
This phase is <strong>memory-bound</strong>: the GPU spends most of its time waiting for data to arrive from memory, not actually computing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W0dO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W0dO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!W0dO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!W0dO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!W0dO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W0dO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:744884,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://harshuljain.substack.com/i/199584795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W0dO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!W0dO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!W0dO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!W0dO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7384cd10-4050-413d-bf6d-1cce7ff1120c_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here&#8217;s how idle the GPU actually is during decode:</p><ul><li><p>What the A100 can do: <strong>312 TFLOPS</strong> (peak FP16)</p></li><li><p>What one decode step needs: <strong>~16 GFLOP</strong> (about 2 FLOPs per parameter &#215; 8 billion parameters)</p></li><li><p>What it actually sustains at one token every 8ms: <strong>16 GFLOP &#247; 8ms &#8776; 2 TFLOPS</strong></p></li><li><p>GPU utilisation during decode: <strong>2 &#247; 312 &#8776; 0.6%</strong></p></li></ul><p><strong>The GPU&#8217;s compute units are more than 99% idle while generating your tokens.</strong></p><p>This is why throwing a faster GPU at a decode-bound system barely helps: memory bandwidth has grown much more slowly than raw compute between GPU generations.</p><p>Here&#8217;s the part that surprises most engineers the first time they see it.</p><p>Take a request with a 1000-token prompt generating 100 tokens of output:</p><p><strong>Prefill:</strong> process 1000 tokens in one pass &#8594; ~50ms</p><p><strong>Decode:</strong> 100 passes &#215; 8ms each 
&#8594; ~800ms</p><p><strong>Total: ~850ms. Decode is 94% of the time despite processing 10x fewer tokens.</strong></p><p>Decode dominates even though it does less work. This single asymmetry drives almost every architectural decision in LLM inference: continuous batching, disaggregated serving, speculative decoding, KV cache management. All of it traces back to this.</p><h2>Why This Changes How You Think About Optimization</h2><p>Once you internalise these two constraints, the memory bandwidth wall and the prefill/decode asymmetry, decisions that seemed arbitrary start making sense.</p><p><strong>Batching helps decode</strong> because processing multiple requests together amortises the weight reads across all of them, increasing arithmetic intensity and moving you toward better GPU utilisation.</p><p><strong>Quantization helps</strong> because INT8 weights are half the size of FP16 weights, so there are fewer bytes to read per token and decode is faster.</p><p><strong>Speculative decoding</strong> is clever because a small draft model proposes multiple tokens, which the large model verifies in parallel, effectively generating multiple tokens per read of the large model&#8217;s weights.</p><p><strong>Disaggregated serving</strong> exists because prefill and decode have completely different hardware profiles. 
Running them on the same GPU means compromising on both.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WCm6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WCm6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!WCm6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!WCm6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!WCm6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WCm6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:710075,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://harshuljain.substack.com/i/199584795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WCm6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!WCm6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!WCm6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!WCm6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F024cefe1-7adc-4690-ba62-1acfe2d00f2d_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>All roads lead back to the same place: <strong>memory bandwidth is the wall, and every technique is either working within it or around it.</strong></p><h2>What&#8217;s Next</h2><p>This was the foundation. 
The next chapter goes one level deeper.</p><p><strong>Module 0.2: Transformer Inference Mechanics</strong> is a byte-level walkthrough of how attention actually works during inference, the KV cache math, why grouped-query attention exists, and concrete memory access patterns with real numbers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!87Cp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!87Cp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 424w, https://substackcdn.com/image/fetch/$s_!87Cp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 848w, https://substackcdn.com/image/fetch/$s_!87Cp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 1272w, https://substackcdn.com/image/fetch/$s_!87Cp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!87Cp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png" width="1024" height="444" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:748086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://harshuljain.substack.com/i/199584795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd74733-66c3-4922-b4e2-182e6aef7538_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!87Cp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 424w, https://substackcdn.com/image/fetch/$s_!87Cp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 848w, https://substackcdn.com/image/fetch/$s_!87Cp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 1272w, https://substackcdn.com/image/fetch/$s_!87Cp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48446011-4910-4841-9ede-b6a4a4a861c8_1024x444.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p><em>I&#8217;m writing this handbook in public at <a href="http://github.com/harshuljain13/llm-inference-at-scale">github.com/harshuljain13/llm-inference-at-scale</a>. If something is wrong, open an issue. If it&#8217;s useful, a &#11088; helps others find it.</em></p><p><em>If a colleague is building LLM serving infrastructure, forward this their way.</em></p>]]></content:encoded></item></channel></rss>