<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Engineering Unpacked]]></title><description><![CDATA[All you need to know to build practical AI applications]]></description><link>https://www.aiunpacked.net</link><image><url>https://substackcdn.com/image/fetch/$s_!t2NK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c08ffd-a9d2-4665-9b4a-0a674ad12c4b_1024x1024.png</url><title>AI Engineering Unpacked</title><link>https://www.aiunpacked.net</link></image><generator>Substack</generator><lastBuildDate>Tue, 14 Apr 2026 03:05:33 GMT</lastBuildDate><atom:link href="https://www.aiunpacked.net/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Maxym Muzychenko]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aiengineeringunpacked@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aiengineeringunpacked@substack.com]]></itunes:email><itunes:name><![CDATA[Max]]></itunes:name></itunes:owner><itunes:author><![CDATA[Max]]></itunes:author><googleplay:owner><![CDATA[aiengineeringunpacked@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aiengineeringunpacked@substack.com]]></googleplay:email><googleplay:author><![CDATA[Max]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Quantization Made Simple: How to Run Big Models on Small Hardware?]]></title><description><![CDATA[Learn what quantization is and how it works]]></description><link>https://www.aiunpacked.net/p/quantization-made-simple-how-to-run</link><guid isPermaLink="false">https://www.aiunpacked.net/p/quantization-made-simple-how-to-run</guid><dc:creator><![CDATA[Max]]></dc:creator><pubDate>Tue, 28 Oct 2025 13:07:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!i-Ri!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I worked in the healthcare domain, we faced a problem that probably sounds familiar to many of you. We needed to deploy a Large Language Model (LLM), but because of data privacy, everything had to stay on our client&#8217;s hardware. No cloud APIs. No external servers. Just us and their single GPU with 16GB of memory. Our specialized LLM had 8 billion parameters. The math was simple and brutal. <strong>It wouldn&#8217;t fit.</strong></p><p>Through a technique called <em>quantization</em>, we managed to run that model smoothly on hardware that should have been too small. This post will help you understand what makes LLMs so demanding on memory, what quantization actually does to solve this problem, and how it manages to <strong>shrink models without breaking them</strong>. So let&#8217;s get into it!</p><div class="pullquote"><p>Before continuing, take a look at this article to get a better understanding of how LLMs work.</p><p><a href="https://www.aiunpacked.net/p/large-language-models-explained">Large Language Models Explained</a></p></div><h2>Why You Should Care About This</h2><p>LLMs are getting absurdly large. Some models now have hundreds of billions of parameters, with the largest reaching into the trillions. Even the &#8220;small&#8221; 7-billion-parameter models need significant hardware to run. This creates real problems! Renting GPUs with enough memory <strong>gets expensive fast</strong>. Not everyone can or wants to use cloud APIs. Developers want to run models locally on their laptops. Like our healthcare case, <strong>some data simply cannot leave the building</strong> due to privacy requirements.</p><p><strong>Quantization</strong> offers a <strong>solution</strong>. This technique can cut your memory requirements in half or even to a quarter with barely any performance loss. That 16GB model can run on 8GB, sometimes even 4GB.</p><div class="pullquote"><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you&#8217;re interested in learning more about LLMs, subscribe for free to get practical guides like this every week</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div></div><h2>The Big Picture: What is Quantization?</h2><p>Before we dive into the mechanics, let me give you an intuitive understanding of what we&#8217;re trying to achieve. Think about photos on your phone. You could store every picture in maximum quality RAW format, but that&#8217;s impractical. Instead, your phone compresses them to JPG. The files are 10x smaller, yet <strong>you barely notice the difference</strong> when viewing them.</p><p>Quantization does the same thing for LLMs. In simple terms:</p><blockquote><p><em>Quantization reduces the precision of the numbers that make up your model, making it smaller while maintaining its performance.</em></p></blockquote><p>It&#8217;s a compression technique, but instead of compressing pixels, we&#8217;re compressing the mathematical weights that power the model.</p><h2>How Numbers Work in LLMs</h2><p>To understand quantization, you need to know just one thing. Everything in a neural network comes down to numbers, billions of them. These numbers are called parameters or weights, and they represent what the model learned during training. They determine how the model processes your input and generates output. Each number is stored in computer memory using bits, which are just 0s and 1s. The <strong>more bits</strong> you use, the <strong>more precise</strong> the number becomes, but it also consumes <strong>more memory</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Yp-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Yp-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 424w, https://substackcdn.com/image/fetch/$s_!1Yp-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 848w, https://substackcdn.com/image/fetch/$s_!1Yp-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 1272w, https://substackcdn.com/image/fetch/$s_!1Yp-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Yp-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png" width="1456" height="591" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:591,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503288,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/174493254?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Yp-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 424w, https://substackcdn.com/image/fetch/$s_!1Yp-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 848w, https://substackcdn.com/image/fetch/$s_!1Yp-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 1272w, https://substackcdn.com/image/fetch/$s_!1Yp-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d8f5e7-a92d-4717-ada2-44aabdc9a5fb_1606x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Modern LLMs typically use three different precision levels. The first is 16 bits (2 bytes), which is the standard training precision for most models. The second is 8 bits (1 byte), which is a common quantization target that provides <strong><a href="https://arxiv.org/abs/2211.10438?utm_source=chatgpt.com">50% memory reduction </a></strong><a href="https://arxiv.org/abs/2211.10438?utm_source=chatgpt.com">and </a><strong><a href="https://arxiv.org/abs/2211.10438?utm_source=chatgpt.com">1.56x speed up</a></strong>. The third is 4 bits (0.5 bytes), which is a more aggressive quantization that provides <strong>75% memory reduction</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i-Ri!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i-Ri!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 424w, https://substackcdn.com/image/fetch/$s_!i-Ri!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 848w, https://substackcdn.com/image/fetch/$s_!i-Ri!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 1272w, https://substackcdn.com/image/fetch/$s_!i-Ri!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i-Ri!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png" width="1456" height="742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:407946,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/174493254?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i-Ri!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 424w, https://substackcdn.com/image/fetch/$s_!i-Ri!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 848w, https://substackcdn.com/image/fetch/$s_!i-Ri!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 1272w, https://substackcdn.com/image/fetch/$s_!i-Ri!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c807f5e-a5fd-430a-86f6-ace988c461e7_1510x770.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Number &#8220;33&#8221; represented with 8 bits</figcaption></figure></div><h2>The Memory Math Made Simple</h2><p>To understand how much memory is required to run an LLM, this is <strong>the most important formula</strong> you&#8217;ll need.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KadH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KadH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 424w, https://substackcdn.com/image/fetch/$s_!KadH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 848w, https://substackcdn.com/image/fetch/$s_!KadH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 1272w, https://substackcdn.com/image/fetch/$s_!KadH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KadH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png" width="728" height="152.0321285140562" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1743,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:47792,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/174493254?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F987e646e-2097-4ddd-ba45-686e588f4e80_1858x364.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KadH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 424w, https://substackcdn.com/image/fetch/$s_!KadH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 848w, https://substackcdn.com/image/fetch/$s_!KadH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 1272w, https://substackcdn.com/image/fetch/$s_!KadH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ad64e8-f615-4d59-900e-179810d42f77_1743x364.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Let&#8217;s apply this to a real example with Llama 2 7B. With 16-bit precision, you need 7 billion parameters multiplied by 2 bytes, which equals 14 GB. With 8-bit quantization, you need 7 billion parameters multiplied by 1 byte, which equals 7 GB. With 4-bit quantization, you just need 3.5 GB. <strong>Same model</strong>, drastically <strong>different memory footprint</strong>.</p><p>During inference, when the model is generating text, you need extra memory for something called KV-cache. This cache stores context from the conversation.</p><blockquote><p><em>The amount of extra memory depends on size of your context window.</em></p></blockquote><p>Larger context windows, like 8K or 32K tokens, need significantly more memory than smaller ones like 2K or 4K tokens. For a 7B model in 8-bit with a typical 4K context window, you should plan for around 9GB of VRAM. If you&#8217;re tight on VRAM, you can reduce the context window to make the model fit.</p><h2>How Quantization Actually Works</h2><p>Now let&#8217;s peek under the hood and see what&#8217;s actually happening when we quantize a model. I promise to keep it simple, but understanding this will help you make better decisions about when and how to use quantization.</p><blockquote><p><em>The core idea is that we&#8217;re mapping high-precision numbers to low-precision numbers.</em></p></blockquote><p>Imagine you have a thermometer that measures temperature to the tenth of a degree, showing readings like 68.4&#176;F, 68.8&#176;F, and 69.9&#176;F. Quantization is like switching to a thermometer that only shows whole numbers like 68&#176;F, 69&#176;F, and 70&#176;F. You lose some detail, but you still get useful information.</p><h2>A Simple Example</h2><p>Let me show you how this works with a concrete example. Let&#8217;s say we want to quantize the number 33 from 8-bit to 4-bit representation.</p><p>In 8-bit space, numbers range from -128 to 127, giving us <strong>256 possible values</strong>. In 4-bit space, numbers range from -8 to 7, giving us only <strong>16 possible values</strong>. To convert between them, we need a scale factor.</p><p>The scale factor is calculated by dividing 256 by 16, which gives us 16. Now we can quantize our number. We take 33 and divide it by 16, which gives us 2.0625. After rounding, we get 2.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g71i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g71i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 424w, https://substackcdn.com/image/fetch/$s_!g71i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 848w, https://substackcdn.com/image/fetch/$s_!g71i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 1272w, https://substackcdn.com/image/fetch/$s_!g71i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g71i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png" width="1388" height="576" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1388,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/174493254?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g71i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 424w, https://substackcdn.com/image/fetch/$s_!g71i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 848w, https://substackcdn.com/image/fetch/$s_!g71i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 1272w, https://substackcdn.com/image/fetch/$s_!g71i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8adccf-4598-41e9-b490-b9c000303134_1388x576.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How numbers are converted from 8 to 4 bits representation</figcaption></figure></div><p>So the number 33 in 8-bit becomes 2 in 4-bit. When we need to use it again, we scale it back up by multiplying 2 by 16, which gives us 32. We lost a tiny bit of precision because 33 became 32, but <strong>we saved 50% of the memory</strong>.</p><p>This process happens for every single weight in the model, billions of times over. The accumulated small losses in precision are what lead to that minimal performance degradation I mentioned earlier.</p><blockquote><p><em>To learn how different quantization techniques work in more detail, I recommend reading <a href="http://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization">this article by Maarten Grootendorst</a>.</em></p></blockquote><h2>Why This Doesn&#8217;t Break Your Model</h2><p>You might be wondering why losing precision on billions of numbers doesn&#8217;t make the model terrible. The answer lies in how neural networks actually work.</p><blockquote><p><em>LLMs are surprisingly robust to small amounts of noise.</em></p></blockquote><p>They are extremely well optimized during training, so that they actually learn to be noise-resistant. This built-in resilience is what makes quantization possible <strong>without destroying performance</strong>.</p><p>Additionally, researchers use clever techniques to <strong>minimize the impact</strong>. Asymmetric quantization adjusts the mapping to better fit the actual distribution of weights. Per-channel quantization uses different scale factors for different parts of the model. Mixed precision keeps critical layers in higher precision while quantizing others more aggressively.</p><p>You don&#8217;t need to implement these techniques yourself because they&#8217;re built into modern quantization tools.</p><h2>Common Quantization Formats</h2><p>When you go looking for quantized models, you&#8217;ll see several formats. Understanding what they mean will help you choose the right one for your needs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BRwc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BRwc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 424w, https://substackcdn.com/image/fetch/$s_!BRwc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 848w, https://substackcdn.com/image/fetch/$s_!BRwc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 1272w, https://substackcdn.com/image/fetch/$s_!BRwc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BRwc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png" width="1456" height="574" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:773484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/174493254?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BRwc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 424w, https://substackcdn.com/image/fetch/$s_!BRwc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 848w, https://substackcdn.com/image/fetch/$s_!BRwc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 1272w, https://substackcdn.com/image/fetch/$s_!BRwc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d448f0d-e924-4b35-bab2-360c9f4c59ab_2436x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">source <a href="https://huggingface.co/docs/transformers/main/quantization/selecting">Hugging Face</a></figcaption></figure></div><p>INT8 and Q8_0 refer to 8-bit integer quantization. This format provides <strong>50% memory reduction</strong> <a href="https://developers.redhat.com/articles/2024/10/17/we-ran-over-half-million-evaluations-quantized-llms#">with 99% or better performance retention</a>. It&#8217;s best for production deployments where you want maximum safety and reliability.</p><p><a href="https://arxiv.org/abs/2210.17323">GPTQ</a> is a 4-bit quantization method that provides <strong>75% memory reduction</strong> with <a href="https://arxiv.org/abs/2411.02355">98% performance retention</a>. It&#8217;s optimized for GPU inference and works best when you&#8217;re trying to run larger models on consumer hardware.</p><p><a href="https://huggingface.co/docs/hub/gguf">GGUF</a> (formerly called GGML) is a flexible quantization format that supports anywhere from 2 to 8 bits. You&#8217;ll see various formats like Q4_K_M, Q5_K_S, and Q8_0. This format is optimized for CPU and Apple Silicon inference and powers popular tools like <a href="https://ollama.com/">Ollama</a> and <a href="https://lmstudio.ai/">LM Studio</a>.</p><div class="pullquote"><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/p/quantization-made-simple-how-to-run?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Share this with a friend who&#8217;s curious about <a href="http://aiunpacked.net/p/what-is-ai-engineering">AI Engineering</a></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/p/quantization-made-simple-how-to-run?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.aiunpacked.net/p/quantization-made-simple-how-to-run?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div></div><h2>Performance Expectations</h2><p>Different quantization levels give you different trade-offs between size and quality. With 8-bit quantization using INT8, you&#8217;ll barely notice any difference. I mean it when I say <a href="https://arxiv.org/abs/2208.07339">the performance is virtually identical to the original model</a>.</p><p>With 4-bit quantization like Q4, you might see a slight quality reduction in very specific edge cases, but most users won&#8217;t notice in typical usage. With 3-bit or lower quantization, you&#8217;ll see noticeable quality degradation, so only use these formats if you&#8217;re desperate for memory.</p><blockquote><p><em>The sweet spot for most people is 8-bit for critical production use and 4-bit for experimentation and local development.</em></p></blockquote><h2>Your Action Plan</h2><h4>Rule #1: Always Use 8-bit When Running Locally</h4><p>If you&#8217;re deploying an LLM on your own, there&#8217;s no reason not to use 8-bit quantization. The performance difference is negligible, and you&#8217;ll save 50% on memory costs. It&#8217;s essentially free optimization.</p><h4>Rule #2: Calculate Before You Download</h4><p>Before pulling a model, you should check whether it&#8217;ll actually fit on your hardware. First, find the parameter count, which is usually in the model name like &#8220;Llama-2-7b&#8221; or &#8220;Mistral-7B&#8221;. Next, decide on your quantization level. Then apply the formula I showed you earlier. Finally, add a 20% buffer for KV-cache to be safe.</p><p>Quick Reference Table:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNb_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FNb_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 424w, https://substackcdn.com/image/fetch/$s_!FNb_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 848w, https://substackcdn.com/image/fetch/$s_!FNb_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!FNb_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FNb_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png" width="546" height="350.25" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:934,&quot;width&quot;:1456,&quot;resizeWidth&quot;:546,&quot;bytes&quot;:698025,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/174493254?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FNb_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 424w, https://substackcdn.com/image/fetch/$s_!FNb_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 848w, https://substackcdn.com/image/fetch/$s_!FNb_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!FNb_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ed49518-8bff-4f84-9e63-ab1e3f3f5f08_1802x1156.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These estimates assume a 4K token context window. Larger context windows (8K, 32K, etc.) will require additional memory. If you&#8217;re constrained by VRAM, you can reduce the context window to fit your hardware.</p><h4>Rule #3: Where to Find Quantized Models</h4><p>You have two main options for getting quantized models.</p><p><strong>Option 1</strong> is using pre-quantized models on Hugging Face. Most popular models already have pre-quantized versions available. You can search for the model name plus &#8220;GPTQ&#8221; if you need GPU inference, or the model name plus &#8220;GGUF&#8221; if you need CPU or Mac inference. For example, instead of searching for &#8220;meta-llama/Llama-2-7b-hf&#8221;, you would search for &#8220;TheBloke/Llama-2-7B-GPTQ&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-2sA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-2sA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 424w, https://substackcdn.com/image/fetch/$s_!-2sA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 848w, https://substackcdn.com/image/fetch/$s_!-2sA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 1272w, https://substackcdn.com/image/fetch/$s_!-2sA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-2sA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png" width="1442" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1442,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:217114,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/174493254?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-2sA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 424w, https://substackcdn.com/image/fetch/$s_!-2sA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 848w, https://substackcdn.com/image/fetch/$s_!-2sA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 1272w, https://substackcdn.com/image/fetch/$s_!-2sA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231723cb-98dd-4dcb-a953-6e9d6aa7cb23_1442x450.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">You can find quantizations on the model page</figcaption></figure></div><p><strong>Option 2</strong> is quantizing the model yourself if you have a custom model or can&#8217;t find what you need. For GPTQ format, you can use the AutoGPTQ library. For GGUF format, you can use the llama.cpp conversion tools. For general quantization, you can use <a href="https://github.com/vllm-project/llm-compressor">llm-compressor</a> by <a href="https://github.com/vllm-project/vllm">vLLM</a>.</p><p>Most tools require just a single command to quantize your model:</p><pre><code># Example with llm-compressor
llmcompressor quantize your-model --format int8</code></pre><h4>Rule #4: Test Before You Commit</h4><p>Before deploying a quantized model, you should run your specific use cases through it. Create a <strong>small test set</strong> that includes typical queries you expect, edge cases that matter to your application, and quality metrics you care about.</p><p>Compare the quantized version against the original. In most cases with 8-bit, you&#8217;ll see identical results. With 4-bit, you might see tiny differences that you need to evaluate for your use case.</p><h4>Wrapping up</h4><p>Quantization isn&#8217;t a hack or a workaround. It&#8217;s a fundamental technique that makes LLMs accessible. It&#8217;s the reason you can run powerful models on consumer hardware. It&#8217;s why small teams can compete with big labs on deployment. It&#8217;s how that healthcare project actually shipped.</p><blockquote><p><em><a href="https://developers.redhat.com/articles/2024/10/17/we-ran-over-half-million-evaluations-quantized-llms#why_quantization_is_here_to_stay">&#8220;Quantization is an essential tool for optimizing LLMs in real-world deployments.&#8221;</a></em></p></blockquote><p>There are a few key points you should remember:</p><ul><li><p>First, <strong>8-bit quantization is practically free performance-wise</strong>, so use it by default.</p></li><li><p>Second, memory needed equals the billions of parameters multiplied by bytes per weight, multiplied by 1.2 for safety.</p></li><li><p>Third, most quantized models are pre-made and ready to download.</p></li><li><p>Fourth, when in doubt, try it because you can always go back to higher precision if needed.</p></li></ul><blockquote><p><em>The world of AI is moving fast, but it&#8217;s also becoming more accessible.</em></p></blockquote><p>You don&#8217;t need a server farm to run state-of-the-art models anymore. You just need to know how to make them fit.</p><p>Now go make that model run on your hardware.</p><div><hr></div><p><em>Have questions about quantization or want to share your own deployment story? Comment below and I will respond to every question.</em></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Sampling in Large Language Models]]></title><description><![CDATA[or How LLMs get creative]]></description><link>https://www.aiunpacked.net/p/sampling-in-large-language-models</link><guid isPermaLink="false">https://www.aiunpacked.net/p/sampling-in-large-language-models</guid><dc:creator><![CDATA[Max]]></dc:creator><pubDate>Thu, 18 Sep 2025 14:01:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3fGV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, I had a technical interview with one of the few companies that build Large Language Models (LLMs). During the interview, I was asked about <strong>sampling in LLMs</strong>: the strategies that exist, why they are needed, how they work, and even to implement some of them. Thanks to the knowledge I&#8217;ve built over my career, I handled the interview confidently. Today, I&#8217;ll share <strong>everything you need to know about sampling</strong>. Whether you&#8217;re an AI engineer or an enthusiast, this overview will give you the fundamentals needed to better understand and work with these models.</p><p>You can expect to get through this issue in about <strong>6 minutes</strong>.</p><h2>What is sampling and why do we need it?</h2><p>Any artificial neural network (including an LLM) is just an <strong>extremely complex mathematical formula</strong>. Which means, the output is just a product of inputs and some static matrices. In other words, given the same input, LLM produces the same output every time.</p><p>This behaviour is fine for most of the applications: for example when we want our model to predict if an email is spam, we expect the same prediction for the same email every time. But that&#8217;s not the case for LLMs, where we often want it to be more &#8220;creative&#8221; and generate <strong>different responses</strong> each time we say &#8220;Hello&#8221;.</p><p>So how come LLMs generate each time a different answer?</p><h2>Sampling Strategies</h2><p>To answer this question, we first need to understand how these models work. I have a whole <strong><a href="https://www.aiunpacked.net/p/large-language-models-explained">issue explaining how LLMs work</a></strong>. The key thing to understand is that at each generation step, the LLM&#8217;s final layer assigns a score to every word <em>(token)</em> in its vocabulary. These numbers reflect how likely the model thinks each word should come next.</p><blockquote><p><em>At each step, an LLM predicts a logit (number) for every possible next token.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3fGV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3fGV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 424w, https://substackcdn.com/image/fetch/$s_!3fGV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 848w, https://substackcdn.com/image/fetch/$s_!3fGV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 1272w, https://substackcdn.com/image/fetch/$s_!3fGV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3fGV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png" width="1456" height="483" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141763,&quot;alt&quot;:&quot;\&quot;The sky is...\&quot; being input into an LLM. The LLM predicting possible completions with raw scores (logits):\&quot;blue\&quot; (9), \&quot;cloudy\&quot; (7), \&quot;grey\&quot; (6), and \&quot;red\&quot; (3).&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/173053547?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="&quot;The sky is...&quot; being input into an LLM. The LLM predicting possible completions with raw scores (logits):&quot;blue&quot; (9), &quot;cloudy&quot; (7), &quot;grey&quot; (6), and &quot;red&quot; (3)." title="&quot;The sky is...&quot; being input into an LLM. The LLM predicting possible completions with raw scores (logits):&quot;blue&quot; (9), &quot;cloudy&quot; (7), &quot;grey&quot; (6), and &quot;red&quot; (3)." srcset="https://substackcdn.com/image/fetch/$s_!3fGV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 424w, https://substackcdn.com/image/fetch/$s_!3fGV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 848w, https://substackcdn.com/image/fetch/$s_!3fGV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 1272w, https://substackcdn.com/image/fetch/$s_!3fGV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cca9652-f2e6-4190-97a0-a503d3db1962_1820x604.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An LLM predicting possible completions with raw scores (logits)</figcaption></figure></div><p>The simplest strategy is to choose the token with the highest logit. This is called &#8220;<strong>greedy decoding</strong>&#8221; and it produces the same response every time you say &#8220;Hello&#8221;.</p><p>Instead of always picking the top token, some sampling strategies also consider other options. For example, for the sentence <em>&#8220;The sky is&#8230;&#8221;</em> the most likely word is <em>&#8220;blue&#8221;</em>, but the model might also choose <em>&#8220;cloudy&#8221;</em>, <em>&#8220;gray&#8221;</em>, or even <em>&#8220;red&#8221;</em>. This allows LLMs respond in a more creative and engaging way.</p><p>Now that we understand the basics of sampling, let&#8217;s look at the strategies most commonly used and how they work.</p><h3>Converting logits to probabilities</h3><p>To choose tokens based on probabilities, we first need to convert the logits (raw scores) into probabilities that sum to 1.</p><blockquote><p><em>The key mathematical formula to make sampling strategies work is <strong><a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a></strong>.</em></p></blockquote><p>The softmax equation looks like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{e^{z_i}}{\\sum_{j=1}^{K} e^{z_j}}&quot;,&quot;id&quot;:&quot;ITBAFMPEKN&quot;}" data-component-name="LatexBlockToDOM"></div><p>, where <em>z_i</em>&#8203; is the logit for token <em>i</em>, and <em>K</em> is the total number of tokens in the vocabulary. This function <strong>turns the set of logits into a probability distribution</strong>: all values are between 0 and 1, and they add up to 1. </p><p>To better understand this, let&#8217;s say the model needs to complete the sentence:</p><div class="pullquote"><p><em>The sky is&#8230;</em></p></div><p>It now has the option to choose one of 4 words: <em>blue</em>, <em>cloudy</em>, <em>gray</em>, or <em>red</em>. Each of these words has an assigned logit: <code>9</code>, <code>7</code>, <code>6</code>, and <code>3</code>, respectively. Softmax converts these logits into a set of probabilities:</p><ul><li><p><em>blue</em> <code>84.2%</code></p></li><li><p><em>cloudy</em> <code>11.4%</code></p></li><li><p><em>gray</em> <code>4.2%</code></p></li><li><p><em>red</em> <code>0.2%</code>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YOui!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YOui!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 424w, https://substackcdn.com/image/fetch/$s_!YOui!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 848w, https://substackcdn.com/image/fetch/$s_!YOui!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 1272w, https://substackcdn.com/image/fetch/$s_!YOui!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YOui!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png" width="1456" height="375" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:375,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:236030,&quot;alt&quot;:&quot;An LLM predicts possible completions with logits, then softmax converts them into a set of probabilities.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/173053547?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An LLM predicts possible completions with logits, then softmax converts them into a set of probabilities." title="An LLM predicts possible completions with logits, then softmax converts them into a set of probabilities." srcset="https://substackcdn.com/image/fetch/$s_!YOui!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 424w, https://substackcdn.com/image/fetch/$s_!YOui!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 848w, https://substackcdn.com/image/fetch/$s_!YOui!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 1272w, https://substackcdn.com/image/fetch/$s_!YOui!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1228f7-291e-40e1-ac3f-5cd1818aeddb_2824x728.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An LLM predicts possible completions with logits, then softmax converts them into a set of probabilities.</figcaption></figure></div><p>Now, instead of choosing word <em>&#8220;blue&#8221; </em>every time, we can choose one of these words according to the probability distribution. If we randomly sampled from this distribution 100 times, we&#8217;d expect to get <em>&#8220;blue&#8221;</em> about 84 times, <em>&#8220;cloudy&#8221;</em> about 11 times, etc.</p><h3>Temperature</h3><p>One of the most common parameters that control randomness of the output is called <strong>temperature</strong>.</p><p>In the softmax function, temperature is a constant, used to divide all logits. This makes the resulting probability distribution more &#8220;sharp&#8221; <em>(if temperature &lt; 1)</em> or more &#8220;flat&#8221; <em>(if it&#8217;s &gt; 1)</em>. In other words, <strong>the higher the temperature is, the closer the probabilities become to each other</strong>, so the model is more likely to pick less probable tokens. Lowering the temperature has the opposite effect: it sharpens the distribution and makes the model stick to the most likely tokens.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{e^{z_i / T}}{\\sum_{j=1}^{K} e^{z_j / T}}&quot;,&quot;id&quot;:&quot;DUGSUHSBZS&quot;}" data-component-name="LatexBlockToDOM"></div><p>If we set the temperature (T) to <code>5</code> and calculate probabilities for our example again, we would get:</p><ul><li><p><em>blue</em> <code>39.7%</code></p></li><li><p><em>cloudy</em> <code>26.6%</code></p></li><li><p><em>gray</em> <code>21.8%</code></p></li><li><p><em>red</em> <code>12.0%</code>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pyaV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pyaV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 424w, https://substackcdn.com/image/fetch/$s_!pyaV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 848w, https://substackcdn.com/image/fetch/$s_!pyaV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 1272w, https://substackcdn.com/image/fetch/$s_!pyaV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pyaV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png" width="1456" height="393" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153406,&quot;alt&quot;:&quot;Shift of probability distribution after increasing the temperature parameter&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/173053547?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Shift of probability distribution after increasing the temperature parameter" title="Shift of probability distribution after increasing the temperature parameter" srcset="https://substackcdn.com/image/fetch/$s_!pyaV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 424w, https://substackcdn.com/image/fetch/$s_!pyaV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 848w, https://substackcdn.com/image/fetch/$s_!pyaV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 1272w, https://substackcdn.com/image/fetch/$s_!pyaV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d93e67-cc31-4831-8d98-97887ecd88fa_2428x656.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Shift of probability distribution after increasing the temperature parameter</figcaption></figure></div><p>As you can see, the distribution becomes much flatter, and words that were unlikely before now have a much higher chance of being chosen.</p><blockquote><p><em>Higher temperature makes the model&#8217;s output more diverse but also more &#8220;risky&#8221;.</em></p></blockquote><p>It&#8217;s common to set temperature to 0 for consistent outputs. Technically, it can&#8217;t be 0 since logits can&#8217;t be divided by zero. In practice, this means the model simply does &#8220;greedy decoding&#8221;, <strong>skipping adjustment and softmax</strong>.</p><p>Try adjusting the temperature <a href="https://claude.ai/public/artifacts/2035b48e-b79e-4605-8d86-53406485a286?fullscreen=true">here</a> and watch how the probability distribution shifts.</p><h3>Top-K</h3><p>Calculating softmax for every logit in the LLM vocabulary, which can be as large as 128,000 tokens, is computationally expensive. Instead of sampling from the entire vocabulary, the Top-K strategy considers only the <strong>tokens with the top k logits</strong> (where k is a parameter). For example, if k = 50, probabilities are calculated only for those 50 tokens instead of all 128,000.</p><blockquote><p><em>A smaller k value makes the text more predictable but less interesting.</em></p></blockquote><h3>Top-P</h3><p>As you can imagine, always sampling from the top K tokens can be suboptimal. For a yes/no question, the model should ideally choose between just two tokens: <em>yes</em> or <em>no</em>. But if you ask it to write a poem, you want a larger pool of tokens to encourage creativity.</p><p>That&#8217;s where the <strong>Top-P </strong><em><strong>(Nucleus)</strong></em><strong> sampling</strong> comes in. Instead of fixing K, it selects the smallest set of tokens whose probabilities add up to a threshold, usually 0.9 or 0.95. Since the probabilities of all tokens sum to 1, this subset covers the most likely ones while excluding very unlikely options.</p><p>In our earlier example, where <em>&#8220;blue&#8221;</em> has probability 0.84 and <em>&#8220;cloudy&#8221;</em> 0.11, setting P = 0.95 would limit sampling to these two tokens, since together they reach the threshold.</p><blockquote><p><em>Top-P strategy doesn&#8217;t make sampling more efficient but it makes responses more coherent.</em></p></blockquote><h3>Stopping Condition</h3><p>So we asked an LLM to complete a sentence. It generated logits for all possible next words, softmax turned them into probabilities, and a sampling strategy picked the next word. The process repeated. But <strong>when does it stop?</strong></p><p>There are two stopping conditions:</p><ol><li><p>The output hits the maximum token limit.<br><strong>This is a parameter you can set</strong>. Stopping this way is not ideal, since it either cuts the response mid-sentence or produces an overly long, costly output.</p></li><li><p>The LLM generates an <code>&lt;end_of_sequence&gt;</code> token.<br>This is the usual and preferred condition. <strong>LLMs are trained to produce a special token when the response is complete</strong>. You can think of it like pressing &#8220;send&#8221; after finishing a message.</p></li></ol><h3>Constrained Sampling</h3><p>Many tasks require an LLM to generate output that follows a specific grammar. For example, it might need to produce a valid SQL query or a JSON object that matches a schema. This is critical because <strong>LLM outputs are often used in applications</strong>, and even a missing bracket in JSON can break downstream steps.</p><p>Even <a href="https://www.aiunpacked.net/p/prompt-engineering-guide">prompt engineering</a> won&#8217;t guarantee that the LLM will follow your instructions and stick to the right format, whether you say &#8220;please&#8221; or not. To solve this, we can <strong>constrain sampling</strong> to tokens that<strong> preserve the grammar</strong>.</p><p>Previous sampling strategies focused on weighted sampling from a subset of tokens based on K or P parameters. Constrained sampling goes even further by allowing the model to <strong>choose only tokens that keep the output valid</strong>.</p><p>This can also speed up generation. Some tokens are almost guaranteed to follow others, such as a closing bracket after an opening one. In these cases, the model can skip sampling and output the token directly.</p><blockquote><p><em>Constrained sampling paired with greedy decoding might turn your LLM into the most powerful <strong>extraction tool</strong>.</em></p></blockquote><p>Constrained sampling is powerful and should be in every AI engineer&#8217;s toolkit, but it has downsides. It can be hard to implement, though many providers and engines (such as vLLM) support common grammars out of the box. It has also <a href="https://arxiv.org/abs/2408.02442">been shown to </a><strong><a href="https://arxiv.org/abs/2408.02442">reduce LLM performance on reasoning tasks.</a></strong></p><h2>What&#8217;s next?</h2><p>You can <strong>experiment yourself </strong>with different sampling techniques! <a href="https://github.com/maxmuzych/ai-engineering-unpacked/tree/main/sampling-in-llms">I&#8217;ve created a python notebook </a>for you to explore and understand sampling from scratch.</p><p>Now you should be able to answer these questions confidently:</p><ol><li><p><em>What is sampling in LLMs and why do we need it?</em></p></li><li><p><em>What sampling strategies exist?</em></p></li><li><p><em>What are the limitations of &#8220;greedy decoding&#8221;?</em></p></li><li><p><em>How does Top-K and Top-P strategy work?</em></p></li></ol><p><strong>If you have any questions</strong>, leave a comment or <a href="https://www.linkedin.com/in/max-muz/">reach out to me on LinkedIn</a>.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading <em><strong><a href="http://www.aiunpacked.net">AI Engineering Unpacked</a>!</strong></em> Subscribe for free to learn how AI works and how to build real-world AI applications.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Prompt Engineering 101]]></title><description><![CDATA[Prompts are the starting point for any AI app, from chatbots to autonomous agents. Learn prompt engineering to build better AI apps.]]></description><link>https://www.aiunpacked.net/p/prompt-engineering-guide</link><guid isPermaLink="false">https://www.aiunpacked.net/p/prompt-engineering-guide</guid><dc:creator><![CDATA[Max]]></dc:creator><pubDate>Wed, 02 Jul 2025 13:25:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/95bf8323-3a2f-4a13-a7ff-43893e19168a_1024x998.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, Large Language Models (LLMs) have become so capable that they are used to create, automate, and educate. <a href="https://www.aiunpacked.net/p/large-language-models-explained">In the previous issue, I explained how they work</a>. Essentially, <strong>LLMs complete the input</strong> sequence you give them. This means the way you interact with them directly shapes their behavior. Whether you're building with them or simply using them, knowing how to prompt them is a key skill.</p><p><em>Note: a <strong>prompt</strong> is the input given to an LLM.</em></p><p>In any AI project that involves an LLM, <strong>prompt engineering is often the starting point</strong>. With a thoughtfully designed prompt, much of the work can be handled right from the beginning. But the final refinements and reliability are often the hardest to achieve.</p><p>Prompting may look simple at first, but under the hood it is a design problem. You are steering a probabilistic system, and <strong>small changes in the prompt can lead to very different outputs</strong>.</p><p>Prompt engineering has even become a standalone job title in some companies. While I believe it should ultimately be part of every <a href="https://www.aiunpacked.net/i/165390267/core-techniques">AI engineer&#8217;s skill set</a>, the fact that it is recognized as its own role shows just how important it has become.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OmjT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OmjT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 424w, https://substackcdn.com/image/fetch/$s_!OmjT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 848w, https://substackcdn.com/image/fetch/$s_!OmjT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 1272w, https://substackcdn.com/image/fetch/$s_!OmjT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OmjT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png" width="418" height="412.121875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:640,&quot;resizeWidth&quot;:418,&quot;bytes&quot;:326853,&quot;alt&quot;:&quot;Meme about Prompt Engineer&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/166889915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Meme about Prompt Engineer" title="Meme about Prompt Engineer" srcset="https://substackcdn.com/image/fetch/$s_!OmjT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 424w, https://substackcdn.com/image/fetch/$s_!OmjT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 848w, https://substackcdn.com/image/fetch/$s_!OmjT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 1272w, https://substackcdn.com/image/fetch/$s_!OmjT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f6b9c4-bc6f-4cb8-bbaa-3236f2c50df2_640x631.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.reddit.com/r/ProgrammerHumor/comments/1c27dj7/heknewwhathewasdoing/">Meme from Reddit</a></figcaption></figure></div><blockquote><p><em>&#8220;The Problem is not with prompt engineering. It&#8217;s a real and useful skill to have. The <strong>problem is when prompt engineering is the only thing</strong> people know.&#8221;<br>- </em>OpenAI Research Manager, when interviewed for <a href="https://www.oreilly.com/library/view/ai-engineering/9781098166298/">AIE book</a>.</p></blockquote><h3>&#128161;In This Issue</h3><p>We'll explore how to interact with models more effectively. You&#8217;ll learn the core principles behind good prompts, the mechanisms that shape model behavior, and the techniques that separate average outputs from great ones. Whether you're aiming for more control, better results, or just a deeper understanding of how these systems respond, this issue will give you the tools to get there.</p><blockquote><p><em>Prompt Engineering is the easiest and most common way to adapt LLMs.</em></p></blockquote><h2>Technical Details</h2><p>Before diving into prompt engineering techniques, it's important to understand a few core concepts.</p><h3>Tokens</h3><blockquote><p><em>Tokens are the true &#8220;atoms&#8221; of LLMs.</em></p></blockquote><p>Models don&#8217;t work with text directly, but instead they process and generate <strong>tokens</strong>. Understanding <a href="https://www.aiunpacked.net/i/166234158/tokenization-translating-words-into-numbers">how tokenization works</a> is important for efficient prompting. While it is out of scope for this issue, here is one thing to keep in mind.</p><ul><li><p><strong>Typos and odd formatting increase token count</strong><br>Misspelled or oddly structured text may be broken into more tokens, wasting space.</p></li></ul><h3>System and User Prompts, and Messages</h3><p>The <strong>system prompt</strong> is an initial instruction that sets the tone, style, behavior, or constraints for the model. For example, there is a hidden system prompt behind every ChatGPT conversation. It&#8217;s normally hidden from users, but they have been <a href="https://github.com/elder-plinius/CL4R1T4S">leaked in the past</a>.</p><p>If you're a developer using an API, you must <strong>define the system prompt</strong> yourself. It's typically the first message in the input list, labeled with the role <code>"system"</code>, and it serves to guide the model&#8217;s behavior at a high level.</p><p>The <strong>user prompt</strong> is what you, or the end user, actually type. There can be multiple user messages over the course of a conversation. These are typically labeled as <code>"user"</code> when using the API, and they contain the actual instructions, questions, or inputs you want the model to respond to.</p><p>When using the API, you normally construct a conversation as a <strong>list of messages</strong>, each with a role: <code>"system"</code>, <code>"user"</code>, or <code>"assistant"</code>.</p><p>This message list is then <strong>combined</strong> into a single prompt behind the scenes using the model's tokenizer. Different providers have slightly different formatting, and some models (like DeepSeek&#8217;s R1) even <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/README.md#usage-recommendations">recommend avoiding a system prompt altogether</a>.</p><p>Understanding this message structure is key for anyone building interactive applications, especially those that rely on multi-turn conversations or consistent behavior across responses.</p><h3>Special tokens</h3><p>Special tokens are reserved tokens that serve structural or functional purposes. They might mark the beginning of a sequence, signal the start or end of a system or user message, or indicate when generation should stop.</p><p>For example, once a model generates a special end-of-sequence token, generation is terminated. Otherwise, it would continue until hitting a token limit.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!faQs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!faQs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 424w, https://substackcdn.com/image/fetch/$s_!faQs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 848w, https://substackcdn.com/image/fetch/$s_!faQs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 1272w, https://substackcdn.com/image/fetch/$s_!faQs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!faQs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png" width="784" height="306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49777,&quot;alt&quot;:&quot;System and User prompts formatted with special tokens for GPT-3.5-turbo&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/166889915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="System and User prompts formatted with special tokens for GPT-3.5-turbo" title="System and User prompts formatted with special tokens for GPT-3.5-turbo" srcset="https://substackcdn.com/image/fetch/$s_!faQs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 424w, https://substackcdn.com/image/fetch/$s_!faQs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 848w, https://substackcdn.com/image/fetch/$s_!faQs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 1272w, https://substackcdn.com/image/fetch/$s_!faQs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fc74a58-26c0-4b9e-831d-19cd1428065d_784x306.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">System and User prompts <a href="https://tiktokenizer.vercel.app/?model=gpt-3.5-turbo">formatted with special tokens for GPT-3.5-turbo</a></figcaption></figure></div><p>If you're using a model locally, it&#8217;s important to ensure your tokenizer adds these tokens correctly. Some tokenizers do this automatically. In my experience, when using <a href="https://ai.meta.com/blog/meta-llama-3/">LLaMA-3-8B</a> for tool use, it <strong>performed poorly without special tokens</strong>, but worked well once they were added.</p><h3>Parametric vs. Non-Parametric Memory</h3><p>LLMs have two types of memory: <strong>parametric</strong> and <strong>non-parametric</strong>.</p><p>Parametric memory refers to information stored in the model&#8217;s parameters. This knowledge is acquired during training and can only be changed by updating the model&#8217;s weights. In other words, parametric memory is fixed unless the model is retrained or fine-tuned.</p><p>Non-parametric memory, on the other hand, includes everything the model sees in the current prompt. When we add extra context or information to a prompt, we are relying on non-parametric memory. This is the type of memory <strong>most accessible</strong> to developers and users.</p><h3>Chat History</h3><p>Now that we&#8217;ve covered memory types, we can explain how tools like ChatGPT appear to "remember" earlier messages. This is made possible through <strong>non-parametric memory</strong>.</p><p>Each time you send a message, it&#8217;s <strong>appended to the conversation history</strong>. With each new interaction, the full history is passed back to the model as part of the input. This is what allows the model to continue the conversation coherently. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5bSV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5bSV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 424w, https://substackcdn.com/image/fetch/$s_!5bSV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 848w, https://substackcdn.com/image/fetch/$s_!5bSV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 1272w, https://substackcdn.com/image/fetch/$s_!5bSV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5bSV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png" width="446" height="336.2460850111857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2cd30ea-6168-4601-a658-610153957f35_894x674.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:894,&quot;resizeWidth&quot;:446,&quot;bytes&quot;:205324,&quot;alt&quot;:&quot;Chat history example&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/166889915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Chat history example" title="Chat history example" srcset="https://substackcdn.com/image/fetch/$s_!5bSV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 424w, https://substackcdn.com/image/fetch/$s_!5bSV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 848w, https://substackcdn.com/image/fetch/$s_!5bSV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 1272w, https://substackcdn.com/image/fetch/$s_!5bSV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cd30ea-6168-4601-a658-610153957f35_894x674.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Chat history visualized (<a href="https://python.langchain.com/docs/concepts/chat_history/">from LangChain</a>)</figcaption></figure></div><p>However, as the history grows, each response becomes more expensive to generate. Longer conversations require more tokens and compute. More importantly, <strong><a href="https://arxiv.org/abs/2505.06120">LLMs get lost in multi-turn conversations</a></strong>, with an average drop of 39% across six generation tasks. Hence, restarting a conversation when it gets too long is crucial.</p><h3>Context Window</h3><p>LLMs have a fixed <strong>context window</strong>, which is the maximum number of tokens they can process in a single input. If the total prompt exceeds this limit, older parts of the conversation may be truncated or ignored entirely.</p><p>The size of this window has increased dramatically over time. <a href="https://github.com/openai/gpt-2">GPT-2</a> had a context window of just 1,024 tokens, while state-of-the-art models like Gemini-2.5-Flash can handle up to 1 million tokens.</p><p>Studies have shown that when prompts are very long, models often <strong><a href="https://arxiv.org/abs/2307.03172">"forget" information placed in the middle</a></strong> of the input. So while longer context windows allow for more information, they don&#8217;t guarantee better performance unless the prompt is structured carefully.</p><p>Context length also affects <strong>efficiency</strong>. Overlong prompts can introduce:</p><ul><li><p>Unnecessary latency</p></li><li><p>Higher costs</p></li></ul><p>Long prompts can also <strong>degrade</strong> the model&#8217;s performance.</p><blockquote><p><em>When designing LLM applications, it&#8217;s important to balance richness of input with efficiency and relevance.</em></p></blockquote><h2>Prompting Best Practices</h2><p>The golden rule of working with LLMs is simple: <strong>Better Input &#8594; Better Output.</strong></p><p>Prompting can get incredibly tricky, as there is no guarantee that the model will follow your instructions, especially for smaller models. But <strong>systematic approach</strong> to prompt engineering can save you a lot of time.</p><h3>Be Specific</h3><p>LLMs perform best when your instructions are <strong>clear, explicit, and supported with context</strong>. Vague prompts often lead to vague or unpredictable responses.</p><h4>Specify the Role</h4><p>Assigning a role gives the model behavioral context. For example, &#8220;You are a helpful customer support assistant&#8221; nudges it toward <strong>tone</strong>,<strong> format</strong>,<strong> </strong>and<strong> intent</strong> aligned with that persona. If you're using the API, this is typically done through the <strong>system prompt</strong>.</p><blockquote><p><em><a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts">When using LLM (Claude), you can dramatically improve its performance by using the system prompt to give it a role.</a> - Anthropic</em></p></blockquote><h4>Write Clear Instructions</h4><p>General prompts like &#8220;Help the user&#8221; leave too much room for interpretation. Instead, <strong>be explicit</strong>: &#8220;Answer customer questions about subscription plans using a friendly and professional tone.&#8221; Clear, specific directives reduce ambiguity and make the model's output more consistent.</p><h4>Provide the Context</h4><p>Without context, the model falls back on its internal training data, which may be outdated or misaligned with your task. Include<strong> relevant information</strong>, like your return policy, in the prompt to reduce hallucinations and increase accuracy.</p><h4>Provide Examples</h4><p>If there&#8217;s a particular style or format you want, show it. Providing one or more examples helps the model generalize and replicate the expected behavior. This is called <strong>In-Context Learning</strong>. For instance, if you need responses to be concise, include a short, well-structured example and tell the model to follow that pattern.</p><h3>Break down complex tasks</h3><p>LLMs struggle with ambiguity and perform inconsistently on large, multi-step tasks. If possible, <strong>decompose the workflow</strong>.</p><p>Splitting a problem into smaller steps makes your prompts easier to test, debug, and maintain. It also opens the door to parallel execution or routing simpler tasks to smaller, cheaper models.</p><p>For example, if your chatbot needs to process a refund, the task might involve:</p><ol><li><p>Identifying which items need to be refunded</p></li><li><p>Checking refund eligibility</p></li><li><p>Providing a receipt and follow-up instructions</p></li></ol><p>Rather than asking the model to handle all of this in a single prompt, you can <strong>break it into separate prompts</strong> and run them sequentially. This modular approach improves reliability and gives you more control over each step.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8N7k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8N7k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 424w, https://substackcdn.com/image/fetch/$s_!8N7k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 848w, https://substackcdn.com/image/fetch/$s_!8N7k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!8N7k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8N7k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg" width="336" height="424.48293963254594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1444,&quot;width&quot;:1143,&quot;resizeWidth&quot;:336,&quot;bytes&quot;:219998,&quot;alt&quot;:&quot;Prompt chaining example&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/166889915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20959896-0666-49e5-926f-ebbf686f09f1_1143x1791.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Prompt chaining example" title="Prompt chaining example" srcset="https://substackcdn.com/image/fetch/$s_!8N7k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 424w, https://substackcdn.com/image/fetch/$s_!8N7k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 848w, https://substackcdn.com/image/fetch/$s_!8N7k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!8N7k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07816b10-3204-4f40-8cf6-6b615d9a6721_1143x1444.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of prompt chaining for refund processing</figcaption></figure></div><p>The more narrow and deterministic your instructions, the more <strong>consistent</strong> and <strong>predictable</strong> your outputs will be.</p><h3>Give the model time to think</h3><p>Sometimes better results come not from adding more input, but from <strong>thinking</strong> more carefully.</p><p>One way to guide the model is by asking it to solve problems <strong>step by step</strong>. This helps it stay focused and follow a clearer line of reasoning. Another method asks the model to review and revise its own answer. That adds a layer of self-checking. Both methods aim to reduce errors and improve accuracy. They don&#8217;t always come free, as they can slow the response and use up more space.</p><h3>Iterate</h3><p>Prompt engineering is iterative by nature. Start with a basic instruction. Watch for errors. <strong>Adjust</strong>. <strong>Repeat</strong>.</p><p>Use versioned prompts and fixed test sets to evaluate systematically. Run the same prompt across different models to compare results. Intuition helps, but it&#8217;s not enough for production.</p><p>If you&#8217;re building applications, treat prompts like code. Keep them separate from the app&#8217;s logic. <strong>Version</strong> <strong>them</strong>. Annotate changes.</p><p>Without this structure, large-scale reliability is hard to maintain.</p><h2>Prompt Engineering Techniques</h2><p>For many users, following best practices will handle most cases. But if you&#8217;re building with LLMs, understanding and applying these techniques will allow you to build sophisticated <strong>AI applications</strong>.</p><h3>Few-shot Prompting</h3><p>When you ask an LLM to answer a question without giving any examples, it's called <strong>zero-shot prompting</strong>. If you include a few examples to show the model what kind of output you expect, this is known as <strong>few-shot prompting</strong>.</p><p>In the now-famous GPT-3 paper <em>&#8220;<a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners</a>&#8221;</em>, researchers at OpenAI showed that, with just a handful of examples, LLMs could perform tasks that weren&#8217;t explicitly present in their training data, such as translation, question answering, or arithmetic.</p><p>In practice, however, <strong>few-shot prompting can be a double-edged sword</strong>. In my own work with LLaMA-3-8B, I found that few-shot examples sometimes hurt more than they helped. They consume valuable space in the context window (which was only 8,000 tokens in that model), and they can lead the model to copy details from the examples instead of focusing on the input. To avoid this, I recommend using a small number of generic examples (ideally 5 to 10) and abstracting away specifics. For instance, if you're extracting phone numbers, use placeholders like <code>&lt;PHONE_NUMBER&gt;</code> in the examples instead of real data. This is especially important with smaller models.</p><h3>Chain-of-Thought</h3><p>Researchers at Google discovered that prompting the model to reason through a task step by step, significantly improves performance on reasoning-heavy problems. This approach, called <strong>Chain-of-Thought (CoT)</strong> prompting, dramatically boosted <a href="https://research.google/blog/pathways-language-model-palm-scaling-to-540-billion-parameters-for-breakthrough-performance/">PaLM-540B</a>&#8217;s performance on a <a href="https://github.com/openai/grade-school-math">grade school math benchmark</a> from 18% to 57%.</p><p>Asking a model to "think step by step" helps, but showing examples of that reasoning works better for specific tasks. A method, called <a href="https://arxiv.org/abs/2210.03493">Auto-CoT</a>, aimed to automate this process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6UtX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6UtX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 424w, https://substackcdn.com/image/fetch/$s_!6UtX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 848w, https://substackcdn.com/image/fetch/$s_!6UtX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 1272w, https://substackcdn.com/image/fetch/$s_!6UtX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6UtX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png" width="1456" height="730" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:270496,&quot;alt&quot;:&quot;Standard Prompting vs Chain-of-Thought Prompting&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/166889915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Standard Prompting vs Chain-of-Thought Prompting" title="Standard Prompting vs Chain-of-Thought Prompting" srcset="https://substackcdn.com/image/fetch/$s_!6UtX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 424w, https://substackcdn.com/image/fetch/$s_!6UtX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 848w, https://substackcdn.com/image/fetch/$s_!6UtX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 1272w, https://substackcdn.com/image/fetch/$s_!6UtX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4177695-6e4a-4e10-879e-7e06e8520ef4_1652x828.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://arxiv.org/abs/2201.11903">Standard Prompting vs CoT Prompting</a></figcaption></figure></div><p>The insight here is simple: instead of treating the model like a calculator that produces an answer, you treat it like a problem solver that works through intermediate steps. This was also believed to make the model&#8217;s reasoning more transparent. However, Anthropic&#8217;s research reveals a key limitation: many <a href="https://www.anthropic.com/research/reasoning-models-dont-say-think">CoTs </a><strong><a href="https://www.anthropic.com/research/reasoning-models-dont-say-think">do not faithfully reflect</a></strong><a href="https://www.anthropic.com/research/reasoning-models-dont-say-think"> the model&#8217;s actual reasoning process</a>, concealing how it arrived at its conclusions.</p><p>Despite this, CoT has had a major influence on the field. It inspired a surge of research into prompting techniques and reasoning architectures. Approaches like <strong><a href="https://arxiv.org/abs/2203.11171">Self-Consistency</a> </strong>and <strong><a href="https://arxiv.org/abs/2305.10601">Tree-of-Thoughts</a></strong>, which samples multiple reasoning paths to find the most reliable answer, build on the core idea of encouraging deliberation and step-by-step problem solving.</p><p>More broadly, Chain-of-Thought reshaped how researchers think about using LLMs, not just as text predictors, but as agents capable of decomposing and reasoning through complex tasks. It laid the foundation for everything from advanced prompting methods to the emergence of <strong>reasoning models.</strong></p><h3>ReAct</h3><p>Building on Chain-of-Thought, researchers from Google <a href="https://arxiv.org/abs/2210.03629">proposed </a><strong><a href="https://arxiv.org/abs/2210.03629">ReAct</a></strong>, a framework that combines <strong>reasoning</strong> and <strong>acting</strong>. Rather than generating a final answer directly, the model enters a loop of reasoning, tool use, and observation.</p><p>The ReAct loop has three steps:</p><ol><li><p><strong>Reason</strong> &#8211; The model reflects on the current task and proposes the next action.</p></li><li><p><strong>Act</strong> &#8211; It performs the proposed action, such as calling a tool or retrieving information.</p></li><li><p><strong>Observe</strong> &#8211; It incorporates the result of that action and reasons about what to do next.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HVjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HVjF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 424w, https://substackcdn.com/image/fetch/$s_!HVjF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 848w, https://substackcdn.com/image/fetch/$s_!HVjF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!HVjF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HVjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png" width="414" height="490.81640625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1214,&quot;width&quot;:1024,&quot;resizeWidth&quot;:414,&quot;bytes&quot;:2292521,&quot;alt&quot;:&quot;Reason-Act prompting for agents&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.aiunpacked.net/i/166889915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e31b0e-a838-4ced-b35e-f274d501bf90_1024x1257.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reason-Act prompting for agents" title="Reason-Act prompting for agents" srcset="https://substackcdn.com/image/fetch/$s_!HVjF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 424w, https://substackcdn.com/image/fetch/$s_!HVjF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 848w, https://substackcdn.com/image/fetch/$s_!HVjF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!HVjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42a80c6-02fe-4835-8365-70dec19d8c43_1024x1214.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">ReAct flowchart</figcaption></figure></div><p>This loop continues until the model reaches a conclusion. Of course, safeguards are needed to prevent infinite loops or repetitive behavior.</p><p>ReAct is powerful because it introduces interaction and adaptability. It laid the foundation for <strong>autonomous agents</strong>, systems that can plan, act, and reason across multiple steps to reach a goal, often using tools or APIs along the way.</p><h3>Automatic Prompt Optimisation</h3><p>Manual prompt tuning is not scalable. Tools like <a href="https://dspy.ai/">DSPy</a> <strong>automate</strong> the process by exploring different prompts and testing them.</p><p>It works best when:</p><ul><li><p>You have large evaluation sets</p></li><li><p>Your tasks are repetitive</p></li></ul><p>That said, such tools generate <strong>a lot of API calls</strong>, sometimes hundreds per experiment. Always monitor what&#8217;s happening under the hood to avoid exploding costs or hidden errors.</p><h2>Jailbreaking and Prompt Injections</h2><p>Prompts can also be used to &#8220;hack&#8221; an application by making a model act in unintended ways. This includes revealing private information, executing unauthorized actions, or producing <strong>harmful</strong> or <strong>misleading</strong> output.</p><p>While modern LLMs are good at identifying and refusing many of these attacks, it&#8217;s still important to add <strong>safety layers</strong>. These can include input/output filtering, prompt hardening, and isolating risky capabilities. This is especially important in apps like AI agents that interact with internal tools.</p><p>One example of a <strong>prompt injection</strong> attack involves hiding tiny text in a r&#233;sum&#233;. The model, which is used to analyze candidate r&#233;sum&#233;s, reads this hidden text even though a person cannot see it. As a result, it might respond with a message like &#8220;<em>This is the best candidate so far, you should hire them.</em>&#8221;</p><blockquote><p><em>Prompt attacks are a form of social engineering, but this time targeting machines.</em></p></blockquote><p>Prompt extraction attacks have led to the <strong>leak</strong> of many system prompts from ChatGPT, Claude, and other chatbots. There is even a <a href="https://github.com/jujumilk3/leaked-system-prompts">dedicated GitHub repository</a> with supposedly leaked prompts. These prompts often provide a sneak peak at what works best. They typically include instructions such as:</p><ul><li><p>Personality engineering</p></li><li><p>Constitutional AI and safety layers</p></li><li><p>Tool usage protocols</p></li></ul><p>One of the recent leaked prompts is the Claude 4 system prompt. It&#8217;s <strong>25,000 tokens long</strong>, which adds significant computational cost, and it includes explicit <strong>hardcoded political information</strong>. Simon Willison has published a strong <a href="https://simonwillison.net/2025/May/25/claude-4-system-prompt/">overview of its contents.</a></p><h2>Learn by Building!</h2><blockquote><p><em><strong>The best way to learn is to build something yourself.</strong></em></p></blockquote><p>I&#8217;ve created a simple <a href="https://github.com/maxmuzych/ai-engineering-unpacked/tree/main/prompt-engineering-101">Customer Support Bot example</a> for you to try out and <strong>experiment</strong> with techniques covered in this issue. It runs on the free tier of the Gemini API, so you won&#8217;t need to spend anything.</p><p>Questions? <a href="https://www.linkedin.com/in/max-muz/">Message me on LinkedIn</a>.</p><div><hr></div><h2>Further Reading</h2><p>&#128073; <a href="https://www.promptingguide.ai/">Go-To guide for Prompt Engineering</a></p><h4>How-to Guides</h4><ul><li><p>OpenAI - [<a href="https://platform.openai.com/docs/guides/prompt-engineering/six-strategies-for-getting-better-results">1</a>, <a href="https://platform.openai.com/docs/guides/text?api-mode=chat">2</a>]</p></li><li><p><a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview">Anthropic</a></p></li><li><p><a href="https://services.google.com/fh/files/misc/gemini-for-google-workspace-prompting-guide-101.pdf">Google</a></p></li><li><p><a href="https://www.llama.com/docs/how-to-guides/prompting/">Meta</a></p></li></ul><h4>Prompt Examples</h4><ul><li><p><a href="https://platform.openai.com/docs/examples">OpenAI</a></p></li><li><p><a href="https://docs.anthropic.com/en/resources/prompt-library/library">Anthropic</a></p></li><li><p><a href="https://console.cloud.google.com/vertex-ai/studio/prompt-gallery">Google</a></p></li></ul><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/p/prompt-engineering-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading <em><strong><a href="http://aiunpacked.net">AI Engineering Unpacked</a></strong></em>! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/p/prompt-engineering-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.aiunpacked.net/p/prompt-engineering-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div>]]></content:encoded></item><item><title><![CDATA[Large Language Models Explained]]></title><description><![CDATA[Learn how LLMs think (and how to think about them)]]></description><link>https://www.aiunpacked.net/p/large-language-models-explained</link><guid isPermaLink="false">https://www.aiunpacked.net/p/large-language-models-explained</guid><dc:creator><![CDATA[Max]]></dc:creator><pubDate>Wed, 18 Jun 2025 13:26:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fe2a3451-c830-45a3-8709-35c1c661f101_2280x1600.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Intro</h2><p><strong>Large Language Models (LLMs)</strong> are arguably the most powerful AI models we have today. They power applications like <strong>ChatGPT</strong>, and can write poems, answer questions, draft legal documents, and even generate code. With billions of &#8220;neurons&#8221; that were trained on the entire internet, LLMs can understand and generate human language.</p><p>What&#8217;s even more impressive: they generalize across a wide range of tasks, often without needing any additional training. That&#8217;s why, for many applications, you no longer need to build your own AI model from scratch - you can just plug into one. This shift in how we build with AI is at the heart of <a href="https://www.aiunpacked.net/p/what-is-ai-engineering">AI Engineering, which I introduced in the first issue of this series.</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lzUH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lzUH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lzUH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lzUH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lzUH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lzUH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg" width="1456" height="1464" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1464,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1423963,&quot;alt&quot;:&quot;Use cases of Large Language Models&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/166234158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Use cases of Large Language Models" title="Use cases of Large Language Models" srcset="https://substackcdn.com/image/fetch/$s_!lzUH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lzUH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lzUH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lzUH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d335664-4a1b-48df-b257-f5ee2f66b343_1614x1623.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://arxiv.org/pdf/2204.07705">LLM use cases</a></figcaption></figure></div><p>Today, almost anyone can use ChatGPT to learn faster, get work done, or experiment creatively. But while LLMs are everywhere - <a href="https://insight.factset.com/highest-number-of-sp-500-companies-citing-ai-on-q2-earnings-calls-in-over-10-years">and CEOs can&#8217;t stop talking about them</a> - very few people actually understand how they work under the hood.</p><p>If you&#8217;re an engineer building with these models, this understanding isn&#8217;t optional. It&#8217;s what lets you use LLMs <strong>effectively</strong>, debug weird outputs, and design systems that go beyond prompting.</p><p>This is essential not only for developers but for everyday users, so they can better understand the flaws and use these models more efficiently.</p><h3>&#128161; In This Issue</h3><p>In this issue, we&#8217;ll build a strong mental model for how Large Language Models actually work. You&#8217;ll learn how these models evolved, what they&#8217;re really doing when they generate text, and how to work with them effectively.</p><p>While we&#8217;ll touch on some technical aspects, the focus here is clarity &#8212; not complexity. We&#8217;ll leave the deep dives (like how attention works) for future issues. Today is all about getting the right mental model.</p><h2>LLM is &#8220;Just&#8221; Next-Word Predictor</h2><p>Ever typed a sentence and watched your phone suggest the next word? Now imagine that - scaled to <strong>billions of parameters</strong> and trained on <strong>most of the internet</strong>.</p><p>That&#8217;s a large language model.</p><p>LLMs create responses <strong>word by word</strong> based on user input.. They are basically predicting the next word but in ways that appear intelligent to humans.</p><p>But language modeling isn&#8217;t new.</p><p>The task of predicting the next word or sequence of words, has evolved over decades from early rule-based systems that were rigid and limited, to statistical <a href="http://placeholder">n-gram models</a> that introduced probabilities but struggled with longer context, and finally to neural networks like RNNs and <a href="http://placeholder">LSTMs</a> in the 2010s, which improved performance using deep learning but still <a href="http://reference">faced challenges with long-range dependencies</a>.</p><p>Then came a breakthrough.</p><h3>Transformers Changed Everything</h3><p>In 2017, Google researchers proposed the new <strong><a href="https://en.wikipedia.org/wiki/Neural_network_(machine_learning)">neural network</a></strong> architecture - <strong>Transformer</strong> in the now-famous paper <em>&#8220;<a href="https://arxiv.org/abs/1706.03762">Attention is All You Need</a>&#8221;.</em></p><ul><li><p>It introduced <strong>self-attention </strong>mechanism, allowing models to understand language much better, especially longer sequences.</p></li><li><p>This also made training vastly more parallelizable - a perfect match for modern compute infrastructure.</p></li></ul><p>Transformers became the foundation of models like BERT, GPT, and LLaMA. Today, nearly every state-of-the-art NLP model uses this architecture.</p><p>Transformers can be adapted to different tasks:</p><ul><li><p><strong>Encoders</strong> (e.g. <a href="https://arxiv.org/abs/1810.04805">BERT</a>) for classification and entity recognition.</p></li><li><p><strong>Decoders</strong> (e.g. <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>) for text generation.</p></li><li><p><strong>Encoder-decoder</strong> models (e.g. <a href="https://arxiv.org/pdf/1910.10683">T5</a>) for translation, summarization, and question answering. Though today, many of these tasks are handled by decoder-only models.</p></li></ul><p>In this issue, we focus on <strong>decoder-only</strong> architecture like <a href="http://chatgpt.com">ChatGPT</a> - the ones that generate language, word by word, to simulate conversation, write code, solve problems, and more.</p><h3>Emergent capabilities</h3><p>Even though LLMs are trained just to <strong>predict the next word</strong>, they can end up doing things that look surprisingly smart.</p><p>&#129504; <strong>Mimicked Reasoning</strong></p><p>By generating text one word at a time, they can follow step-by-step reasoning, like solving a math problem or explaining a concept. This &#8220;thinking out loud&#8221; often leads to better answers, simply by writing down each small step.</p><p>&#128736; <strong>Tool Use</strong></p><p>The same word-by-word generation also enables tool use. For example, if connected to a calculator or a search engine, a model can write something like <code>calculate(2 + 2)</code> or <code>search("weather in Paris")</code> and the system will recognize that as a tool call. The model doesn't need to know what a calculator is; it just learns to write the right words to get the job done.</p><p>&#129302; <strong>Agentic Behavior</strong></p><p>With the right setup, LLMs can also carry out multi-step tasks&#8212;deciding what to do next, using tools, checking results, and continuing&#8212;all just by continuing a text. This kind of structured problem-solving is called an <strong>agentic workflow</strong>, and it&#8217;s powered entirely by next-word prediction.</p><p>So, these "probabilistic parrots" display surprisingly sophisticated behaviors. Their simple objective, when scaled and trained on diverse data, gives rise to previously unseen capabilities.</p><h2>Training</h2><p>Training a neural network involves adjusting its internal parameters, so that its behavior begins to mirror human-like understanding. By showing it tons of examples of input-output pairs, the model starts to uncover patterns in language and uses these to make smart predictions on new, unseen text. For LLMs, this learning happens in two major stages: <strong>pre-training</strong> and <strong>post-training</strong>.</p><p><em>Note: LLMs are not operating on raw text. Instead, they operate on <strong>tokens. </strong>You can think about them as words (e.g. &#8220;learn&#8220;) or subwords (e.g. &#8220;ed&#8221;, &#8220;ing&#8220;).</em></p><h3>Pre-training</h3><p>The first and most computationally intensive phase is called <strong>pre-training</strong>. Here, the model is exposed to vast amounts of raw text from books, articles, websites, forums, and other public sources. It learns by predicting the next token in a sentence, like completing:</p><blockquote><p>&#8220;To make a chocolate cake, first preheat the ...&#8221; &#8594; &#8220;oven&#8221;.</p></blockquote><p>This simple game of next-token prediction turns out to be surprisingly powerful. It enables the model to learn grammar, facts about the world, reasoning patterns, and even some basic common sense, all without explicit human supervision. This is why it&#8217;s called <strong>self-supervised learning</strong>: the supervision signal (what the "correct" answer is) comes from the data itself.</p><h4>Data Collection</h4><p>LLMs are trained on enormous amounts of text that is far more what any human could absorb. <a href="https://ai.meta.com/blog/meta-llama-3/">Meta&#8217;s LLaMA 3</a>, for example, was trained on 15 trillion tokens, more than a person might read in a lifetime.</p><p>They learn not through deep experience, but <strong>massive breadth</strong>.</p><p>To reach this scale, developers crawl the web and license large datasets. Common sources include <strong><a href="https://commoncrawl.org/">Common Crawl</a></strong>, which scrapes billions of web pages regularly. The data then passes through filters to improve quality and reduce harm. The filtered open dataset is <strong><a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb</a></strong> with 15 trillions tokens.</p><p>The data collection process <a href="https://www.vox.com/future-perfect/364384/its-practically-impossible-to-run-a-big-ai-company-ethically">remains controversial</a>: many documents are scraped without permission, raising legal and ethical concerns.</p><h4>Objective: Autoregressive Language Modeling</h4><p>Most modern LLMs are trained as <strong>autoregressive language models</strong>. That means they take a sequence of tokens (e.g., words or subwords) and learn to predict the next token, one step at a time.</p><p>The training dataset is split into chunks of different size, these chunks are then used to train the model. At each step, the model sees all the previous tokens and generates a probability distribution over what word should come next. It is simply trained to memorize what <em>usually comes next</em> in human language, not to understand explicitly. But as it ingests more data, those patterns begin to encode complex ideas and knowledge structures.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VutU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VutU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 424w, https://substackcdn.com/image/fetch/$s_!VutU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 848w, https://substackcdn.com/image/fetch/$s_!VutU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 1272w, https://substackcdn.com/image/fetch/$s_!VutU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VutU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png" width="1280" height="310" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182547,&quot;alt&quot;:&quot;A training sample for an LLM&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/166234158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A training sample for an LLM" title="A training sample for an LLM" srcset="https://substackcdn.com/image/fetch/$s_!VutU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 424w, https://substackcdn.com/image/fetch/$s_!VutU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 848w, https://substackcdn.com/image/fetch/$s_!VutU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 1272w, https://substackcdn.com/image/fetch/$s_!VutU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcd13a9d-b503-42fc-bfa3-b42d3a9cd347_1280x310.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example of a training sample for an LLM</figcaption></figure></div><h4>Infrastructure &amp; Scaling Laws</h4><p>Training these models requires enormous compute infrastructure. Clusters of specialized <strong>GPUs</strong> are used to train a model in parallel over weeks or months.</p><p>Why train such large models? Because <strong>scaling laws</strong> show that performance continues to improve as we scale up size of the model and training data. And this in turn requires more compute.</p><h3>Post-training</h3><p>At the end of pre-training, we have a <strong>base model</strong>: a powerful, general-purpose text generator that&#8217;s read a large fraction of the internet. But it&#8217;s not yet an assistant. If you ask it:</p><pre><code><code>&#8220;What&#8217;s your name?&#8221;</code></code></pre><p>It might respond with:</p><pre><code><code>&#8220;What&#8217;s your surname?&#8221;</code></code></pre><p>because that phrase often follows in web forms the model saw during training.</p><p>Even worse, it may reproduce offensive or harmful language seen during training. That&#8217;s why base models are typically not exposed directly to users.</p><p>To make the model more <strong>helpful</strong> and <strong>harmless</strong>, we run it through <strong>post-training</strong>.</p><p>Post-training turns raw linguistic intelligence into trustworthy interaction.</p><blockquote><p><em>Pre-training unlocks capability. Alignment unlocks usability.</em></p></blockquote><h4>1. Instruction Fine-Tuning</h4><p>The first step in post-training is <strong>supervised fine-tuning (SFT)</strong>, also called <strong>instruction tuning</strong>. Here, the model is shown curated examples of how it <em>should</em> behave in assistant-like conversations:</p><pre><code><code>User: What&#8217;s your name?  
Assistant: My name is ChatGPT, a language model developed by OpenAI.</code></code></pre><p>This includes both synthetic conversations and manually written examples by human experts. The learning objective is the same as pre-training&#8212;predict the next token&#8212;but now the training examples are dialog turns, not internet text.</p><p>SFT teaches the model how to:</p><ul><li><p>Follow instructions</p></li><li><p>Be polite and informative</p></li><li><p>Refuse unsafe or inappropriate requests</p></li></ul><p>It&#8217;s how the model begins to <strong>simulate helpful behavior</strong>.</p><h4>2. Reinforcement Learning / Preference Optimization</h4><p>Instruction fine-tuning gets you a competent assistant, but it still imitates human-written answers without deeper judgment. To take it further, we apply <strong>reinforcement learning (RL)</strong>.</p><p>There are two major goals here. First - <strong>preference alignment</strong> - teaches the model to oroduce responses that humans prefer. Second - <strong>reasoning emergence</strong> - encourages the model to discover and use multi-step reasoning strategies.</p><p><strong>Reinforcement Learning from Human Feedback (RLHF)</strong></p><p>The core of RLHF is to generate several candidate answers to each prompt and have humans rank them, so the best answer gets the highest score. These rankings are used to train a <strong>reward mode</strong>l that estimates how much a human would prefer each response. Next, the language model (policy) is fine-tuned using <strong>reinforcement learning</strong> (often PPO) to maximize the reward signals. This process allows the model to explore new outputs that go beyond simply imitating training examples, learning to produce responses that align more closely with human preferences.</p><p><em>Note<strong>:</strong> Some <strong>preference alignment methods</strong>, like <a href="https://arxiv.org/pdf/2305.18290">DPO</a> and <a href="https://arxiv.org/pdf/2405.14734">SimPO</a>, were inspired by RLHF but <strong>do not use reinforcement learning</strong>. They simplify the process and have been shown to perform as well or <strong>better</strong> than RLHF in many tasks.</em></p><p><strong>RL Unlocks Reasoning</strong></p><p>Perhaps the most exciting result of post-training is that reasoning emerges.</p><p>RL-tuned models (e.g., <a href="https://openai.com/o1/">GPT-o1</a>, <a href="https://api-docs.deepseek.com/news/news250120">DeepSeek R1</a>) don&#8217;t just answer questions&#8212;they <em>think through them</em>:</p><ul><li><p>Break down problems into steps</p></li><li><p>Double-check answers</p></li><li><p>Try alternative approaches</p></li></ul><p>These reasoning patterns weren&#8217;t necessarily present in the training data, the models <strong>discover </strong>them through these RL methods.</p><h2>What Happens Inside the Model? (4 Core Steps)</h2><p>Now that the model was trained, let&#8217;s explore how an LLM like ChatGPT works under the hood when you interact with it.</p><p>Let&#8217;s say you have a torn recipe that looks like: </p><blockquote><p>&#8220;To make a chocolate cake, first preheat the&#8230;&#8221;</p></blockquote><p>You would easily guess that &#8220;oven&#8221; is the next word. Let&#8217;s explore how an LLM arrives at this prediction in four key steps.</p><h3>Tokenization: Translating Words Into Numbers</h3><p><strong>Tokens are the true &#8220;atoms&#8221; of LLMs.</strong></p><p>Everything an LLM does, whether it&#8217;s generating fluent text or hallucinating facts, emerges from how it processes tokens. In fact, poor tokenization often hurts performance more than having fewer parameters.</p><p>But what exactly are tokens, and why do we need them?</p><p>To work with text, models need to convert it into numbers. The most na&#239;ve approach is to treat each <strong>character</strong> as a token and assign it a number. But this leads to two big problems:</p><ol><li><p><strong>Sequences become extremely long</strong>, which slows everything down&#8212;training, inference, memory use.</p></li><li><p><strong>Patterns become harder to learn.</strong> At the character level, meaningful structures are broken into tiny pieces. That makes it much harder for the model to learn how language actually works.</p></li></ol><p>Think about it: when you write a sentence, you don&#8217;t think one letter at a time&#8212;you think in words or phrases.</p><p>The next idea might be: just assign an ID to every <strong>word</strong>. That seems more natural, but it creates new issues:</p><ol><li><p><strong>Rare words are a problem.</strong> If a word barely appears in the training data, the model won&#8217;t learn much about it.</p></li><li><p><strong>Misspellings, slang, and new words break the system.</strong> With a pure word-level approach, the model has no way to handle something it hasn&#8217;t seen before.</p></li></ol><h4>The Subword Solution</h4><p>So we split the difference. Instead of characters or full words, tokenizers break text into <strong>subword units</strong>&#8212;smaller chunks that balance vocabulary size with expressive power.</p><p>Take the word &#8220;preheat.&#8221; It splits into two tokens: <code>"pre"</code> and <code>"heat"</code>. This allows the model to:</p><ul><li><p>Learn meanings more efficiently by sharing representations across related words (e.g., &#8220;heat,&#8221; &#8220;heatting,&#8221; &#8220;preheat&#8221;).</p></li><li><p>Understand rare or unseen words by recombining known pieces.</p></li></ul><blockquote><p><em>Asking an LLM to count how many letters are in a word often fails, because it never sees letters. It sees tokens, which might represent whole words or subwords.</em></p></blockquote><p><strong>Tokenizers are vocabularies</strong> that translate text into numbers (and back), we will explore how they are created in separate issues. For now it&#8217;s important to understand how text is converted to tokens during this first step.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ulf0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ulf0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 424w, https://substackcdn.com/image/fetch/$s_!ulf0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 848w, https://substackcdn.com/image/fetch/$s_!ulf0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 1272w, https://substackcdn.com/image/fetch/$s_!ulf0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ulf0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png" width="1280" height="214" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:214,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171501,&quot;alt&quot;:&quot;Example of tokenization for an LLM&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/166234158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example of tokenization for an LLM" title="Example of tokenization for an LLM" srcset="https://substackcdn.com/image/fetch/$s_!ulf0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 424w, https://substackcdn.com/image/fetch/$s_!ulf0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 848w, https://substackcdn.com/image/fetch/$s_!ulf0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 1272w, https://substackcdn.com/image/fetch/$s_!ulf0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9030f0e-0d5c-4045-9df0-a17863a6cbf9_1280x214.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">How GPT-4o tokenizer translates our example</figcaption></figure></div><p>You can play with tokenizers and explore how different models &#8220;see&#8221; the input text using <a href="https://tiktokenizer.vercel.app/">tiktokenizer</a>.</p><blockquote><p><em>Tokenization isn&#8217;t just a preprocessing detail - it shapes how the entire model understands language.</em></p></blockquote><h3>Embedding: Understanding words&#8217; meaning</h3><p>After text is tokenized, the first step in an LLM is to convert each token into a dense vector known as an <strong>embedding</strong>. These vectors live in a high-dimensional space (often with hundreds or even thousands of dimensions) where tokens with related meanings are positioned close together. For example, &#8220;cake&#8221; near &#8220;pastry&#8221; or &#8220;chocolate&#8221; near &#8220;vanilla.&#8221; This mapping is done through a learned <strong>embedding table</strong>, which assigns each token an initial vector based on patterns seen during pre-training. At this stage, embeddings are static: they don&#8217;t yet account for context.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EoLo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EoLo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 424w, https://substackcdn.com/image/fetch/$s_!EoLo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 848w, https://substackcdn.com/image/fetch/$s_!EoLo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 1272w, https://substackcdn.com/image/fetch/$s_!EoLo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EoLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png" width="1280" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:621909,&quot;alt&quot;:&quot;Token embeddings group related words together&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/166234158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Token embeddings group related words together" title="Token embeddings group related words together" srcset="https://substackcdn.com/image/fetch/$s_!EoLo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 424w, https://substackcdn.com/image/fetch/$s_!EoLo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 848w, https://substackcdn.com/image/fetch/$s_!EoLo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 1272w, https://substackcdn.com/image/fetch/$s_!EoLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17fc986-620d-4e39-bab2-a2b20af7ff17_1280x785.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of word embeddings in 2D space</figcaption></figure></div><p>Still, even these initial embeddings encode rich semantic structure. They allow the model to compare meanings, detect similarities, and perform simple conceptual arithmetic, like subtracting &#8220;man&#8221; from &#8220;king&#8221; and adding &#8220;woman&#8221; to get something close to &#8220;queen.&#8221; Embeddings also play a central role in external tasks like <strong>retrieval</strong> or <strong>search</strong>, where specialized embedding models are trained to produce vector representations of entire passages or queries.</p><h3>Attention: Understanding the context</h3><p>Once tokens are embedded, the model needs more than just their meanings&#8212;it also needs to understand their order. Unlike humans, it has no built-in sense of sequence, so positional information is added to the token vectors. This helps the model distinguish between phrases like &#8220;first preheat the&#8221; and &#8220;preheat the first.&#8221;</p><p>With position and meaning combined, the model begins its core task: connecting the dots through <strong>attention</strong>. Attention layers allow the model to look at all other words in the sentence and decide which ones matter most.</p><p>Here&#8217;s how: each word creates a <em><strong>query</strong></em>, and compares it to <em><strong>keys</strong></em> from all the other words to see which ones are most relevant. If a match is strong (like between the query &#8220;cake&#8221; and the key &#8220;chocolate&#8221;) the model pays more attention to that connection. The actual content that gets passed along is stored in <em><strong>values</strong></em>, which are blended based on how strong each match is.</p><p>So when it encounters &#8220;chocolate cake,&#8221; attention strengthens the link between the two, refining the meaning of &#8220;cake&#8221; into something more specific.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3in4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3in4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3in4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3in4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3in4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3in4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg" width="724" height="357.475" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1280,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:268272,&quot;alt&quot;:&quot;Attention mechanism in large language models helped to rest the meaning of the words from context&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/166234158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717a80-35d9-4b7d-b622-c5369a5bd445_1280x853.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Attention mechanism in large language models helped to rest the meaning of the words from context" title="Attention mechanism in large language models helped to rest the meaning of the words from context" srcset="https://substackcdn.com/image/fetch/$s_!3in4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3in4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3in4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3in4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f2daec1-eb08-4f9d-a4b2-3049bd0632c3_1280x632.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How attention refines meaning of 'cake' into 'chocolate cake'.</figcaption></figure></div><p>This process repeats across many layers. Each one applies attention to capture relationships, followed by a feed-forward network that transforms the results. With every pass, the model deepens its understanding by layering new patterns onto old ones.</p><p>By the final layer, &#8220;cake&#8221; isn&#8217;t just a baked good, it&#8217;s a chocolate cake being prepared in an oven. The meaning has evolved through a sequence of updates shaped by the entire sentence.</p><blockquote><p><em>This ability to build meaning through understanding connections is what gives LLMs their power.</em></p></blockquote><h3>Sampling: Choosing the next word</h3><p>Now that the model understands we&#8217;re talking about a chocolate cake, it&#8217;s ready to predict what comes next. After passing through all layers, final representation is used to compute a score (<strong>logit</strong>) for every word in the vocabulary. These scores are turned into <strong>probabilities</strong> using a softmax function.</p><p>For example:</p><pre><code>Oven       &#8211; 90%  
Microwave  &#8211; 5%  
Pan        &#8211; 3%  
Other      &#8211; 2% </code></pre><p>Here, &#8220;oven&#8221; clearly stands out as the most likely next word. Instead of always picking the top one, we <strong>sample</strong>. It&#8217;s like rolling weighted dice, where higher-probability tokens are more likely to be chosen.</p><p>This <strong>sampling step is what gives LLMs their creativity and diversity</strong>. Without it, outputs would be repetitive and dull. Every recipe would look the same. </p><p>There are different sampling strategies that allow you to <strong>steer the model&#8217;s output</strong> toward different goals: more creative, more predictable, more diverse, or more structured. There is another <a href="https://www.aiunpacked.net/p/sampling-in-large-language-models">issue that explains sampling in LLMs</a> in detail.</p><blockquote><p><em><a href="https://www.oreilly.com/library/view/ai-engineering/9781098166298/">&#8220;Choosing the right sampling strategy can significantly boost a model&#8217;s performance with relatively little effort&#8221; - Chip Huyen</a></em></p></blockquote><div><hr></div><p>This entire process of understanding, prediction, and sampling, continues until the recipe is complete or reaches a natural stopping point, such as reached limit of <strong>max tokens </strong>or generated <strong>end-of-sequence</strong> special token.</p><p>Think of the whole process like a super-advanced version of completing a sentence, where each word choice is informed by understanding the meaning of all previous words and their relationships to each other. The model does this by converting <strong>words to numbers</strong>, understanding their basic <strong>meanings</strong>, analyzing their <strong>relationships</strong>, making informed <strong>predictions</strong>, and building the response one word at a time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NbBP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NbBP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 424w, https://substackcdn.com/image/fetch/$s_!NbBP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 848w, https://substackcdn.com/image/fetch/$s_!NbBP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 1272w, https://substackcdn.com/image/fetch/$s_!NbBP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NbBP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png" width="1456" height="621" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:621,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2227934,&quot;alt&quot;:&quot;How LLM processes text&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/166234158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How LLM processes text" title="How LLM processes text" srcset="https://substackcdn.com/image/fetch/$s_!NbBP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 424w, https://substackcdn.com/image/fetch/$s_!NbBP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 848w, https://substackcdn.com/image/fetch/$s_!NbBP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 1272w, https://substackcdn.com/image/fetch/$s_!NbBP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce090c08-88ce-40be-8359-2a28100c7d7f_2280x973.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Basic flowchart of how LLM processes text</figcaption></figure></div><h2>Limitations &amp; Mitigations</h2><p>Despite their impressive capabilities, LLMs are not magic. A useful way to think about their limits is via the <strong>&#8220;Swiss cheese&#8221; </strong>model, formulated by <a href="https://karpathy.ai/">Andrej Karpathy</a>:</p><blockquote><p><em>LLMs are solid and capable overall, but full of unpredictable holes.</em></p></blockquote><p>You can get fluent, intelligent output one moment and nonsense the next. Understanding these limitations helps avoid mistakes and gives you ways to prompt more effectively.</p><h3>Hallucinations and Knowledge Cutoff</h3><p>LLMs like ChatGPT are trained to be <strong>helpful assistants</strong> that always try to answer your questions. That&#8217;s why, even when they don&#8217;t know something, they might still respond politely and <strong>confidently</strong>, and sometimes <strong>incorrectly</strong>. This is called hallucination.</p><p><em>Note: Common knowledge is reinforced by frequent patterns in the training data, but rare or obscure facts are less reliably encoded and more prone to errors.</em></p><p><strong>Mitigation</strong></p><ul><li><p>Give the model enough <strong>context</strong>, or</p></li><li><p>Use <strong>search tools</strong> if possible.</p></li><li><p>For mission-critical use, <strong>validate output</strong> with other systems.</p></li></ul><h3>Math and Spelling</h3><p>LLMs struggle with precise tasks like counting or character indexing because:</p><ul><li><p>They <strong>operate on tokens</strong>, not characters.</p><p>For example, &#8220;berry&#8221; might be a single token, so the model doesn&#8217;t "see" individual letters and thus <a href="https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618/4">doesn&#8217;t know how many &#8220;r&#8221;s in &#8220;strawberry&#8221;</a>.</p></li><li><p>Arithmetic is not performed symbolically but <strong>learned statistically</strong>.<br>As we've discussed, the model works by predicting the most likely next token based on patterns in its training data. That&#8217;s why, for complex or uncommon equations, it may generate answers that sound plausible, but are actually incorrect.</p></li></ul><p><strong>Mitigation</strong></p><ul><li><p>Let the model <strong>use tool</strong> such as <strong>code</strong>.</p><p>In that way, instead of performing calculations by predicting next tokens, it will <strong>write a piece of code</strong> that will do this programmatically and then give you an answer.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Qiv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Qiv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 424w, https://substackcdn.com/image/fetch/$s_!6Qiv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 848w, https://substackcdn.com/image/fetch/$s_!6Qiv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!6Qiv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Qiv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png" width="728" height="655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1310,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:317447,&quot;alt&quot;:&quot;ChatGPT does math with and without tools&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/166234158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="ChatGPT does math with and without tools" title="ChatGPT does math with and without tools" srcset="https://substackcdn.com/image/fetch/$s_!6Qiv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 424w, https://substackcdn.com/image/fetch/$s_!6Qiv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 848w, https://substackcdn.com/image/fetch/$s_!6Qiv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!6Qiv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1b3b1f-e838-4b52-b4b8-4da4f4541467_1690x1520.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Limited Context Window</h3><p>LLMs have a limited memory: they can only process a certain number of tokens at a time. This limit is known as the <strong>context window</strong>. It is the maximum number of tokens the model can &#8220;see&#8221; and use to generate a response.</p><p>For example, GPT-4 can handle up to <strong>128,000 tokens</strong>, which covers hundreds of pages of text. But anything beyond that is invisible to the model. It doesn&#8217;t remember earlier parts unless they fall within the current window.</p><p>Even within the token limit, performance can degrade as the input gets longer. Models tend to focus more on the most recent tokens and may <a href="https://arxiv.org/abs/2307.03172">overlook important details in the middle</a>. So while longer context windows are useful, they come with trade-offs in <strong>accuracy, speed, and cost</strong>.</p><p><strong>Mitigation</strong></p><ul><li><p>Restart conversations when they get too long.</p></li><li><p>Repeat or summarize key information periodically.</p></li><li><p>Put important information at the beginning and near the end.</p></li></ul><h3>Prompting Tips</h3><p>Keep in mind the golden rule of working with LLMs:</p><p><strong>Better input &#8594; better output.</strong></p><blockquote><p><em>LLMs don&#8217;t read your mind. They complete patterns. What you prompt is what you&#8217;ll receive.</em></p></blockquote><ol><li><p>A clear prompt often follows a simple structure: </p><ol><li><p><strong>Set the persona</strong> to give the model a role or mindset</p></li><li><p><strong>Provide context</strong> with any background or constraints the model should know</p></li><li><p><strong>Specify the task</strong> by clearly stating what you want it to do</p></li><li><p><strong>Declare the format</strong> so it knows how the output should look</p></li></ol></li></ol><p>Prompt example:</p><pre><code>You are a travel writer.  
Here&#8217;s background info on Paris: I have 10 hours lay over.  
List 5 must-see landmarks.  
Format: bullet points with 1-sentence descriptions.</code></pre><ol start="2"><li><p>To give model better understanding of the task and/or output format, you can <strong>provide examples</strong>. This is also called <strong><a href="https://www.aiunpacked.net/i/166889915/few-shot-prompting">few-shot prompting</a></strong>.</p></li></ol><p>Prompt example:</p><pre><code>Please convert HTML to markdown. 
Here are some examples:
 Input: &lt;h1&gt;Header&lt;/h1&gt;
 Output: # Header
Convert this: &lt;b&gt;Bold Text&lt;/b&gt;</code></pre><ol start="3"><li><p>If you are asking to solve complex task that requires logic reasoning, encourage a model to <strong>think step-by-step</strong>. This technique is called <strong><a href="https://www.aiunpacked.net/i/166889915/chain-of-thought">Chain-of-Thought</a></strong>.</p></li></ol><div><hr></div><h2>Further Reading</h2><ul><li><p><strong><a href="https://www.youtube.com/watch?v=7xTGNNLPyMI">Deep Dive into LLMs</a></strong> by <a href="https://karpathy.ai/">Andrej Kaprathy</a></p></li><li><p><strong><a href="https://www.youtube.com/watch?v=wjZofJX0v4M&amp;list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&amp;index=6">How LLMs Work - Explained Visually</a></strong> by <a href="https://www.youtube.com/c/3blue1brown">3blue1brown</a></p></li><li><p><strong><a href="https://arxiv.org/pdf/2402.06196">Large Language Models: a Survey</a> </strong>- paper</p></li><li><p><strong><a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/">Hands-on LLMs</a> </strong>- book by <a href="https://jalammar.github.io/">Jay Alammar</a> and <a href="https://www.maartengrootendorst.com/">Maarten Grootendorst</a> </p></li><li><p><strong><a href="https://www.manning.com/books/build-a-large-language-model-from-scratch">Build an LLM</a> </strong>- book by <a href="https://sebastianraschka.com/">Sebastian Raschka</a></p><div><hr></div></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading <em><strong><a href="http://www.aiunpacked.net">AI Engineering Unpacked</a></strong></em><strong>!</strong> Subscribe for free to learn how AI works and how to build real-world AI applications.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[What is AI Engineering?]]></title><description><![CDATA[Unpacking the mindset and methods of AI Engineering.]]></description><link>https://www.aiunpacked.net/p/what-is-ai-engineering</link><guid isPermaLink="false">https://www.aiunpacked.net/p/what-is-ai-engineering</guid><dc:creator><![CDATA[Max]]></dc:creator><pubDate>Wed, 11 Jun 2025 09:30:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c08ffd-a9d2-4665-9b4a-0a674ad12c4b_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Artificial Intelligence</strong> (AI) is everywhere now. But just a few years ago, building intelligent software meant months of data preparation, model training, and complex infrastructure. It felt like something only research labs or tech giants could afford.</p><p>That&#8217;s no longer true.</p><p>With just a few lines of code, you can plug into some of the most powerful AI models ever created. These models are your building blocks, they&#8217;re like LEGO bricks. You don&#8217;t need to shape each brick yourself. Just imagine what to build, put pieces together, and bring your ideas to life.</p><p>This is <strong>AI Engineering.</strong> It is not about creating models from scratch, but about turning powerful models into useful products. You focus on the design, function, and impact.</p><p>No PhD required. No need to be a machine learning expert. The tools are accessible, and the opportunity is enormous. Today, everyone can start building AI applications.</p><blockquote><p><em><a href="https://www.oreilly.com/library/view/ai-engineering/9781098166298/">AI Engineering is one of the fastest, and quite possibly the fastest-growing, engineering discipline.</a></em></p></blockquote><p>So whether you're an experienced software engineer or a curious builder, this newsletter will help you to bridge theory and practice. You&#8217;ll learn the key ideas behind modern AI systems, and how to apply them to build real-world products.</p><p>Welcome to <em><strong>AI Engineering Unpacked</strong></em>.</p><h2>From ML Engineering to AI Engineering</h2><p>AI applications aren&#8217;t new. Translation apps, camera autofocus, spam filters, these have all used AI for years. But building them used to be slow and expensive. Teams of ML researchers and engineers had to curate labeled data, design and train models, and deploy custom infrastructure. It could take months to ship even a basic product.</p><p>That was <strong>classical ML engineering</strong>: start with data, build a model, then wrap it in an application.</p><p>Today, that process has flipped.</p><p>With <strong>Large Language Models</strong> (LLMs) at your fingertips, you can build a translation app or a chatbot in a single evening. AI engineers no longer begin with data pipelines or model training. They start with the problem, design the user experience, and plug in powerful models to solve it. Only then do they customize, optimize, or fine-tune if needed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TWnN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TWnN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TWnN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TWnN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TWnN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TWnN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg" width="1714" height="625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:625,&quot;width&quot;:1714,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183987,&quot;alt&quot;:&quot;AI Engineer vs ML Engineer&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/165390267?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ef9009-dda6-4b0a-a39b-5f79d1c18e0e_2010x859.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AI Engineer vs ML Engineer" title="AI Engineer vs ML Engineer" srcset="https://substackcdn.com/image/fetch/$s_!TWnN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TWnN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TWnN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TWnN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5dbc403-839d-4840-baf3-1f3f6fc389fe_1714x625.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Inspired by &#8220;<em><a href="https://www.latent.space/p/ai-engineer">The Rise of the AI Engineer</a></em>&#8221;</figcaption></figure></div><p>This shift is changing what the role looks like. <strong>AI engineering</strong> blends software development, systems thinking, and human-centered design. It's less about training models and more about integrating intelligence into products. Pre-trained models are core components now, and AI engineering techniques are becoming standard tools.</p><p>The job now looks a lot like full-stack engineering, with a deep understanding of <a href="https://www.aiunpacked.net/p/large-language-models-explained">how large language models work</a> under the hood.</p><p>From my own experience as Head of AI, this has changed how I hire. I don&#8217;t just look for ML expertise, I look for software engineering skills as well. It&#8217;s not just about knowing the models, it&#8217;s about knowing how to<strong> ship great products</strong>. That blend is what makes someone a great AI engineer.</p><p>That&#8217;s the essence of AI engineering: fast iteration, user focus, and turning cutting-edge models into real-world impact. You don&#8217;t need to wait to get started. The tools are here. And you can learn by building. Today.</p><h2>What has changed?</h2><p>What made this leap possible is a convergence of key advancements:</p><ul><li><p><strong>Scalable training methods</strong>: especially through self-supervised learning, which unlocked ways to train models without labeled data.</p></li><li><p><strong>Smarter architectures</strong>: like transformers, which enabled generalization across different tasks.</p></li><li><p><strong>Advances in hardware and distributed training</strong>: which made it feasible to train enormous models on vast datasets and run large-scale experiments.</p></li></ul><p>These breakthroughs led to models that learned broad patterns across language, code, and images. Scaling laws taught us that bigger models, given the right ingredients, get dramatically better. Suddenly, one model could answer questions, write code, summarize documents, and carry on a conversation.</p><p>But what changed everything wasn&#8217;t just that models got better, it&#8217;s that they became <strong>accessible</strong>.</p><p><strong>Model-as-a-service</strong> flipped the AI equation. Now you can compose, customize, and deploy intelligent systems without ever developing a model. This lowered the barrier to entry, redefined who can build with AI, and what gets built.</p><p>AI isn&#8217;t just a research project anymore, it&#8217;s a <strong>software primitive</strong>. What used to be a machine learning challenge is now a software engineering opportunity.</p><p>The result? An explosion of AI-native products:</p><ul><li><p>Developers are shipping AI features in days and startups are launching products that would&#8217;ve taken years to build from scratch!</p></li><li><p>Entire workflows are being rebuilt around intelligent systems.</p></li></ul><blockquote><p><em><a href="https://greylock.com/greymatter/sam-altman-ai-for-the-next-era/">Sam Altman (OpenAI CEO) believes future AI value will come from customizing foundational models, not building them from scratch.</a></em></p></blockquote><p>And the potential is massive. It&#8217;s already reshaping how we work, learn, and create. This shift isn&#8217;t just technological; it's economic. <a href="https://www.pwc.com/gx/en/issues/analytics/assets/pwc-ai-analysis-sizing-the-prize-report.pdf">PwC predicts</a> AI could contribute up to <strong>$15.7 trillion</strong> to the global economy by 2030, with more than half of that driven by<strong> productivity gains</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OOYO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OOYO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 424w, https://substackcdn.com/image/fetch/$s_!OOYO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 848w, https://substackcdn.com/image/fetch/$s_!OOYO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 1272w, https://substackcdn.com/image/fetch/$s_!OOYO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OOYO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png" width="1456" height="444" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134799,&quot;alt&quot;:&quot;AI impact on global economy&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/165390267?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AI impact on global economy" title="AI impact on global economy" srcset="https://substackcdn.com/image/fetch/$s_!OOYO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 424w, https://substackcdn.com/image/fetch/$s_!OOYO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 848w, https://substackcdn.com/image/fetch/$s_!OOYO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 1272w, https://substackcdn.com/image/fetch/$s_!OOYO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b48b44-57f7-4dfd-97fd-0593e0bafb54_2448x746.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Where will the value gains come from with AI? - </em><a href="https://www.pwc.com/gx/en/issues/analytics/assets/pwc-ai-analysis-sizing-the-prize-report.pdf">Sizing the Price</a></figcaption></figure></div><h2>Core Techniques</h2><p>Let&#8217;s say your company wants to build a customer support chatbot. One that can answer user questions, handle orders, and maybe even process refunds. The AI engineer&#8217;s first job isn&#8217;t to dive into code, but to deeply<strong> understand the use case</strong>. What should the assistant know? How should it behave? What actions should it take? Most importantly: what does success look like, and how it will be measured?</p><p>Only then will the AI engineer begin <strong>customizing the model</strong> for the specific task.</p><h4>1. Guide with Prompts</h4><p>The first step is <strong>prompt engineering</strong>. This means crafting natural language instructions that guide the model&#8217;s behavior. You can define the assistant&#8217;s role, set its tone, and provide examples or constraints. When crafted effectively, prompts can deliver surprisingly strong results with minimal effort. You can learn more about prompt engineering in <a href="https://www.aiunpacked.net/p/prompt-engineering-guide">this issue</a>.</p><h4>2. Bring in Knowledge</h4><p>But prompts have limits. If the model struggles to answer specific questions, such as details about your company&#8217;s return policy, you need to give it access to external knowledge. This is where <strong>retrieval-augmented generation (RAG)</strong> comes in. Instead of packing all relevant info into a prompt, RAG pulls the right data on demand and feeds it to the model as context. This improves accuracy and expands the model&#8217;s knowledge without retraining it.</p><h4>3. Change the behavior</h4><p>If that still isn&#8217;t enough, and the model needs to follow more specific behavior or tone, <strong>fine-tuning</strong> may be the next step. This involves adapting a model on your own data to consistently adjust its outputs. Fine-tuning is more expensive and complex, so it is used only when necessary.</p><h4>4. Add Autonomy</h4><p>For even more advanced tasks, where the assistant needs to reason, plan, or carry out multi-step actions, such as verifying identity, checking inventory, and issuing a refund, you might explore <strong>agentic patterns</strong>. These systems treat the model as a reasoning engine, wrapped in tools, memory, and logic to act more autonomously. AI agents are promising, but still an area of active exploration in AI engineering.</p><p>Together, these techniques form the core toolkit of AI engineers. Knowing when and how to use them is key to building reliable, intelligent applications.</p><blockquote><p><em>&#8220;<a href="https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html">While fancy new frameworks and fine-tuning can be useful for many projects, they shouldn&#8217;t be your first course of action.</a>&#8221; - Chip Huyen</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dlAR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dlAR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!dlAR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!dlAR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!dlAR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dlAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/165390267?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dlAR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!dlAR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!dlAR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!dlAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc94ade8a-9ee9-454b-8de6-0080e34a7984_1536x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Challenges</h2><p>One of the core challenges in AI engineering is <strong>evaluation</strong>. Many tasks are open-ended, with no single correct answer, making it hard to measure progress or define success. Even for summarization is subjective. Let alone question answering or agent-based reasoning. Standard benchmarks often fall short, so teams rely on custom metrics, test suites, and real-time user feedback to track performance over time.</p><blockquote><p><em>&#8220;<a href="https://arxiv.org/pdf/2406.03339">Currently, there are no common methods or agreed-upon best practices to evaluate LLM-based applications.</a>&#8221;</em></p></blockquote><p>Another big challenge is <strong>latency and cost</strong>. LLMs are both computationally intensive and expensive to run. Even simple queries can take several seconds and require substantial compute resources. Tasks that require multi-step reasoning, such as planning or tool use, make both latency and cost worse. In user-facing applications, this kind of latency breaks the experience. No matter how impressive the output, if it takes too long, people won&#8217;t wait. Optimizing for speed while maintaining reliability and quality is a major ongoing challenge.</p><blockquote><p><em>&#8220;<a href="https://youtu.be/9V6tWC4CdFQ?t=2263">Sometimes latency may be even more important than intelligence</a>&#8221;</em> <em>- Lex Fridman</em></p></blockquote><p><strong>Reliability</strong> is equally difficult. These models are inherently unpredictable. A small change in input can lead to drastically different output, and the same prompt might not return the same result twice. This non-determinism makes debugging feel more like investigation than engineering. Guardrails and filters can improve behavior, but each layer adds complexity, introduces new failure points, and adds latency.</p><p>Building a prototype with generative AI is fast, turning it into a <strong>production-ready</strong> system is a different challenge entirely. What I&#8217;ve learned through building these systems is to start simple, ship quickly, and add complexity only when there is a clear reason to do so. In AI engineering, that discipline is necessary.</p><blockquote><p><em>&#8220;<a href="https://www.anthropic.com/engineering/building-effective-agents">When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed.</a>&#8221; - Anthropic</em></p></blockquote><h2>Your Jump-Start Plan</h2><p>I believe everyone can become an AI Engineer and<strong> the best way to learn is by building.</strong></p><p>If you've never worked with large language models before, now is the perfect time to start. You don&#8217;t need to understand all the internals. Just pick a simple idea and experiment.</p><p>Here&#8217;s a quick jump-start plan:</p><h4>1. Brainstorm an idea</h4><p>Think of a small, valuable use case. A great starting point is a task you do often, or a workflow you could automate.</p><h4>2. Break it down</h4><p>Take your idea and divide it into smaller steps. This helps you understand where LLMs can help.</p><h4>3. Build using an LLM API</h4><p>Use a foundation model like <strong>Gemini</strong> to start prototyping. Google&#8217;s Gemini API has a generous <strong>free tier</strong>, so you can get started without spending anything. Just go to <a href="https://ai.google.dev/gemini-api/docs">their website</a>, create an API key, and start building!</p><p>Here is an example to prompt a powerful model using just a few lines of Python code:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bcJ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bcJ0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 424w, https://substackcdn.com/image/fetch/$s_!bcJ0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 848w, https://substackcdn.com/image/fetch/$s_!bcJ0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 1272w, https://substackcdn.com/image/fetch/$s_!bcJ0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bcJ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png" width="1456" height="601" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:601,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:298359,&quot;alt&quot;:&quot;Gemini API Quickstart&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiengineeringunpacked.substack.com/i/165390267?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Gemini API Quickstart" title="Gemini API Quickstart" srcset="https://substackcdn.com/image/fetch/$s_!bcJ0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 424w, https://substackcdn.com/image/fetch/$s_!bcJ0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 848w, https://substackcdn.com/image/fetch/$s_!bcJ0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 1272w, https://substackcdn.com/image/fetch/$s_!bcJ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bbe70a-308f-4a39-9879-11cbf2c02eed_3324x1372.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://ai.google.dev/gemini-api/docs">Gemini API Quickstart</a></figcaption></figure></div><p>To help you get started, I&#8217;ve created a simple <a href="https://github.com/maxmuzych/ai-engineering-unpacked/blob/main/what-is-aie/ai_learning_coach.ipynb">example</a> that walks you through building an <strong>AI Learning Coach </strong>chatbot. It&#8217;s a real-world use case that demonstrates how integrate an LLM into your application through API and use basic techniques like prompt engineering and routing.</p><blockquote><p><em>Don&#8217;t aim for perfection. Start exploring, building, and learning.</em></p></blockquote><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aiunpacked.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading <em><strong><a href="http://www.aiunpacked.net">AI Engineering Unpacked</a></strong></em>! Subscribe for free to learn how AI works and how to build real-world AI applications.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item></channel></rss>