AI Engineering Unpacked

Quantization Made Simple: How to Run Big Models on Small Hardware?

Max — Tue, 28 Oct 2025 13:07:34 GMT

When I worked in the healthcare domain, we faced a problem that probably sounds familiar to many of you. We needed to deploy a Large Language Model (LLM), but because of data privacy, everything had to stay on our client’s hardware. No cloud APIs. No external servers. Just us and their single GPU with 16GB of memory. Our specialized LLM had 8 billion parameters. The math was simple and brutal. It wouldn’t fit.

Through a technique called quantization, we managed to run that model smoothly on hardware that should have been too small. This post will help you understand what makes LLMs so demanding on memory, what quantization actually does to solve this problem, and how it manages to shrink models without breaking them. So let’s get into it!

Before continuing, take a look at this article to get a better understanding of how LLMs work.

Large Language Models Explained

Why You Should Care About This

LLMs are getting absurdly large. Some models now have hundreds of billions of parameters, with the largest reaching into the trillions. Even the “small” 7-billion-parameter models need significant hardware to run. This creates real problems! Renting GPUs with enough memory gets expensive fast. Not everyone can or wants to use cloud APIs. Developers want to run models locally on their laptops. Like our healthcare case, some data simply cannot leave the building due to privacy requirements.

Quantization offers a solution. This technique can cut your memory requirements in half or even to a quarter with barely any performance loss. That 16GB model can run on 8GB, sometimes even 4GB.

The Big Picture: What is Quantization?

Before we dive into the mechanics, let me give you an intuitive understanding of what we’re trying to achieve. Think about photos on your phone. You could store every picture in maximum quality RAW format, but that’s impractical. Instead, your phone compresses them to JPG. The files are 10x smaller, yet you barely notice the difference when viewing them.

Quantization does the same thing for LLMs. In simple terms:

Quantization reduces the precision of the numbers that make up your model, making it smaller while maintaining its performance.

It’s a compression technique, but instead of compressing pixels, we’re compressing the mathematical weights that power the model.

How Numbers Work in LLMs

To understand quantization, you need to know just one thing. Everything in a neural network comes down to numbers, billions of them. These numbers are called parameters or weights, and they represent what the model learned during training. They determine how the model processes your input and generates output. Each number is stored in computer memory using bits, which are just 0s and 1s. The more bits you use, the more precise the number becomes, but it also consumes more memory.

Modern LLMs typically use three different precision levels. The first is 16 bits (2 bytes), which is the standard training precision for most models. The second is 8 bits (1 byte), which is a common quantization target that provides 50% memory reduction and 1.56x speed up. The third is 4 bits (0.5 bytes), which is a more aggressive quantization that provides 75% memory reduction.

Number “33” represented with 8 bits

The Memory Math Made Simple

To understand how much memory is required to run an LLM, this is the most important formula you’ll need.

Let’s apply this to a real example with Llama 2 7B. With 16-bit precision, you need 7 billion parameters multiplied by 2 bytes, which equals 14 GB. With 8-bit quantization, you need 7 billion parameters multiplied by 1 byte, which equals 7 GB. With 4-bit quantization, you just need 3.5 GB. Same model, drastically different memory footprint.

During inference, when the model is generating text, you need extra memory for something called KV-cache. This cache stores context from the conversation.

The amount of extra memory depends on size of your context window.

Larger context windows, like 8K or 32K tokens, need significantly more memory than smaller ones like 2K or 4K tokens. For a 7B model in 8-bit with a typical 4K context window, you should plan for around 9GB of VRAM. If you’re tight on VRAM, you can reduce the context window to make the model fit.

How Quantization Actually Works

Now let’s peek under the hood and see what’s actually happening when we quantize a model. I promise to keep it simple, but understanding this will help you make better decisions about when and how to use quantization.

The core idea is that we’re mapping high-precision numbers to low-precision numbers.

Imagine you have a thermometer that measures temperature to the tenth of a degree, showing readings like 68.4°F, 68.8°F, and 69.9°F. Quantization is like switching to a thermometer that only shows whole numbers like 68°F, 69°F, and 70°F. You lose some detail, but you still get useful information.

A Simple Example

Let me show you how this works with a concrete example. Let’s say we want to quantize the number 33 from 8-bit to 4-bit representation.

In 8-bit space, numbers range from -128 to 127, giving us 256 possible values. In 4-bit space, numbers range from -8 to 7, giving us only 16 possible values. To convert between them, we need a scale factor.

The scale factor is calculated by dividing 256 by 16, which gives us 16. Now we can quantize our number. We take 33 and divide it by 16, which gives us 2.0625. After rounding, we get 2.

How numbers are converted from 8 to 4 bits representation

So the number 33 in 8-bit becomes 2 in 4-bit. When we need to use it again, we scale it back up by multiplying 2 by 16, which gives us 32. We lost a tiny bit of precision because 33 became 32, but we saved 50% of the memory.

This process happens for every single weight in the model, billions of times over. The accumulated small losses in precision are what lead to that minimal performance degradation I mentioned earlier.

To learn how different quantization techniques work in more detail, I recommend reading this article by Maarten Grootendorst.

Why This Doesn’t Break Your Model

You might be wondering why losing precision on billions of numbers doesn’t make the model terrible. The answer lies in how neural networks actually work.

LLMs are surprisingly robust to small amounts of noise.

They are extremely well optimized during training, so that they actually learn to be noise-resistant. This built-in resilience is what makes quantization possible without destroying performance.

Additionally, researchers use clever techniques to minimize the impact. Asymmetric quantization adjusts the mapping to better fit the actual distribution of weights. Per-channel quantization uses different scale factors for different parts of the model. Mixed precision keeps critical layers in higher precision while quantizing others more aggressively.

You don’t need to implement these techniques yourself because they’re built into modern quantization tools.

Common Quantization Formats

When you go looking for quantized models, you’ll see several formats. Understanding what they mean will help you choose the right one for your needs.

source Hugging Face

INT8 and Q8_0 refer to 8-bit integer quantization. This format provides 50% memory reduction with 99% or better performance retention. It’s best for production deployments where you want maximum safety and reliability.

GPTQ is a 4-bit quantization method that provides 75% memory reduction with 98% performance retention. It’s optimized for GPU inference and works best when you’re trying to run larger models on consumer hardware.

GGUF (formerly called GGML) is a flexible quantization format that supports anywhere from 2 to 8 bits. You’ll see various formats like Q4_K_M, Q5_K_S, and Q8_0. This format is optimized for CPU and Apple Silicon inference and powers popular tools like Ollama and LM Studio.

Share this with a friend who’s curious about AI Engineering

Performance Expectations

Different quantization levels give you different trade-offs between size and quality. With 8-bit quantization using INT8, you’ll barely notice any difference. I mean it when I say the performance is virtually identical to the original model.

With 4-bit quantization like Q4, you might see a slight quality reduction in very specific edge cases, but most users won’t notice in typical usage. With 3-bit or lower quantization, you’ll see noticeable quality degradation, so only use these formats if you’re desperate for memory.

The sweet spot for most people is 8-bit for critical production use and 4-bit for experimentation and local development.

Your Action Plan

Rule #1: Always Use 8-bit When Running Locally

If you’re deploying an LLM on your own, there’s no reason not to use 8-bit quantization. The performance difference is negligible, and you’ll save 50% on memory costs. It’s essentially free optimization.

Rule #2: Calculate Before You Download

Before pulling a model, you should check whether it’ll actually fit on your hardware. First, find the parameter count, which is usually in the model name like “Llama-2-7b” or “Mistral-7B”. Next, decide on your quantization level. Then apply the formula I showed you earlier. Finally, add a 20% buffer for KV-cache to be safe.

Quick Reference Table:

These estimates assume a 4K token context window. Larger context windows (8K, 32K, etc.) will require additional memory. If you’re constrained by VRAM, you can reduce the context window to fit your hardware.

Rule #3: Where to Find Quantized Models

You have two main options for getting quantized models.

Option 1 is using pre-quantized models on Hugging Face. Most popular models already have pre-quantized versions available. You can search for the model name plus “GPTQ” if you need GPU inference, or the model name plus “GGUF” if you need CPU or Mac inference. For example, instead of searching for “meta-llama/Llama-2-7b-hf”, you would search for “TheBloke/Llama-2-7B-GPTQ”.

You can find quantizations on the model page

Option 2 is quantizing the model yourself if you have a custom model or can’t find what you need. For GPTQ format, you can use the AutoGPTQ library. For GGUF format, you can use the llama.cpp conversion tools. For general quantization, you can use llm-compressor by vLLM.

Most tools require just a single command to quantize your model:

# Example with llm-compressor
llmcompressor quantize your-model --format int8

Rule #4: Test Before You Commit

Before deploying a quantized model, you should run your specific use cases through it. Create a small test set that includes typical queries you expect, edge cases that matter to your application, and quality metrics you care about.

Compare the quantized version against the original. In most cases with 8-bit, you’ll see identical results. With 4-bit, you might see tiny differences that you need to evaluate for your use case.

Wrapping up

Quantization isn’t a hack or a workaround. It’s a fundamental technique that makes LLMs accessible. It’s the reason you can run powerful models on consumer hardware. It’s why small teams can compete with big labs on deployment. It’s how that healthcare project actually shipped.

“Quantization is an essential tool for optimizing LLMs in real-world deployments.”

There are a few key points you should remember:

First, 8-bit quantization is practically free performance-wise, so use it by default.
Second, memory needed equals the billions of parameters multiplied by bytes per weight, multiplied by 1.2 for safety.
Third, most quantized models are pre-made and ready to download.
Fourth, when in doubt, try it because you can always go back to higher precision if needed.

The world of AI is moving fast, but it’s also becoming more accessible.

You don’t need a server farm to run state-of-the-art models anymore. You just need to know how to make them fit.

Now go make that model run on your hardware.

Have questions about quantization or want to share your own deployment story? Comment below and I will respond to every question.

Sampling in Large Language Models

Max — Thu, 18 Sep 2025 14:01:33 GMT

Today we will talk about one of the simplest building blocks of today’s large language models (LLMs), about sampling — strategies that exist, why they are needed, and how they actually work. Basically, I’ll share everything you need to know about sampling. Whether you’re an AI engineer or an enthusiast, this overview will give you the fundamentals needed to better understand and work with these models.

You can expect to get through this issue in about 6 minutes.

What is sampling and why do we need it?

Any artificial neural network (including an LLM) is just an extremely complex mathematical formula. Which means, the output is just a product of inputs and some static matrices. In other words, given the same input, LLM produces the same output every time.

This behaviour is fine for most of the applications: for example when we want our model to predict if an email is spam, we expect the same prediction for the same email every time. But that’s not the case for LLMs, where we often want it to be more “creative” and generate different responses each time we say “Hello”.

So how come LLMs generate each time a different answer?

Sampling Strategies

To answer this question, we first need to understand how these models work. I have a whole issue explaining how LLMs work. The key thing to understand is that at each generation step, the LLM’s final layer assigns a score to every word (token) in its vocabulary. These numbers reflect how likely the model thinks each word should come next.

At each step, an LLM predicts a logit (number) for every possible next token.

An LLM predicting possible completions with raw scores (logits)

The simplest strategy is to choose the token with the highest logit. This is called “greedy decoding” and it produces the same response every time you say “Hello”.

Instead of always picking the top token, some sampling strategies also consider other options. For example, for the sentence “The sky is…” the most likely word is “blue”, but the model might also choose “cloudy”, “gray”, or even “red”. This allows LLMs respond in a more creative and engaging way.

Now that we understand the basics of sampling, let’s look at the strategies most commonly used and how they work.

Converting logits to probabilities

To choose tokens based on probabilities, we first need to convert the logits (raw scores) into probabilities that sum to 1.

The key mathematical formula to make sampling strategies work is softmax.

The softmax equation looks like this:

, where z_i is the logit for token i, and K is the total number of tokens in the vocabulary. This function turns the set of logits into a probability distribution: all values are between 0 and 1, and they add up to 1.

To better understand this, let’s say the model needs to complete the sentence:

The sky is…

It now has the option to choose one of 4 words: blue, cloudy, gray, or red. Each of these words has an assigned logit: 9, 7, 6, and 3, respectively. Softmax converts these logits into a set of probabilities:

blue 84.2%
cloudy 11.4%
gray 4.2%
red 0.2%.

An LLM predicts possible completions with logits, then softmax converts them into a set of probabilities.

Now, instead of choosing word “blue” every time, we can choose one of these words according to the probability distribution. If we randomly sampled from this distribution 100 times, we’d expect to get “blue” about 84 times, “cloudy” about 11 times, etc.

Temperature

One of the most common parameters that control randomness of the output is called temperature.

In the softmax function, temperature is a constant, used to divide all logits. This makes the resulting probability distribution more “sharp” (if temperature < 1) or more “flat” (if it’s > 1). In other words, the higher the temperature is, the closer the probabilities become to each other, so the model is more likely to pick less probable tokens. Lowering the temperature has the opposite effect: it sharpens the distribution and makes the model stick to the most likely tokens.

If we set the temperature (T) to 5 and calculate probabilities for our example again, we would get:

blue 39.7%
cloudy 26.6%
gray 21.8%
red 12.0%.

Shift of probability distribution after increasing the temperature parameter

As you can see, the distribution becomes much flatter, and words that were unlikely before now have a much higher chance of being chosen.

Higher temperature makes the model’s output more diverse but also more “risky”.

It’s common to set temperature to 0 for consistent outputs. Technically, it can’t be 0 since logits can’t be divided by zero. In practice, this means the model simply does “greedy decoding”, skipping adjustment and softmax.

Try adjusting the temperature here and watch how the probability distribution shifts.

Top-K

Calculating softmax for every logit in the LLM vocabulary, which can be as large as 128,000 tokens, is computationally expensive. Instead of sampling from the entire vocabulary, the Top-K strategy considers only the tokens with the top k logits (where k is a parameter). For example, if k = 50, probabilities are calculated only for those 50 tokens instead of all 128,000.

A smaller k value makes the text more predictable but less interesting.

Top-P

As you can imagine, always sampling from the top K tokens can be suboptimal. For a yes/no question, the model should ideally choose between just two tokens: yes or no. But if you ask it to write a poem, you want a larger pool of tokens to encourage creativity.

That’s where the Top-P (Nucleus) sampling comes in. Instead of fixing K, it selects the smallest set of tokens whose probabilities add up to a threshold, usually 0.9 or 0.95. Since the probabilities of all tokens sum to 1, this subset covers the most likely ones while excluding very unlikely options.

In our earlier example, where “blue” has probability 0.84 and “cloudy” 0.11, setting P = 0.95 would limit sampling to these two tokens, since together they reach the threshold.

Top-P strategy doesn’t make sampling more efficient but it makes responses more coherent.

Stopping Condition

So we asked an LLM to complete a sentence. It generated logits for all possible next words, softmax turned them into probabilities, and a sampling strategy picked the next word. The process repeated. But when does it stop?

There are two stopping conditions:

The output hits the maximum token limit.
This is a parameter you can set. Stopping this way is not ideal, since it either cuts the response mid-sentence or produces an overly long, costly output.
The LLM generates an token.
This is the usual and preferred condition. LLMs are trained to produce a special token when the response is complete. You can think of it like pressing “send” after finishing a message.

Constrained Sampling

Many tasks require an LLM to generate output that follows a specific grammar. For example, it might need to produce a valid SQL query or a JSON object that matches a schema. This is critical because LLM outputs are often used in applications, and even a missing bracket in JSON can break downstream steps.

Even prompt engineering won’t guarantee that the LLM will follow your instructions and stick to the right format, whether you say “please” or not. To solve this, we can constrain sampling to tokens that preserve the grammar.

Previous sampling strategies focused on weighted sampling from a subset of tokens based on K or P parameters. Constrained sampling goes even further by allowing the model to choose only tokens that keep the output valid.

This can also speed up generation. Some tokens are almost guaranteed to follow others, such as a closing bracket after an opening one. In these cases, the model can skip sampling and output the token directly.

Constrained sampling paired with greedy decoding might turn your LLM into the most powerful extraction tool.

Constrained sampling is powerful and should be in every AI engineer’s toolkit, but it has downsides. It can be hard to implement, though many providers and engines (such as vLLM) support common grammars out of the box. It has also been shown to reduce LLM performance on reasoning tasks.

What’s next?

Now you should be able to answer these questions confidently:

What is sampling in LLMs and why do we need it?
What sampling strategies exist?
What are the limitations of “greedy decoding”?
How does Top-K and Top-P strategy work?

If you have any questions, leave a comment or reach out to me on LinkedIn.

Prompt Engineering 101

Max — Wed, 02 Jul 2025 13:25:19 GMT

Today, Large Language Models (LLMs) have become so capable that they are used to create, automate, and educate. In the previous issue, I explained how they work. Essentially, LLMs complete the input sequence you give them. This means the way you interact with them directly shapes their behavior. Whether you're building with them or simply using them, knowing how to prompt them is a key skill.

Note: a prompt is the input given to an LLM.

In any AI project that involves an LLM, prompt engineering is often the starting point. With a thoughtfully designed prompt, much of the work can be handled right from the beginning. But the final refinements and reliability are often the hardest to achieve.

Prompting may look simple at first, but under the hood it is a design problem. You are steering a probabilistic system, and small changes in the prompt can lead to very different outputs.

Prompt engineering has even become a standalone job title in some companies. While I believe it should ultimately be part of every AI engineer’s skill set, the fact that it is recognized as its own role shows just how important it has become.

Meme from Reddit

“The Problem is not with prompt engineering. It’s a real and useful skill to have. The problem is when prompt engineering is the only thing people know.”
- OpenAI Research Manager, when interviewed for AIE book.

💡In This Issue

We'll explore how to interact with models more effectively. You’ll learn the core principles behind good prompts, the mechanisms that shape model behavior, and the techniques that separate average outputs from great ones. Whether you're aiming for more control, better results, or just a deeper understanding of how these systems respond, this issue will give you the tools to get there.

Prompt Engineering is the easiest and most common way to adapt LLMs.

Technical Details

Before diving into prompt engineering techniques, it's important to understand a few core concepts.

Tokens

Tokens are the true “atoms” of LLMs.

Models don’t work with text directly, but instead they process and generate tokens. Understanding how tokenization works is important for efficient prompting. While it is out of scope for this issue, here is one thing to keep in mind.

Typos and odd formatting increase token count
Misspelled or oddly structured text may be broken into more tokens, wasting space.

System and User Prompts, and Messages

The system prompt is an initial instruction that sets the tone, style, behavior, or constraints for the model. For example, there is a hidden system prompt behind every ChatGPT conversation. It’s normally hidden from users, but they have been leaked in the past.

If you're a developer using an API, you must define the system prompt yourself. It's typically the first message in the input list, labeled with the role "system", and it serves to guide the model’s behavior at a high level.

The user prompt is what you, or the end user, actually type. There can be multiple user messages over the course of a conversation. These are typically labeled as "user" when using the API, and they contain the actual instructions, questions, or inputs you want the model to respond to.

When using the API, you normally construct a conversation as a list of messages, each with a role: "system", "user", or "assistant".

This message list is then combined into a single prompt behind the scenes using the model's tokenizer. Different providers have slightly different formatting, and some models (like DeepSeek’s R1) even recommend avoiding a system prompt altogether.

Understanding this message structure is key for anyone building interactive applications, especially those that rely on multi-turn conversations or consistent behavior across responses.

Special tokens

Special tokens are reserved tokens that serve structural or functional purposes. They might mark the beginning of a sequence, signal the start or end of a system or user message, or indicate when generation should stop.

For example, once a model generates a special end-of-sequence token, generation is terminated. Otherwise, it would continue until hitting a token limit.

System and User prompts formatted with special tokens for GPT-3.5-turbo

If you're using a model locally, it’s important to ensure your tokenizer adds these tokens correctly. Some tokenizers do this automatically. In my experience, when using LLaMA-3-8B for tool use, it performed poorly without special tokens, but worked well once they were added.

Parametric vs. Non-Parametric Memory

LLMs have two types of memory: parametric and non-parametric.

Parametric memory refers to information stored in the model’s parameters. This knowledge is acquired during training and can only be changed by updating the model’s weights. In other words, parametric memory is fixed unless the model is retrained or fine-tuned.

Non-parametric memory, on the other hand, includes everything the model sees in the current prompt. When we add extra context or information to a prompt, we are relying on non-parametric memory. This is the type of memory most accessible to developers and users.

Chat History

Now that we’ve covered memory types, we can explain how tools like ChatGPT appear to "remember" earlier messages. This is made possible through non-parametric memory.

Each time you send a message, it’s appended to the conversation history. With each new interaction, the full history is passed back to the model as part of the input. This is what allows the model to continue the conversation coherently.

Chat history visualized (from LangChain)

However, as the history grows, each response becomes more expensive to generate. Longer conversations require more tokens and compute. More importantly, LLMs get lost in multi-turn conversations, with an average drop of 39% across six generation tasks. Hence, restarting a conversation when it gets too long is crucial.

Context Window

LLMs have a fixed context window, which is the maximum number of tokens they can process in a single input. If the total prompt exceeds this limit, older parts of the conversation may be truncated or ignored entirely.

The size of this window has increased dramatically over time. GPT-2 had a context window of just 1,024 tokens, while state-of-the-art models like Gemini-2.5-Flash can handle up to 1 million tokens.

Studies have shown that when prompts are very long, models often "forget" information placed in the middle of the input. So while longer context windows allow for more information, they don’t guarantee better performance unless the prompt is structured carefully.

Context length also affects efficiency. Overlong prompts can introduce:

Unnecessary latency
Higher costs

Long prompts can also degrade the model’s performance.

When designing LLM applications, it’s important to balance richness of input with efficiency and relevance.

Prompting Best Practices

The golden rule of working with LLMs is simple: Better Input → Better Output.

Prompting can get incredibly tricky, as there is no guarantee that the model will follow your instructions, especially for smaller models. But systematic approach to prompt engineering can save you a lot of time.

Be Specific

LLMs perform best when your instructions are clear, explicit, and supported with context. Vague prompts often lead to vague or unpredictable responses.

Specify the Role

Assigning a role gives the model behavioral context. For example, “You are a helpful customer support assistant” nudges it toward tone, format, and intent aligned with that persona. If you're using the API, this is typically done through the system prompt.

When using LLM (Claude), you can dramatically improve its performance by using the system prompt to give it a role. - Anthropic

Write Clear Instructions

General prompts like “Help the user” leave too much room for interpretation. Instead, be explicit: “Answer customer questions about subscription plans using a friendly and professional tone.” Clear, specific directives reduce ambiguity and make the model's output more consistent.

Provide the Context

Without context, the model falls back on its internal training data, which may be outdated or misaligned with your task. Include relevant information, like your return policy, in the prompt to reduce hallucinations and increase accuracy.

Provide Examples

If there’s a particular style or format you want, show it. Providing one or more examples helps the model generalize and replicate the expected behavior. This is called In-Context Learning. For instance, if you need responses to be concise, include a short, well-structured example and tell the model to follow that pattern.

Break down complex tasks

LLMs struggle with ambiguity and perform inconsistently on large, multi-step tasks. If possible, decompose the workflow.

Splitting a problem into smaller steps makes your prompts easier to test, debug, and maintain. It also opens the door to parallel execution or routing simpler tasks to smaller, cheaper models.

For example, if your chatbot needs to process a refund, the task might involve:

Identifying which items need to be refunded
Checking refund eligibility
Providing a receipt and follow-up instructions

Rather than asking the model to handle all of this in a single prompt, you can break it into separate prompts and run them sequentially. This modular approach improves reliability and gives you more control over each step.

Example of prompt chaining for refund processing

The more narrow and deterministic your instructions, the more consistent and predictable your outputs will be.

Give the model time to think

Sometimes better results come not from adding more input, but from thinking more carefully.

One way to guide the model is by asking it to solve problems step by step. This helps it stay focused and follow a clearer line of reasoning. Another method asks the model to review and revise its own answer. That adds a layer of self-checking. Both methods aim to reduce errors and improve accuracy. They don’t always come free, as they can slow the response and use up more space.

Iterate

Prompt engineering is iterative by nature. Start with a basic instruction. Watch for errors. Adjust. Repeat.

Use versioned prompts and fixed test sets to evaluate systematically. Run the same prompt across different models to compare results. Intuition helps, but it’s not enough for production.

If you’re building applications, treat prompts like code. Keep them separate from the app’s logic. Version them. Annotate changes.

Without this structure, large-scale reliability is hard to maintain.

Prompt Engineering Techniques

For many users, following best practices will handle most cases. But if you’re building with LLMs, understanding and applying these techniques will allow you to build sophisticated AI applications.

Few-shot Prompting

When you ask an LLM to answer a question without giving any examples, it's called zero-shot prompting. If you include a few examples to show the model what kind of output you expect, this is known as few-shot prompting.

In the now-famous GPT-3 paper “Language Models are Few-Shot Learners”, researchers at OpenAI showed that, with just a handful of examples, LLMs could perform tasks that weren’t explicitly present in their training data, such as translation, question answering, or arithmetic.

In practice, however, few-shot prompting can be a double-edged sword. In my own work with LLaMA-3-8B, I found that few-shot examples sometimes hurt more than they helped. They consume valuable space in the context window (which was only 8,000 tokens in that model), and they can lead the model to copy details from the examples instead of focusing on the input. To avoid this, I recommend using a small number of generic examples (ideally 5 to 10) and abstracting away specifics. For instance, if you're extracting phone numbers, use placeholders like in the examples instead of real data. This is especially important with smaller models.

Chain-of-Thought

Researchers at Google discovered that prompting the model to reason through a task step by step, significantly improves performance on reasoning-heavy problems. This approach, called Chain-of-Thought (CoT) prompting, dramatically boosted PaLM-540B’s performance on a grade school math benchmark from 18% to 57%.

Asking a model to "think step by step" helps, but showing examples of that reasoning works better for specific tasks. A method, called Auto-CoT, aimed to automate this process.

Standard Prompting vs CoT Prompting

The insight here is simple: instead of treating the model like a calculator that produces an answer, you treat it like a problem solver that works through intermediate steps. This was also believed to make the model’s reasoning more transparent. However, Anthropic’s research reveals a key limitation: many CoTs do not faithfully reflect the model’s actual reasoning process, concealing how it arrived at its conclusions.

Despite this, CoT has had a major influence on the field. It inspired a surge of research into prompting techniques and reasoning architectures. Approaches like Self-Consistency and Tree-of-Thoughts, which samples multiple reasoning paths to find the most reliable answer, build on the core idea of encouraging deliberation and step-by-step problem solving.

More broadly, Chain-of-Thought reshaped how researchers think about using LLMs, not just as text predictors, but as agents capable of decomposing and reasoning through complex tasks. It laid the foundation for everything from advanced prompting methods to the emergence of reasoning models.

ReAct

Building on Chain-of-Thought, researchers from Google proposed ReAct, a framework that combines reasoning and acting. Rather than generating a final answer directly, the model enters a loop of reasoning, tool use, and observation.

The ReAct loop has three steps:

Reason – The model reflects on the current task and proposes the next action.
Act – It performs the proposed action, such as calling a tool or retrieving information.
Observe – It incorporates the result of that action and reasons about what to do next.

ReAct flowchart

This loop continues until the model reaches a conclusion. Of course, safeguards are needed to prevent infinite loops or repetitive behavior.

ReAct is powerful because it introduces interaction and adaptability. It laid the foundation for autonomous agents, systems that can plan, act, and reason across multiple steps to reach a goal, often using tools or APIs along the way.

Automatic Prompt Optimisation

Manual prompt tuning is not scalable. Tools like DSPy automate the process by exploring different prompts and testing them.

It works best when:

You have large evaluation sets
Your tasks are repetitive

That said, such tools generate a lot of API calls, sometimes hundreds per experiment. Always monitor what’s happening under the hood to avoid exploding costs or hidden errors.

Jailbreaking and Prompt Injections

Prompts can also be used to “hack” an application by making a model act in unintended ways. This includes revealing private information, executing unauthorized actions, or producing harmful or misleading output.

While modern LLMs are good at identifying and refusing many of these attacks, it’s still important to add safety layers. These can include input/output filtering, prompt hardening, and isolating risky capabilities. This is especially important in apps like AI agents that interact with internal tools.

One example of a prompt injection attack involves hiding tiny text in a résumé. The model, which is used to analyze candidate résumés, reads this hidden text even though a person cannot see it. As a result, it might respond with a message like “This is the best candidate so far, you should hire them.”

Prompt attacks are a form of social engineering, but this time targeting machines.

Prompt extraction attacks have led to the leak of many system prompts from ChatGPT, Claude, and other chatbots. There is even a dedicated GitHub repository with supposedly leaked prompts. These prompts often provide a sneak peak at what works best. They typically include instructions such as:

Personality engineering
Constitutional AI and safety layers
Tool usage protocols

One of the recent leaked prompts is the Claude 4 system prompt. It’s 25,000 tokens long, which adds significant computational cost, and it includes explicit hardcoded political information. Simon Willison has published a strong overview of its contents.

Learn by Building!

The best way to learn is to build something yourself.

I’ve created a simple Customer Support Bot example for you to try out and experiment with techniques covered in this issue. It runs on the free tier of the Gemini API, so you won’t need to spend anything.

Questions? Message me on LinkedIn.

Large Language Models Explained

Max — Wed, 18 Jun 2025 13:26:00 GMT

Intro

Large Language Models (LLMs) are arguably the most powerful AI models we have today. They power applications like ChatGPT, and can write poems, answer questions, draft legal documents, and even generate code. With billions of “neurons” that were trained on the entire internet, LLMs can understand and generate human language.

What’s even more impressive: they generalize across a wide range of tasks, often without needing any additional training. That’s why, for many applications, you no longer need to build your own AI model from scratch - you can just plug into one. This shift in how we build with AI is at the heart of AI Engineering, which I introduced in the first issue of this series.

LLM use cases

Today, almost anyone can use ChatGPT to learn faster, get work done, or experiment creatively. But while LLMs are everywhere - and CEOs can’t stop talking about them - very few people actually understand how they work under the hood.

If you’re an engineer building with these models, this understanding isn’t optional. It’s what lets you use LLMs effectively, debug weird outputs, and design systems that go beyond prompting.

This is essential not only for developers but for everyday users, so they can better understand the flaws and use these models more efficiently.

💡 In This Issue

In this issue, we’ll build a strong mental model for how Large Language Models actually work. You’ll learn how these models evolved, what they’re really doing when they generate text, and how to work with them effectively.

While we’ll touch on some technical aspects, the focus here is clarity — not complexity. We’ll leave the deep dives (like how attention works) for future issues. Today is all about getting the right mental model.

LLM is “Just” Next-Word Predictor

Ever typed a sentence and watched your phone suggest the next word? Now imagine that - scaled to billions of parameters and trained on most of the internet.

That’s a large language model.

LLMs create responses word by word based on user input.. They are basically predicting the next word but in ways that appear intelligent to humans.

But language modeling isn’t new.

The task of predicting the next word or sequence of words, has evolved over decades from early rule-based systems that were rigid and limited, to statistical n-gram models that introduced probabilities but struggled with longer context, and finally to neural networks like RNNs and LSTMs in the 2010s, which improved performance using deep learning but still faced challenges with long-range dependencies.

Then came a breakthrough.

Transformers Changed Everything

In 2017, Google researchers proposed the new neural network architecture - Transformer in the now-famous paper “Attention is All You Need”.

It introduced self-attention mechanism, allowing models to understand language much better, especially longer sequences.
This also made training vastly more parallelizable - a perfect match for modern compute infrastructure.

Transformers became the foundation of models like BERT, GPT, and LLaMA. Today, nearly every state-of-the-art NLP model uses this architecture.

Transformers can be adapted to different tasks:

Encoders (e.g. BERT) for classification and entity recognition.
Decoders (e.g. GPT) for text generation.
Encoder-decoder models (e.g. T5) for translation, summarization, and question answering. Though today, many of these tasks are handled by decoder-only models.

In this issue, we focus on decoder-only architecture like ChatGPT - the ones that generate language, word by word, to simulate conversation, write code, solve problems, and more.

Emergent capabilities

Even though LLMs are trained just to predict the next word, they can end up doing things that look surprisingly smart.

🧠 Mimicked Reasoning

By generating text one word at a time, they can follow step-by-step reasoning, like solving a math problem or explaining a concept. This “thinking out loud” often leads to better answers, simply by writing down each small step.

🛠 Tool Use

The same word-by-word generation also enables tool use. For example, if connected to a calculator or a search engine, a model can write something like calculate(2 + 2) or search("weather in Paris") and the system will recognize that as a tool call. The model doesn't need to know what a calculator is; it just learns to write the right words to get the job done.

🤖 Agentic Behavior

With the right setup, LLMs can also carry out multi-step tasks—deciding what to do next, using tools, checking results, and continuing—all just by continuing a text. This kind of structured problem-solving is called an agentic workflow, and it’s powered entirely by next-word prediction.

So, these "probabilistic parrots" display surprisingly sophisticated behaviors. Their simple objective, when scaled and trained on diverse data, gives rise to previously unseen capabilities.

Training

Training a neural network involves adjusting its internal parameters, so that its behavior begins to mirror human-like understanding. By showing it tons of examples of input-output pairs, the model starts to uncover patterns in language and uses these to make smart predictions on new, unseen text. For LLMs, this learning happens in two major stages: pre-training and post-training.

Note: LLMs are not operating on raw text. Instead, they operate on tokens. You can think about them as words (e.g. “learn“) or subwords (e.g. “ed”, “ing“).

Pre-training

The first and most computationally intensive phase is called pre-training. Here, the model is exposed to vast amounts of raw text from books, articles, websites, forums, and other public sources. It learns by predicting the next token in a sentence, like completing:

“To make a chocolate cake, first preheat the ...” → “oven”.

This simple game of next-token prediction turns out to be surprisingly powerful. It enables the model to learn grammar, facts about the world, reasoning patterns, and even some basic common sense, all without explicit human supervision. This is why it’s called self-supervised learning: the supervision signal (what the "correct" answer is) comes from the data itself.

Data Collection

LLMs are trained on enormous amounts of text that is far more what any human could absorb. Meta’s LLaMA 3, for example, was trained on 15 trillion tokens, more than a person might read in a lifetime.

They learn not through deep experience, but massive breadth.

To reach this scale, developers crawl the web and license large datasets. Common sources include Common Crawl, which scrapes billions of web pages regularly. The data then passes through filters to improve quality and reduce harm. The filtered open dataset is FineWeb with 15 trillions tokens.

The data collection process remains controversial: many documents are scraped without permission, raising legal and ethical concerns.

Objective: Autoregressive Language Modeling

Most modern LLMs are trained as autoregressive language models. That means they take a sequence of tokens (e.g., words or subwords) and learn to predict the next token, one step at a time.

The training dataset is split into chunks of different size, these chunks are then used to train the model. At each step, the model sees all the previous tokens and generates a probability distribution over what word should come next. It is simply trained to memorize what usually comes next in human language, not to understand explicitly. But as it ingests more data, those patterns begin to encode complex ideas and knowledge structures.

Example of a training sample for an LLM

Infrastructure & Scaling Laws

Training these models requires enormous compute infrastructure. Clusters of specialized GPUs are used to train a model in parallel over weeks or months.

Why train such large models? Because scaling laws show that performance continues to improve as we scale up size of the model and training data. And this in turn requires more compute.

Post-training

At the end of pre-training, we have a base model: a powerful, general-purpose text generator that’s read a large fraction of the internet. But it’s not yet an assistant. If you ask it:

“What’s your name?”

It might respond with:

“What’s your surname?”

because that phrase often follows in web forms the model saw during training.

Even worse, it may reproduce offensive or harmful language seen during training. That’s why base models are typically not exposed directly to users.

To make the model more helpful and harmless, we run it through post-training.

Post-training turns raw linguistic intelligence into trustworthy interaction.

Pre-training unlocks capability. Alignment unlocks usability.

1. Instruction Fine-Tuning

The first step in post-training is supervised fine-tuning (SFT), also called instruction tuning. Here, the model is shown curated examples of how it should behave in assistant-like conversations:

User: What’s your name?  
Assistant: My name is ChatGPT, a language model developed by OpenAI.

This includes both synthetic conversations and manually written examples by human experts. The learning objective is the same as pre-training—predict the next token—but now the training examples are dialog turns, not internet text.

SFT teaches the model how to:

Follow instructions
Be polite and informative
Refuse unsafe or inappropriate requests

It’s how the model begins to simulate helpful behavior.

2. Reinforcement Learning / Preference Optimization

Instruction fine-tuning gets you a competent assistant, but it still imitates human-written answers without deeper judgment. To take it further, we apply reinforcement learning (RL).

There are two major goals here. First - preference alignment - teaches the model to oroduce responses that humans prefer. Second - reasoning emergence - encourages the model to discover and use multi-step reasoning strategies.

Reinforcement Learning from Human Feedback (RLHF)

The core of RLHF is to generate several candidate answers to each prompt and have humans rank them, so the best answer gets the highest score. These rankings are used to train a reward model that estimates how much a human would prefer each response. Next, the language model (policy) is fine-tuned using reinforcement learning (often PPO) to maximize the reward signals. This process allows the model to explore new outputs that go beyond simply imitating training examples, learning to produce responses that align more closely with human preferences.

Note: Some preference alignment methods, like DPO and SimPO, were inspired by RLHF but do not use reinforcement learning. They simplify the process and have been shown to perform as well or better than RLHF in many tasks.

RL Unlocks Reasoning

Perhaps the most exciting result of post-training is that reasoning emerges.

RL-tuned models (e.g., GPT-o1, DeepSeek R1) don’t just answer questions—they think through them:

Break down problems into steps
Double-check answers
Try alternative approaches

These reasoning patterns weren’t necessarily present in the training data, the models discover them through these RL methods.

What Happens Inside the Model? (4 Core Steps)

Now that the model was trained, let’s explore how an LLM like ChatGPT works under the hood when you interact with it.

Let’s say you have a torn recipe that looks like:

“To make a chocolate cake, first preheat the…”

You would easily guess that “oven” is the next word. Let’s explore how an LLM arrives at this prediction in four key steps.

Tokenization: Translating Words Into Numbers

Tokens are the true “atoms” of LLMs.

Everything an LLM does, whether it’s generating fluent text or hallucinating facts, emerges from how it processes tokens. In fact, poor tokenization often hurts performance more than having fewer parameters.

But what exactly are tokens, and why do we need them?

To work with text, models need to convert it into numbers. The most naïve approach is to treat each character as a token and assign it a number. But this leads to two big problems:

Sequences become extremely long, which slows everything down—training, inference, memory use.
Patterns become harder to learn. At the character level, meaningful structures are broken into tiny pieces. That makes it much harder for the model to learn how language actually works.

Think about it: when you write a sentence, you don’t think one letter at a time—you think in words or phrases.

The next idea might be: just assign an ID to every word. That seems more natural, but it creates new issues:

Rare words are a problem. If a word barely appears in the training data, the model won’t learn much about it.
Misspellings, slang, and new words break the system. With a pure word-level approach, the model has no way to handle something it hasn’t seen before.

The Subword Solution

So we split the difference. Instead of characters or full words, tokenizers break text into subword units—smaller chunks that balance vocabulary size with expressive power.

Take the word “preheat.” It splits into two tokens: "pre" and "heat". This allows the model to:

Learn meanings more efficiently by sharing representations across related words (e.g., “heat,” “heatting,” “preheat”).
Understand rare or unseen words by recombining known pieces.

Asking an LLM to count how many letters are in a word often fails, because it never sees letters. It sees tokens, which might represent whole words or subwords.

Tokenizers are vocabularies that translate text into numbers (and back), we will explore how they are created in separate issues. For now it’s important to understand how text is converted to tokens during this first step.

How GPT-4o tokenizer translates our example

You can play with tokenizers and explore how different models “see” the input text using tiktokenizer.

Tokenization isn’t just a preprocessing detail - it shapes how the entire model understands language.

Embedding: Understanding words’ meaning

After text is tokenized, the first step in an LLM is to convert each token into a dense vector known as an embedding. These vectors live in a high-dimensional space (often with hundreds or even thousands of dimensions) where tokens with related meanings are positioned close together. For example, “cake” near “pastry” or “chocolate” near “vanilla.” This mapping is done through a learned embedding table, which assigns each token an initial vector based on patterns seen during pre-training. At this stage, embeddings are static: they don’t yet account for context.

Example of word embeddings in 2D space

Still, even these initial embeddings encode rich semantic structure. They allow the model to compare meanings, detect similarities, and perform simple conceptual arithmetic, like subtracting “man” from “king” and adding “woman” to get something close to “queen.” Embeddings also play a central role in external tasks like retrieval or search, where specialized embedding models are trained to produce vector representations of entire passages or queries.

Attention: Understanding the context

Once tokens are embedded, the model needs more than just their meanings—it also needs to understand their order. Unlike humans, it has no built-in sense of sequence, so positional information is added to the token vectors. This helps the model distinguish between phrases like “first preheat the” and “preheat the first.”

With position and meaning combined, the model begins its core task: connecting the dots through attention. Attention layers allow the model to look at all other words in the sentence and decide which ones matter most.

Here’s how: each word creates a query, and compares it to keys from all the other words to see which ones are most relevant. If a match is strong (like between the query “cake” and the key “chocolate”) the model pays more attention to that connection. The actual content that gets passed along is stored in values, which are blended based on how strong each match is.

So when it encounters “chocolate cake,” attention strengthens the link between the two, refining the meaning of “cake” into something more specific.

How attention refines meaning of 'cake' into 'chocolate cake'.

This process repeats across many layers. Each one applies attention to capture relationships, followed by a feed-forward network that transforms the results. With every pass, the model deepens its understanding by layering new patterns onto old ones.

By the final layer, “cake” isn’t just a baked good, it’s a chocolate cake being prepared in an oven. The meaning has evolved through a sequence of updates shaped by the entire sentence.

This ability to build meaning through understanding connections is what gives LLMs their power.

Sampling: Choosing the next word

Now that the model understands we’re talking about a chocolate cake, it’s ready to predict what comes next. After passing through all layers, final representation is used to compute a score (logit) for every word in the vocabulary. These scores are turned into probabilities using a softmax function.

For example:

Oven       – 90%  
Microwave  – 5%  
Pan        – 3%  
Other      – 2%

Here, “oven” clearly stands out as the most likely next word. Instead of always picking the top one, we sample. It’s like rolling weighted dice, where higher-probability tokens are more likely to be chosen.

This sampling step is what gives LLMs their creativity and diversity. Without it, outputs would be repetitive and dull. Every recipe would look the same.

There are different sampling strategies that allow you to steer the model’s output toward different goals: more creative, more predictable, more diverse, or more structured. There is another issue that explains sampling in LLMs in detail.

“Choosing the right sampling strategy can significantly boost a model’s performance with relatively little effort” - Chip Huyen

This entire process of understanding, prediction, and sampling, continues until the recipe is complete or reaches a natural stopping point, such as reached limit of max tokens or generated end-of-sequence special token.

Think of the whole process like a super-advanced version of completing a sentence, where each word choice is informed by understanding the meaning of all previous words and their relationships to each other. The model does this by converting words to numbers, understanding their basic meanings, analyzing their relationships, making informed predictions, and building the response one word at a time.

Basic flowchart of how LLM processes text

Limitations & Mitigations

Despite their impressive capabilities, LLMs are not magic. A useful way to think about their limits is via the “Swiss cheese” model, formulated by Andrej Karpathy:

LLMs are solid and capable overall, but full of unpredictable holes.

You can get fluent, intelligent output one moment and nonsense the next. Understanding these limitations helps avoid mistakes and gives you ways to prompt more effectively.

Hallucinations and Knowledge Cutoff

LLMs like ChatGPT are trained to be helpful assistants that always try to answer your questions. That’s why, even when they don’t know something, they might still respond politely and confidently, and sometimes incorrectly. This is called hallucination.

Note: Common knowledge is reinforced by frequent patterns in the training data, but rare or obscure facts are less reliably encoded and more prone to errors.

Mitigation

Give the model enough context, or
Use search tools if possible.
For mission-critical use, validate output with other systems.

Math and Spelling

LLMs struggle with precise tasks like counting or character indexing because:

They operate on tokens, not characters.
For example, “berry” might be a single token, so the model doesn’t "see" individual letters and thus doesn’t know how many “r”s in “strawberry”.
Arithmetic is not performed symbolically but learned statistically.
As we've discussed, the model works by predicting the most likely next token based on patterns in its training data. That’s why, for complex or uncommon equations, it may generate answers that sound plausible, but are actually incorrect.

Mitigation

Let the model use tool such as code.
In that way, instead of performing calculations by predicting next tokens, it will write a piece of code that will do this programmatically and then give you an answer.

Limited Context Window

LLMs have a limited memory: they can only process a certain number of tokens at a time. This limit is known as the context window. It is the maximum number of tokens the model can “see” and use to generate a response.

For example, GPT-4 can handle up to 128,000 tokens, which covers hundreds of pages of text. But anything beyond that is invisible to the model. It doesn’t remember earlier parts unless they fall within the current window.

Even within the token limit, performance can degrade as the input gets longer. Models tend to focus more on the most recent tokens and may overlook important details in the middle. So while longer context windows are useful, they come with trade-offs in accuracy, speed, and cost.

Mitigation

Restart conversations when they get too long.
Repeat or summarize key information periodically.
Put important information at the beginning and near the end.

Prompting Tips

Keep in mind the golden rule of working with LLMs:

Better input → better output.

LLMs don’t read your mind. They complete patterns. What you prompt is what you’ll receive.

A clear prompt often follows a simple structure:
1. Set the persona to give the model a role or mindset
2. Provide context with any background or constraints the model should know
3. Specify the task by clearly stating what you want it to do
4. Declare the format so it knows how the output should look

Prompt example:

You are a travel writer.  
Here’s background info on Paris: I have 10 hours lay over.  
List 5 must-see landmarks.  
Format: bullet points with 1-sentence descriptions.

To give model better understanding of the task and/or output format, you can provide examples. This is also called few-shot prompting.

Prompt example:

Please convert HTML to markdown. 
Here are some examples:
 Input: Header
 Output: # Header
Convert this: Bold Text

If you are asking to solve complex task that requires logic reasoning, encourage a model to think step-by-step. This technique is called Chain-of-Thought.

What is AI Engineering?

Max — Wed, 11 Jun 2025 09:30:00 GMT

Artificial Intelligence (AI) is everywhere now. But just a few years ago, building intelligent software meant months of data preparation, model training, and complex infrastructure. It felt like something only research labs or tech giants could afford.

That’s no longer true.

With just a few lines of code, you can plug into some of the most powerful AI models ever created. These models are your building blocks, they’re like LEGO bricks. You don’t need to shape each brick yourself. Just imagine what to build, put pieces together, and bring your ideas to life.

This is AI Engineering. It is not about creating models from scratch, but about turning powerful models into useful products. You focus on the design, function, and impact.

No PhD required. No need to be a machine learning expert. The tools are accessible, and the opportunity is enormous. Today, everyone can start building AI applications.

AI Engineering is one of the fastest, and quite possibly the fastest-growing, engineering discipline.

So whether you're an experienced software engineer or a curious builder, this newsletter will help you to bridge theory and practice. You’ll learn the key ideas behind modern AI systems, and how to apply them to build real-world products.

Welcome to AI Engineering Unpacked.

From ML Engineering to AI Engineering

AI applications aren’t new. Translation apps, camera autofocus, spam filters, these have all used AI for years. But building them used to be slow and expensive. Teams of ML researchers and engineers had to curate labeled data, design and train models, and deploy custom infrastructure. It could take months to ship even a basic product.

That was classical ML engineering: start with data, build a model, then wrap it in an application.

Today, that process has flipped.

With Large Language Models (LLMs) at your fingertips, you can build a translation app or a chatbot in a single evening. AI engineers no longer begin with data pipelines or model training. They start with the problem, design the user experience, and plug in powerful models to solve it. Only then do they customize, optimize, or fine-tune if needed.

Inspired by “The Rise of the AI Engineer”

This shift is changing what the role looks like. AI engineering blends software development, systems thinking, and human-centered design. It's less about training models and more about integrating intelligence into products. Pre-trained models are core components now, and AI engineering techniques are becoming standard tools.

The job now looks a lot like full-stack engineering, with a deep understanding of how large language models work under the hood.

From my own experience as Head of AI, this has changed how I hire. I don’t just look for ML expertise, I look for software engineering skills as well. It’s not just about knowing the models, it’s about knowing how to ship great products. That blend is what makes someone a great AI engineer.

That’s the essence of AI engineering: fast iteration, user focus, and turning cutting-edge models into real-world impact. You don’t need to wait to get started. The tools are here. And you can learn by building. Today.

What has changed?

What made this leap possible is a convergence of key advancements:

Scalable training methods: especially through self-supervised learning, which unlocked ways to train models without labeled data.
Smarter architectures: like transformers, which enabled generalization across different tasks.
Advances in hardware and distributed training: which made it feasible to train enormous models on vast datasets and run large-scale experiments.

These breakthroughs led to models that learned broad patterns across language, code, and images. Scaling laws taught us that bigger models, given the right ingredients, get dramatically better. Suddenly, one model could answer questions, write code, summarize documents, and carry on a conversation.

But what changed everything wasn’t just that models got better, it’s that they became accessible.

Model-as-a-service flipped the AI equation. Now you can compose, customize, and deploy intelligent systems without ever developing a model. This lowered the barrier to entry, redefined who can build with AI, and what gets built.

AI isn’t just a research project anymore, it’s a software primitive. What used to be a machine learning challenge is now a software engineering opportunity.

The result? An explosion of AI-native products:

Developers are shipping AI features in days and startups are launching products that would’ve taken years to build from scratch!
Entire workflows are being rebuilt around intelligent systems.

Sam Altman (OpenAI CEO) believes future AI value will come from customizing foundational models, not building them from scratch.

And the potential is massive. It’s already reshaping how we work, learn, and create. This shift isn’t just technological; it's economic. PwC predicts AI could contribute up to $15.7 trillion to the global economy by 2030, with more than half of that driven by productivity gains.

Where will the value gains come from with AI? - Sizing the Price

Core Techniques

Let’s say your company wants to build a customer support chatbot. One that can answer user questions, handle orders, and maybe even process refunds. The AI engineer’s first job isn’t to dive into code, but to deeply understand the use case. What should the assistant know? How should it behave? What actions should it take? Most importantly: what does success look like, and how it will be measured?

Only then will the AI engineer begin customizing the model for the specific task.

1. Guide with Prompts

The first step is prompt engineering. This means crafting natural language instructions that guide the model’s behavior. You can define the assistant’s role, set its tone, and provide examples or constraints. When crafted effectively, prompts can deliver surprisingly strong results with minimal effort. You can learn more about prompt engineering in this issue.

2. Bring in Knowledge

But prompts have limits. If the model struggles to answer specific questions, such as details about your company’s return policy, you need to give it access to external knowledge. This is where retrieval-augmented generation (RAG) comes in. Instead of packing all relevant info into a prompt, RAG pulls the right data on demand and feeds it to the model as context. This improves accuracy and expands the model’s knowledge without retraining it.

3. Change the behavior

If that still isn’t enough, and the model needs to follow more specific behavior or tone, fine-tuning may be the next step. This involves adapting a model on your own data to consistently adjust its outputs. Fine-tuning is more expensive and complex, so it is used only when necessary.

4. Add Autonomy

For even more advanced tasks, where the assistant needs to reason, plan, or carry out multi-step actions, such as verifying identity, checking inventory, and issuing a refund, you might explore agentic patterns. These systems treat the model as a reasoning engine, wrapped in tools, memory, and logic to act more autonomously. AI agents are promising, but still an area of active exploration in AI engineering.

Together, these techniques form the core toolkit of AI engineers. Knowing when and how to use them is key to building reliable, intelligent applications.

“While fancy new frameworks and fine-tuning can be useful for many projects, they shouldn’t be your first course of action.” - Chip Huyen

Challenges

One of the core challenges in AI engineering is evaluation. Many tasks are open-ended, with no single correct answer, making it hard to measure progress or define success. Even for summarization is subjective. Let alone question answering or agent-based reasoning. Standard benchmarks often fall short, so teams rely on custom metrics, test suites, and real-time user feedback to track performance over time.

“Currently, there are no common methods or agreed-upon best practices to evaluate LLM-based applications.”

Another big challenge is latency and cost. LLMs are both computationally intensive and expensive to run. Even simple queries can take several seconds and require substantial compute resources. Tasks that require multi-step reasoning, such as planning or tool use, make both latency and cost worse. In user-facing applications, this kind of latency breaks the experience. No matter how impressive the output, if it takes too long, people won’t wait. Optimizing for speed while maintaining reliability and quality is a major ongoing challenge.

“Sometimes latency may be even more important than intelligence” - Lex Fridman

Reliability is equally difficult. These models are inherently unpredictable. A small change in input can lead to drastically different output, and the same prompt might not return the same result twice. This non-determinism makes debugging feel more like investigation than engineering. Guardrails and filters can improve behavior, but each layer adds complexity, introduces new failure points, and adds latency.

Building a prototype with generative AI is fast, turning it into a production-ready system is a different challenge entirely. What I’ve learned through building these systems is to start simple, ship quickly, and add complexity only when there is a clear reason to do so. In AI engineering, that discipline is necessary.

“When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed.” - Anthropic

Your Jump-Start Plan

I believe everyone can become an AI Engineer and the best way to learn is by building.

If you've never worked with large language models before, now is the perfect time to start. You don’t need to understand all the internals. Just pick a simple idea and experiment.

Here’s a quick jump-start plan:

1. Brainstorm an idea

Think of a small, valuable use case. A great starting point is a task you do often, or a workflow you could automate.

2. Break it down

Take your idea and divide it into smaller steps. This helps you understand where LLMs can help.

3. Build using an LLM API

Use a foundation model like Gemini to start prototyping. Google’s Gemini API has a generous free tier, so you can get started without spending anything. Just go to their website, create an API key, and start building!

Here is an example to prompt a powerful model using just a few lines of Python code:

Gemini API Quickstart

To help you get started, I’ve created a simple example that walks you through building an AI Learning Coach chatbot. It’s a real-world use case that demonstrates how integrate an LLM into your application through API and use basic techniques like prompt engineering and routing.

Don’t aim for perfection. Start exploring, building, and learning.

AI Engineering Unpacked

Quantization Made Simple: How to Run Big Models on Small Hardware?

Why You Should Care About This

The Big Picture: What is Quantization?

How Numbers Work in LLMs

The Memory Math Made Simple

How Quantization Actually Works

A Simple Example

Why This Doesn’t Break Your Model

Common Quantization Formats

Performance Expectations

Your Action Plan

Rule #1: Always Use 8-bit When Running Locally

Rule #2: Calculate Before You Download

Rule #3: Where to Find Quantized Models

Rule #4: Test Before You Commit

Wrapping up

Sampling in Large Language Models

What is sampling and why do we need it?

Sampling Strategies

Converting logits to probabilities

Temperature

Top-K

Top-P

Stopping Condition

Constrained Sampling

What’s next?

Prompt Engineering 101

💡In This Issue

Technical Details

Tokens

System and User Prompts, and Messages

Special tokens

Parametric vs. Non-Parametric Memory

Chat History

Context Window

Prompting Best Practices

Be Specific

Specify the Role

Write Clear Instructions

Provide the Context

Provide Examples

Break down complex tasks

Give the model time to think

Iterate

Prompt Engineering Techniques

Few-shot Prompting

Chain-of-Thought

ReAct

Automatic Prompt Optimisation

Jailbreaking and Prompt Injections

Learn by Building!

Further Reading

How-to Guides

Prompt Examples

Large Language Models Explained

Intro

💡 In This Issue

LLM is “Just” Next-Word Predictor

Transformers Changed Everything

Emergent capabilities

Training

Pre-training

Data Collection

Objective: Autoregressive Language Modeling

Infrastructure & Scaling Laws

Post-training

1. Instruction Fine-Tuning

2. Reinforcement Learning / Preference Optimization

What Happens Inside the Model? (4 Core Steps)

Tokenization: Translating Words Into Numbers

The Subword Solution

Embedding: Understanding words’ meaning

Attention: Understanding the context

Sampling: Choosing the next word

Limitations & Mitigations

Hallucinations and Knowledge Cutoff

Math and Spelling

Limited Context Window

Prompting Tips