Understanding Context Windows

August 11, 2025

Lately, I've been particularly interested in "context windows" because they're core to building effective AI products like Kalen. In this post, I hope to explain the basics of context windows, why they matter, and how they're being used today.

What Are Context Windows

First, what is a context window? It's just a fancy term for how much text you can feed into an LLM at once. Technically, it includes the output text as well, so input and output share one budget. The context window is a hard limit, and longer contexts cost more to process, increase response latency, and consume more server resources.
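Here's the budget math in miniature. The window size and token counts below are illustrative, not any particular model's limits:

```python
# Input and output share one budget: the context window.
CONTEXT_WINDOW = 128_000   # illustrative limit; varies by model

prompt_tokens = 6_000      # system prompt + conversation + documents
max_output = CONTEXT_WINDOW - prompt_tokens

print(f"Room left for the response: {max_output} tokens")
# Room left for the response: 122000 tokens
```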

To simplify, think of a model as having memorized an encyclopedia. It "knows" tons of facts, but they're frozen at a point in time. The context window is more like a workspace: far more limited than the model's built-in knowledge, but dynamic and editable. For example, the model "knows" Python exists (encyclopedia) but needs to see your specific code (workspace) to debug it.

Context windows are measured in tokens: pieces of words, punctuation, spaces, and even formatting. A context window can contain conversation history, documents, system instructions, and responses. In the early days, LLMs had much smaller context windows, but this was fine because people were using LLMs for simple questions and answers. As models got better, people tried more complex tasks, which hit the context window limits. Model providers keep increasing context windows: GPT-3 (4K) → GPT-4 (8K→128K) → Claude (200K) → Gemini (1M+) → experimental models pushing 10M+ tokens. To some extent, this is part of the arms race between model providers: every major model release now competes on context window size alongside capability improvements.
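To make "tokens" concrete, here's a quick check using OpenAI's open-source tiktoken library. Other model families use different tokenizers, so the exact counts vary:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "Context windows are measured in tokens, not characters."
tokens = enc.encode(text)

print(len(text), "characters ->", len(tokens), "tokens")
# Roughly 4 characters per token is a common rule of thumb for English text.
```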

From Prompt Engineering to Context Engineering

When I first started working with LLMs, prompt engineering was all the rage. The focus was on crafting the perfect prompt to get better outputs - the right instructions, examples, few-shot learning patterns.

As context windows got bigger and people started using LLMs for complex, multi-step tasks, something shifted. The challenge moved from "How do I phrase this request perfectly?" to "How do I manage information strategically across long conversations?" That's when "context engineering" started becoming a thing.

Prompt engineering and context engineering compete for the same space: few-shot examples, system prompts, and documentation all consume context tokens. So there's a fundamental trade-off. Adding five few-shot examples for better prompt engineering might use 2,000 tokens, leaving less space for your actual conversation.
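A back-of-the-envelope sketch of that trade-off. All the numbers here are made up for illustration:

```python
# Every token spent on prompt engineering is a token unavailable elsewhere.
CONTEXT_WINDOW = 8_000          # illustrative

system_prompt = 500             # instructions
few_shot_examples = 5 * 400     # 5 examples at ~400 tokens each = 2,000
reserved_for_output = 1_000     # leave room for the response

conversation_budget = (CONTEXT_WINDOW - system_prompt
                       - few_shot_examples - reserved_for_output)
print(f"Tokens left for the actual conversation: {conversation_budget}")
# Tokens left for the actual conversation: 4500
```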

Context Window Management in Practice

Here are some key practices for managing context windows:

  1. Relevant reference documentation: include only the most relevant docs or code snippets, not entire codebases.
  2. Conversation continuity: maintain the thread of important decisions across long discussions.
  3. Iterative refinement: structure conversations to build on previous context rather than starting over each time.
  4. Prioritize information: front-load the critical context and put nice-to-have details later (see the sketch after this list).
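Here's a minimal sketch of points 2 and 4: keep the system prompt and pinned decisions, then fill what's left of the budget with the most recent turns. The crude word-splitting here stands in for a real tokenizer:

```python
def fit_context(system_prompt, pinned, history, budget):
    """Build a message list that fits the budget.

    Priority order: system prompt, pinned decisions, then the most
    recent conversation turns that still fit.
    """
    def cost(msg):
        return len(msg.split())  # crude stand-in for a real tokenizer

    kept = [system_prompt] + pinned
    remaining = budget - sum(cost(m) for m in kept)

    recent = []
    for msg in reversed(history):          # walk from newest to oldest
        if cost(msg) > remaining:
            break
        recent.append(msg)
        remaining -= cost(msg)

    return kept + list(reversed(recent))   # restore chronological order

messages = fit_context(
    system_prompt="You are a helpful coding assistant.",
    pinned=["Decision: we are migrating the API from REST to gRPC."],
    history=["old turn " * 50, "recent question about proto files"],
    budget=60,
)
```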

Why this matters: better prompts mean fewer iterations, which means more context space for actual work. Good context management prevents the "lost thread" problem in long conversations and helps you work within current limitations while preparing for memory-enabled AI. It also saves you money.

From what I see, there are three main approaches people are using today to work around context limitations. All of them come down to getting the right information into the window at the right time.

RAG: External Memory Systems

RAG (Retrieval Augmented Generation) is like giving the AI a searchable external hard drive instead of keeping everything in "RAM." How it works: you store information in vector databases or knowledge bases, search for relevant context when needed, and inject only the most relevant pieces into the active context window. Think of it like a smart research assistant that can pull exact documents from a vast library. The limitation right now is that quality depends heavily on retrieval accuracy - sometimes it pulls irrelevant information.
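A toy version of the retrieve-then-inject loop. Real systems use learned embeddings and a vector database; here a simple word-overlap score stands in for the retriever, just to show the shape of the pattern:

```python
import math
import re
from collections import Counter

documents = [
    "Refund policy: customers can return an item within 30 days.",
    "Shipping: standard delivery takes 3 to 5 business days.",
    "Our office is closed on public holidays.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, doc):
    """Cosine similarity over word counts - a crude stand-in for embeddings."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    overlap = sum(q[w] * d[w] for w in q)
    norms = (math.sqrt(sum(v * v for v in q.values()))
             * math.sqrt(sum(v * v for v in d.values())))
    return overlap / norms if norms else 0.0

def build_prompt(query, top_k=1):
    # Retrieve only the most relevant snippets, then inject them into the prompt.
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many days do I have to return an item?"))
```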

MCP: Dynamic Context Through Tools

Model Context Protocol (MCP) lets AI access external tools and data sources. The breakthrough here is that instead of cramming everything into the context window upfront, the AI calls tools when needed. How does it decide which tools to call? The AI reads tool descriptions and matches them to your request. For example, you ask "What's my account balance?" The AI sees a "banking_api" tool, calls it, and gets real-time data. This transforms context from "everything upfront" to "just-in-time information access."
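Here's the tool-selection step, heavily simplified. Real MCP clients and servers speak a JSON-RPC protocol and the model itself picks the tool from published descriptions; below, a keyword match plays the model's role, and `banking_api` is just the hypothetical tool from the example above:

```python
# Each tool advertises a name, a description, and a callable.
# The description is what the model "reads" to decide whether the tool fits.
TOOLS = {
    "banking_api": {
        "description": "Look up account balances and recent transactions.",
        "run": lambda: {"balance": 1204.52, "currency": "USD"},  # stub data
    },
    "weather_api": {
        "description": "Get the current weather forecast for a city.",
        "run": lambda: {"forecast": "sunny"},
    },
}

def pick_tool(request):
    """Stand-in for the model: match request words against descriptions."""
    words = set(request.lower().split())
    def overlap(tool):
        return len(words & set(tool["description"].lower().split()))
    return max(TOOLS, key=lambda name: overlap(TOOLS[name]))

request = "What's my account balance?"
tool = pick_tool(request)
print(tool, "->", TOOLS[tool]["run"]())
# banking_api -> {'balance': 1204.52, 'currency': 'USD'}
```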

Memory-Enabled Architectures

This is the most exciting area in my opinion, and I'll dig into this more in a later post. Agents are starting to act like intelligent memory managers with capabilities like context compression (summarizing conversations while preserving key decisions), selective memory (keeping important information and dropping irrelevant details), and dynamic switching (loading different context "profiles" based on the current task).
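A sketch of context compression: once the history exceeds a threshold, older turns get collapsed into a summary while recent turns stay verbatim. The `summarize` stub here would be an LLM call in practice:

```python
def summarize(messages):
    """Stub: in a real system this would be an LLM call that preserves
    key decisions while dropping chit-chat."""
    return "Summary of earlier conversation: " + " / ".join(m[:30] for m in messages)

def compress_history(history, keep_recent=4, max_messages=8):
    """Collapse older turns into one summary once history grows too long."""
    if len(history) <= max_messages:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = [f"turn {i}: ..." for i in range(12)]
compressed = compress_history(history)
print(len(history), "->", len(compressed), "messages")
# 12 -> 5 messages
```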

There's also multi-agent coordination, where different agents maintain separate contexts for specialized tasks. You might have a research agent that maintains context about sources, findings, and methodologies; a coding agent that tracks architecture, recent changes, and debugging history; and a planning agent focused on goals, constraints, and progress tracking.
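A minimal sketch of that separation: each agent owns its own context, so the research agent's sources never crowd out the coding agent's debugging history. The agent names mirror the examples above; everything else is made up:

```python
from collections import defaultdict

class AgentContexts:
    """Each agent keeps an isolated context instead of sharing one window."""
    def __init__(self):
        self._contexts = defaultdict(list)

    def remember(self, agent, note):
        self._contexts[agent].append(note)

    def context_for(self, agent):
        return list(self._contexts[agent])

contexts = AgentContexts()
contexts.remember("research", "Source: survey of long-context models")
contexts.remember("coding", "Recent change: swapped tokenizer to cl100k_base")
contexts.remember("planning", "Constraint: ship the memory feature by Q4")

print(contexts.context_for("coding"))
# ['Recent change: swapped tokenizer to cl100k_base']
```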

Why This Actually Matters

Context windows aren't just a technical detail. They're shaping every AI interaction you have today. AI can only hold so much information in its "working memory" at once.

This matters because as AI moves from answering questions to becoming persistent assistants and agents, context management becomes the difference between systems that work and systems that fail halfway through complex tasks. The companies that figure out memory and context persistence first will build the AI assistants everyone actually wants to use.

I'll be writing more about this stuff as I learn more about agents and memory systems. There's a lot more to explore here.