The Context Window Is the Only Working Memory a Model Has

A language model has no memory of you. It has no memory of the last thing it told you, no running notion of the task, no internal notebook that persists from one call to the next. What looks like memory is an illusion reconstructed on every request. The single place that illusion is assembled is the context window, and that window is the entire working memory the model gets. Everything the model can reason over has to be inside it. Everything outside it does not exist. Systems that scale treat this window as a scarce, actively managed resource. Systems that break treat it as a text box you pour content into until something goes wrong.

The distinction matters because the failure is rarely loud. A model that has lost the thread does not throw an exception. It answers confidently from whatever happens to be in front of it, which may be missing the one fact that would have made the answer correct. Understanding what the window holds, what competes for space inside it, and how it degrades as it fills is the foundation for every other decision in building a reliable agent.

Everything the model reasons over lives in one window

The context window is the complete set of tokens available to the model on a given forward pass, and that set is larger than most people assume. Its contents reach well beyond the conversation. The standing instructions that define the model’s role and rules are in it. Every tool the model has been given, expressed as a definition it can read, is in it. So are the documents and data injected for the task, any attached metadata or running summaries, the full back-and-forth of the exchange so far, and the model’s own text as it is being generated. All of it shares one window and is counted in one unit.

The defining property is that the boundary is absolute. Content inside the window can be attended to. Content outside it might as well not exist. There is no background store the model consults when the window comes up short, no graceful reach for information that was true two requests ago but is no longer present. The model is, at the level of a single call, a stateless function: given this exact window, produce the next tokens. Continuity across a conversation is something you engineer by feeding the relevant history back in every time. The model is not remembering. You are reminding it.

This reframes what working memory means here. In a running program, working memory is the data structures currently in scope. For a model, the context window is precisely that scope, and it is rebuilt from nothing on each call. Nothing carries over implicitly. If a fact matters to the next response, it has to be physically present in the next window, or it is gone.

The user’s message is a small slice of what the model is processing

Most reasoning about the window starts from the wrong place, with the user’s input, and treats everything else as background. In any non-trivial system that input is the smallest moving part. The standing instructions alone can run to hundreds or thousands of tokens before anyone has said a word. Every additional turn carries forward all the turns before it, so a long exchange compounds on itself. A retrieval step can pour in document chunks that dwarf the question that prompted them. Add the room taken by tool definitions, attached metadata, and earlier summaries, and the picture inverts: by the time a person finishes typing a sentence, most of the budget is already committed elsewhere.

This has a direct consequence for design. The components you control and rarely look at, the standing system prompt, the accumulated history policy, the volume of retrieved context, are usually the dominant consumers of the window, not the live input you are focused on. Optimizing the window almost always means auditing the parts that are stable and invisible, not trimming the user’s words.

Input and output draw from the same account

The window has a single ceiling that combines everything the model reads and everything it writes. Input and output are not separately budgeted. The tokens the model generates count against the same limit as the tokens you sent, which produces a tension that is easy to miss until it bites.

When a model reasons at length before answering, whether through an explicit reasoning phase or a scratchpad it writes to itself, that reasoning is output, and it consumes the window as it is produced. A request that leaves generous room for input can still run out of room mid-generation, because the model spent its remaining budget thinking before it reached a conclusion. The practical discipline is to treat output headroom as a line item you reserve deliberately, not as whatever happens to be left over. A request with no explicit bound on generation is a request whose total footprint you do not actually control.

Tokens are the unit, and the same idea costs different amounts to represent

Everything in the window is counted in tokens, the sub-word fragments the model actually processes. The number of tokens a piece of content costs is not a fixed function of its length in characters, and the variation is large enough to be an architectural concern.

How cheaply content reduces to tokens depends on how well represented it is in the model’s vocabulary. Ordinary prose maps to few tokens per word, because the words and word-pieces it draws on are common. Source code lands in roughly the same range, since its keywords and recurring identifiers are also well covered. The cost climbs sharply for heavily structured data: every delimiter that makes a format machine-readable, the punctuation and nesting and quoting, is itself a token, so the same values wrapped in a verbose envelope can cost far more than the values alone. Some natural languages are more expensive again, taking several times the tokens to carry a meaning that a well-represented language conveys in a handful.

The lesson is that the format in which you inject content is a budget decision, not a cosmetic one. When you are choosing how to serialize retrieved data, how to represent tool results, or how to carry state between turns, the encoding you pick changes how much of the model’s working memory you spend to say the same thing. Choosing a leaner representation is one of the cheapest wins available, and it is invisible until you measure it.

A full window fails in three different ways

A full window is not a single event but three different ones, and which one you get is a property of your system rather than the model. They are worth separating because they fail in different ways and call for different defenses.

The first is the hard ceiling. When the combined input and output would exceed the limit, the call cannot proceed as-is. Depending on the implementation, it either fails outright with an error or stops generating once it reaches the boundary. The important thing a well-behaved interface does not do is silently discard your earlier turns to make room. The boundary is surfaced, not hidden. That is the benign case, because it is visible. You find out at request time and can respond.

The second is naive truncation, and it almost always lives in the application layer, not the model. Facing a request that would overflow, the surrounding system evicts whatever is oldest to make room. That heuristic holds right up until the evicted turn is the load-bearing one: the constraint set in the opening message, a decision reached many exchanges back, a correction the user made once and assumed would stick. Equating age with irrelevance is convenient and frequently wrong. Nothing errors. The system answers on, now blind to context it has no way to know it discarded.

The third mode is the worst, because nothing breaks at all. When a window runs very full, a model’s attention spreads unevenly: material sitting in the middle of a long context tends to register less reliably than material near the start or the end. The text is still there, fully inside the window, yet the model behaves as though it half-forgot it. There is no error to catch. What surfaces instead is a model that sometimes skips an instruction it was plainly given, or misses a detail sitting in plain view, for no reason the logs can explain. A short test rarely provokes it; production, where contexts run long, provokes it routinely, and the casualty is usually the instruction that was buried deepest.

These three modes share a root cause. A window managed by hope rather than by plan will eventually fill, and when it does, the system has no good options left. It can fail, it can guess at what to discard, or it can quietly degrade. None of those is a decision you want made implicitly at the worst possible moment.

Larger windows relocate the constraint, they do not remove it

The obvious response to a finite window is to reach for a bigger one, and window sizes have grown substantially across model tiers. It is worth being precise about what that buys. A larger window raises the ceiling. It does not change the physics underneath it. The same competition for attention, the same drop in mid-context recall, the same need to decide what earns its place all persist at the larger scale. You can fit more in, which means you can also waste more, dilute more, and bury the important content under more noise.

There is also a quieter trap in assuming that the most capable model carries the largest window. Reasoning capability and window size are separate axes. The tier you would choose for the hardest reasoning is not automatically the one with the most room, and the tier with the most room is not automatically the strongest reasoner. Picking a model means matching both the context the task genuinely needs and the reasoning it demands, and those two requirements do not move together. A bigger window is more headroom to manage well, not permission to stop managing.

Budget the window like the scarce resource it is

The alternative to hoping content fits is deciding in advance what fits, which is all a context budget really is. Each component is handed a ceiling before any request is assembled: a cap on the standing instructions, a cap on retained history, a cap on retrieved material, and a reserved slice for output that nothing else is permitted to borrow. Those ceilings live in code, held by trimming, summarizing, or selecting down to the allowance, rather than left to the hope that the content turns up small enough to fit.

In practice this resolves into a few habits that reinforce each other. The most basic is measurement: track token consumption on every call, so a system prompt that has quietly doubled or a history that is creeping upward surfaces on a chart well before it surfaces as an incident. Retained history is the line item that grows on its own, so it gets capped by rolling older exchanges into a running summary and keeping only the most recent ones intact, which holds the conversation’s footprint flat instead of letting it climb with every turn. Retrieved material is admitted on relevance, not volume: a few passages that genuinely bear on the request, since each one that does not still occupies room and still pulls on the model’s attention. And generation carries an explicit limit of its own, which both defends the room reserved for input and keeps a single response from expanding without end.

None of this is exotic. It is the same discipline you would apply to any finite, contended resource: decide the allocation in advance, enforce it mechanically, and instrument it so you see pressure building before it fails. The window is memory, and memory you do not manage is memory that fills with whatever arrived most recently.

The window is memory you allocate

The context window is the one resource every inference call cannot escape. It is the model’s entire working memory, finite and rebuilt from scratch on each request, and everything the model can reason over has to fit inside it while competing for the same space and the same finite attention. The user’s message is a fraction of it. Input and output share its ceiling. Different content costs different amounts to hold. And it degrades in three distinct ways as it fills, one of which is silent.

The teams that run models reliably share one shift in stance. The window is not a container you keep pouring into until it spills. It is working memory you design, with an allocation you set and enforce, every token earning its place against everything it crowds out. Every later move in context management, summarization, external memory, retrieval tuning, private scratchpads, is a tactic in service of that one principle. The model remembers nothing on its own. What it has to work with is exactly what you put in front of it, and choosing well is the entire job.