Decomposition Is the Skill That Decides Whether an Agent Is Reliable

The reliability of an agentic system is mostly settled before the model runs a single step. It is settled in how the work was cut. Hand an agent a large, monolithic goal and it has to invent the structure the goal lacks: it has to decide on its own where one piece of the job ends and the next begins, in what order to attempt them, and how it will know the whole thing is finished. Hand it a goal already broken into well-formed pieces and most of that guessing disappears. The consequence worth dwelling on is that the same model produces a far more reliable system when the work reaching it already has shape. Decomposition is the lever, and it is a design skill, not a model property.

This is worth stating plainly because it inverts where most teams look when an agent misbehaves. The instinct is to reach for a stronger model, a longer prompt, more examples. Those help at the margins. But an agent that fails because it was asked to hold an underspecified, ten-part goal in one turn does not need to be smarter. It needs the goal decomposed so that each part is small enough to execute, verify, and retry on its own. The structure of the work is the architecture. The model is a component inside it.

A capable model still cannot infer the structure a goal omits

A complex goal handed over as a single instruction carries an implicit demand: supply the missing structure yourself. The model has to settle the internal divisions of the work, sequence them, form a picture of what each intermediate result should contain, and judge when the goal as a whole has been met. A capable model can often do a surprising amount of this. But “often” is the problem. Inferred structure is structure the system cannot see, cannot check, and cannot reuse. When the inference is wrong, the failure is opaque, and the only recovery available is to rerun the entire goal and hope the next attempt guesses better.

Decomposition removes the guessing by making the structure explicit and external to the model. When a goal is broken into named subtasks with defined inputs and outputs, three properties appear that a monolithic task can never have. A failed piece can be rerun on its own, so one bad step does not invalidate the work the others already finished. A piece can be checked in isolation, because a unit with a defined output can be measured against that output directly. And a piece can be handed to whichever agent is best suited to it, so a subtask that leans on a particular capability goes to something built for it instead of one generalist stretched across everything.

Recoverability is where most of the practical value sits. In a monolithic task, any failure is a total failure. If the eighth thing the agent was doing goes wrong, the first seven are lost and the whole goal restarts. In a decomposed task, recovery is targeted: you rerun the step that broke, and everything upstream of it stays valid. What this buys is reliability of a structural kind. It does not come from the model making fewer mistakes. It comes from the system being able to contain and recover from the mistakes the model will inevitably make.
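A minimal sketch of targeted recovery, assuming a purely sequential workflow and an in-memory checkpoint. The names `run_pipeline`, `fetch`, and `summarize` are hypothetical, chosen only for illustration; a real system would persist the checkpoint and call a model inside each step.

```python
from typing import Any, Callable

# Hypothetical shape: a step reads the results finished so far and returns its own.
Step = Callable[[dict[str, Any]], Any]


def run_pipeline(steps: list[tuple[str, Step]], done: dict[str, Any]) -> dict[str, Any]:
    """Run steps in order, skipping any whose result is already in `done`.

    If a step raises, the exception propagates, but `done` still holds every
    result finished before it, so the next call resumes at the step that broke.
    """
    for name, step in steps:
        if name in done:
            continue  # upstream work stays valid; do not redo it
        done[name] = step(done)
    return done


calls = {"fetch": 0, "summarize": 0}


def fetch(results: dict[str, Any]) -> list[str]:
    calls["fetch"] += 1
    return ["doc-a", "doc-b"]


def summarize(results: dict[str, Any]) -> str:
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise RuntimeError("transient failure")  # simulated first-attempt error
    return f"{len(results['fetch'])} documents summarized"


steps = [("fetch", fetch), ("summarize", summarize)]
checkpoint: dict[str, Any] = {}

try:
    run_pipeline(steps, checkpoint)
except RuntimeError:
    pass  # "summarize" failed; "fetch" is already checkpointed

run_pipeline(steps, checkpoint)  # reruns only the step that broke
```

After the second call, `fetch` has run once and `summarize` twice. The retry cost is one step, not the whole goal.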

The spectrum runs from a fragile monolith to an overwrought pipeline

Decomposition is not a direction you push as far as possible. It is a quantity you tune, and both extremes are failure modes.

At one end sits the monolith. It is genuinely the easiest thing to build: one instruction, one result, no coordination logic, no handoff schemas, nothing to wire together. That simplicity is real and it is why monoliths are the right answer for genuinely simple goals. But a monolith is fragile in proportion to its size. It cannot be parallelized, because it is one unit. It is hard to debug, because the whole process is a single opaque span with no internal checkpoints. And it fails completely, because there is no smaller piece to retry. A large monolithic agent task is a single point of failure that happens to be cheap to write.

At the other end sits over-decomposition, and it is the more seductive mistake because it looks like rigor. Every function becomes its own subtask. Every subtask gets its own handoff. The diagram looks meticulous. But coordination is not free, and past a certain point the system spends more effort shuttling work between steps than doing the work itself. Stretch a job across a long chain of handoffs and the time lost moving results across all those boundaries can overtake the time the work would have taken in a single well-scoped pass. Each boundary adds orchestration logic, a place for a schema to mismatch, and a moment where context can be dropped. Slice the work finely enough and the coordination overhead dominates, and you have built something slower and less reliable than the monolith you were trying to improve on.
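The overhead argument can be made concrete with a toy cost model. This is a sketch under loud assumptions: the steps run strictly in sequence, splitting the work does not change how long the work itself takes, and every boundary costs the same fixed handoff time. The function names and all numbers are invented for illustration.

```python
def pipeline_seconds(work_seconds: float, n_subtasks: int, handoff_seconds: float) -> float:
    """Total time for a sequential chain: the work plus one handoff per boundary."""
    boundaries = n_subtasks - 1
    return work_seconds + boundaries * handoff_seconds


def overhead_fraction(work_seconds: float, n_subtasks: int, handoff_seconds: float) -> float:
    """Share of the total run spent moving results across boundaries."""
    total = pipeline_seconds(work_seconds, n_subtasks, handoff_seconds)
    return (total - work_seconds) / total


# Assumed figures: 60 seconds of real work, 3 seconds per handoff.
for n in (1, 5, 21):
    total = pipeline_seconds(60.0, n, 3.0)
    share = overhead_fraction(60.0, n, 3.0)
    print(f"{n:>2} subtasks: {total:.0f}s total, {share:.0%} spent on handoffs")
```

Under these assumed numbers, a chain of 21 subtasks spends as long crossing boundaries as it does working. Real systems will differ, but the shape of the curve is the point: the work stays constant while the handoff cost grows with every cut.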

The target is the middle, and the heuristic that locates it is simple: each subtask should be a self-contained chunk of work worth running and judging on its own, not a lone operation that belonged inside the step next to it. The right grain is the level at which a piece does one coherent thing whose success or failure you would actually want to observe and act on independently. Above that grain you lose recoverability; below it you pay coordination cost for boundaries that buy you nothing.

Read the goal’s structure before deciding where to cut

Where to place the boundaries is not a matter of taste, and it should not be settled by how the work intuitively seems to divide. It falls out of how the parts of the goal actually depend on one another, and that dependency shape can be worked out before a line of orchestration exists. Three questions expose it.

The first asks what can proceed on its own. If a portion of the goal needs nothing from any other portion to begin, it can run alongside the rest, and spotting these early is where the latency savings come from. Mutual independence is the precondition for any concurrency at all; without it there is nothing to overlap.

The second asks where one part genuinely consumes the result of another. When a later step cannot even begin until an earlier step hands it something, the order between them is fixed by the data itself, not by preference, and no amount of cleverness relaxes it. The trap is to misread such a coupling as optional and let the two run at once. The dependent step then fires before the value it needs exists, works from whatever stale or empty input it finds, and returns an answer that is wrong while looking entirely successful. Nothing throws, which is exactly what makes the failure so hard to notice.

The third asks whether the goal moves through different kinds of competence. A job that must first gather information, then reason over it, then produce an artifact is really three jobs, each wanting its own context and its own tools and, often, its own agent. Those shifts in the kind of work are natural places to cut, because a piece that sits entirely on one side of such a shift is coherent almost by default.

Answering the three questions leaves you with a picture of the goal’s real shape: which parts are independent, which are chained by data, and where the kind of work changes. That picture, not a preference for how a diagram should look, is what the decomposition should follow. The analysis repays the time it takes, because a decomposition is awkward and costly to rework once orchestration has been built around it. The boundaries you settle on become load-bearing.
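One way to write that picture down is as a dependency graph, from which the parallel and sequential parts fall out mechanically. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library; the subtask names and the helper `execution_waves` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical goal: each subtask maps to the subtasks whose output it consumes.
depends_on = {
    "search_web": set(),
    "load_internal_docs": set(),
    "analyze": {"search_web", "load_internal_docs"},
    "draft_report": {"analyze"},
}


def execution_waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into waves; everything within a wave is mutually independent."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all inputs for these already exist
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(depends_on)
print(waves)
# [['load_internal_docs', 'search_web'], ['analyze'], ['draft_report']]
```

The two gathering steps share a wave because neither needs the other. `analyze` cannot appear until both have finished, which is the data dependency made explicit instead of left to chance.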

A well-formed subtask has four properties

Knowing where to cut is half the problem. The other half is whether each resulting piece is actually a clean unit. A subtask earns its place as an independent unit when it has four properties, and a subtask missing any one of them tends to cause failures that are hard to trace back to their source.

The first is single responsibility. A subtask should carry exactly one job and emit exactly one result. A unit that quietly does two things is harder to test, because a single output no longer tells you which of its two jobs succeeded, and harder to retry cleanly, because retrying it redoes both jobs even when only one failed.

The second is defined inputs. Everything the subtask acts on arrives through an explicit set of inputs, and it leans on nothing outside them. In particular, it must not reach for context it was never actually handed, such as the leftover state of some earlier exchange it had no part in. This is what makes a subtask portable and its behavior reproducible: give it the same inputs and it does the same thing, because there is no hidden state left to shift the outcome.

The third is bounded scope. A subtask should be small enough that a failed attempt is cheap to repeat and that repeating it never drags the whole workflow back to its starting point. Scope is what ties the four properties back to the reliability argument. The entire point of decomposition was targeted recovery, and a subtask grown too large quietly hands that benefit back.

The fourth is clean, structured output. What the subtask returns should be shaped so the orchestrator can act on it as is, rather than loose prose a later step has to read and decode. Structure is what keeps a boundary mechanical instead of turning it into a second round of inference. The moment a downstream step has to interpret the meaning buried in an upstream step’s text, the ambiguity decomposition was meant to remove is back.
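The four properties can be read off the signature of a subtask. The sketch below is illustrative only: `ExtractInput`, `ExtractOutput`, and `extract_action_items` are hypothetical names, and the body is a deterministic stand-in for what would normally be a model call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractInput:
    """Defined inputs: everything the subtask may act on, and nothing else."""
    document_text: str
    max_items: int


@dataclass(frozen=True)
class ExtractOutput:
    """Structured output: the orchestrator can branch on `ok` without parsing prose."""
    ok: bool
    items: tuple[str, ...]
    error: Optional[str] = None


def extract_action_items(inp: ExtractInput) -> ExtractOutput:
    """Single responsibility: pull out lines marked as action items.

    Bounded scope: it touches one document, so a failed attempt is cheap to repeat.
    """
    if inp.max_items < 1:
        return ExtractOutput(ok=False, items=(), error="max_items must be >= 1")
    found = [
        line.removeprefix("TODO:").strip()
        for line in inp.document_text.splitlines()
        if line.startswith("TODO:")
    ]
    return ExtractOutput(ok=True, items=tuple(found[: inp.max_items]))


result = extract_action_items(
    ExtractInput(document_text="TODO: ship\nnote\nTODO: test", max_items=5)
)
```

Because the function reads only from `inp` and reports failure through the same structure it uses for success, it can be tested, retried, or moved to a different agent without touching anything around it.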

The handoff is where decomposed systems actually break

Decomposition creates boundaries, and the boundary is where decomposed systems most often come apart. That boundary is the handoff, the join where one subtask’s output turns into the next one’s input. It earns as much design attention as the subtasks on either side of it, because a clean set of subtasks strung together by careless handoffs is still an unreliable system.

The first rule of a good handoff is that only what the next step requires should cross it. The temptation runs the other way, toward forwarding everything, emptying the whole accumulated context over the boundary so the receiver “has what it might need.” That impulse backfires. An agent handed its entire upstream history has to hunt for the part that matters inside a mass of material that does not, and the noise pulls at its attention the whole time. A tight handoff carrying only the needed inputs is not merely smaller; it produces better behavior downstream, because the receiver is not distracted by material it was handed for no reason.

The second rule is that what crosses the boundary should be a defined, structured payload rather than loose text. A defined shape turns the agreement between two steps into something explicit and checkable by machine. Loose text turns every handoff into one more act of interpretation, and interpretation is precisely the brittle operation decomposition set out to remove. When both sides commit to the same shape, a violation of it surfaces as a detectable error instead of passing through as a quiet misreading.
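The first two rules can be sketched together: copy across only the declared fields, and reject a payload whose shape is wrong. Everything here is hypothetical (`REQUIRED`, `HandoffError`, `build_handoff`, and the field names); a real system might use a schema library rather than hand-written checks.

```python
class HandoffError(ValueError):
    """Raised when upstream state cannot satisfy the handoff contract."""


# Hypothetical contract for one boundary: field name -> expected type.
REQUIRED = {"question": str, "findings": list}


def build_handoff(upstream_state: dict) -> dict:
    """Return only the fields the next step needs, checked against the contract."""
    payload = {}
    for name, expected in REQUIRED.items():
        if name not in upstream_state:
            raise HandoffError(f"missing field: {name}")
        value = upstream_state[name]
        if not isinstance(value, expected):
            raise HandoffError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        payload[name] = value
    return payload


state = {
    "question": "What changed in Q3?",
    "findings": ["revenue up", "churn flat"],
    "raw_search_log": "...thousands of lines...",  # stays behind
    "scratchpad": "half-formed notes",             # stays behind
}
payload = build_handoff(state)
```

The search log and scratchpad never reach the receiver, and a `findings` value of the wrong type stops the run at the boundary instead of surfacing later as a misreading.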

The third rule is the one teams skip most often: spell out what the handoff leaves out. A subtask has no inherent view of everything the orchestrator holds, and the gaps in what it knows are invisible from where it sits. Leave those gaps unmarked and the subtask papers over them with guesses. Guesses about missing context produce errors that stay silent and resist debugging long after the run, since nothing in the output points back to the information that was never sent. Stating plainly what a step will not receive is what keeps it from confidently filling the absence with invention.
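A sketch of how the third rule might look in practice, assuming the receiving step is briefed in text. The `Handoff` type, its `not_provided` field, and `render_brief` are hypothetical; the wording of the brief is one possibility, not a tested prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    inputs: dict                       # what the step receives
    not_provided: tuple[str, ...] = () # what it is explicitly told it lacks


def render_brief(task: str, handoff: Handoff) -> str:
    """Build the text handed to the subtask, naming the gaps as well as the inputs."""
    lines = [f"Task: {task}", "You have been given:"]
    lines += [f"- {name}: {value}" for name, value in handoff.inputs.items()]
    if handoff.not_provided:
        lines.append("You have NOT been given the following. Do not assume or invent it:")
        lines += [f"- {item}" for item in handoff.not_provided]
    return "\n".join(lines)


brief = render_brief(
    "Draft a renewal email",
    Handoff(
        inputs={"customer_name": "Acme", "renewal_date": "2025-03-01"},
        not_provided=("customer contract terms", "prior support tickets"),
    ),
)
print(brief)
```

Listing the absences costs a few lines per handoff and gives the receiver a reason to flag a gap instead of guessing across it.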

Every boundary is a cost you keep paying

The case for decomposition is strong enough that it is easy to forget it is not free, and the discipline lives in remembering the cost. Every boundary added to a workflow buys reliability and charges complexity, and the charge recurs for the life of the system.

Each new subtask is more orchestration logic to write and keep working. Each handoff adds delay, since moving a result across a boundary costs time the underlying work never asked for. Each boundary is one more thing that can break: a shape that fails to line up, context that slips through, a partial failure that strands the workflow in an intermediate state no single step is responsible for clearing. And the more finely a system is divided, the harder it is to follow when it misbehaves, because the thing under investigation is no longer one agent’s reply but a chain of them, and you are left reconstructing which step saw what and where the run first diverged from what you expected.

So the balance being struck is reliability set against complexity, and it comes with an explicit stopping rule. Carry the decomposition far enough to win back the recoverability, the delegation, and the testability that were the reason to decompose in the first place. Then stop, before the cost of coordinating the boundaries grows larger than the worth of the work they hold apart. A boundary that cannot justify itself through real recoverability or real specialization is not rigor. It is overhead, and the system pays it on every single run.

The structure is the architecture

The durable point is that an agent’s reliability is a property of how its work is structured, far more than a property of the model executing it. Decomposition is the discipline that sets that structure: reading a goal’s real dependencies before cutting it, sizing each subtask so it carries a single responsibility with defined inputs and a clean output, designing each handoff as an explicit contract, and adding boundaries only where the payoff is real. Done well, none of this makes the model smarter. It makes the system around the model able to contain failure, recover precisely, and stay inspectable as it grows. That is why decomposition is the core skill of building agents rather than a preliminary step before the real work. The decomposition is the real work. Everything downstream inherits its quality.