Most Systems You’re Calling Agents Should Be Workflows

The word agent has become the default aspiration. New systems are framed as agentic before anyone has asked whether they need to be, and “we built an agent” now carries a prestige that “we built a pipeline” does not. This is an expensive habit. For the large majority of production work, the correct architecture is the deterministic workflow, and the decision rests on a single question. Does the task require judgment at runtime that cannot be specified in advance? When the answer is no, and it usually is, choosing autonomy buys flexibility you will not use at a cost you will certainly pay.

This is not an argument against autonomous systems. It is an argument for spending autonomy deliberately, against requirements that actually demand it, instead of reaching for it because it is the more impressive-sounding option. The discipline that separates durable systems from fragile ones is the willingness to build the less glamorous thing when the less glamorous thing is correct.

What makes a system agentic

Strip away the vocabulary and one distinction matters above the rest. Who decides the control flow, and when?

In a deterministic workflow, a person authors the control flow at design time. Every branch, every loop, every ordering decision is fixed before the system runs. A language model may do significant work inside that structure, classifying, drafting, extracting, or summarizing, but it never decides what happens next. The sequence is settled before the first request arrives.

In an agentic system, the control flow emerges at inference time. The model inspects its current state, chooses an action, observes the result, and chooses again. The path to the goal is discovered as the system runs rather than written before it. That single property, runtime control over the sequence, is what makes a system agentic, and almost everything else people argue about follows from it.
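
The distinction can be made concrete with a minimal sketch. Everything named here is hypothetical: classify and draft_reply stand in for model calls, and choose_action stands in for a model that picks its own next step. The point is only where the sequencing decision lives, not how a real system would be built.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for model calls; a real system would call an LLM here.
def classify(text: str) -> str:
    return "refund" if "refund" in text.lower() else "other"

def draft_reply(category: str, text: str) -> str:
    return f"[{category}] reply to: {text}"

def run_workflow(text: str) -> str:
    """Authored control flow: the order of steps is fixed at design time."""
    category = classify(text)            # always step 1
    return draft_reply(category, text)   # always step 2

def run_agent(
    goal: str,
    choose_action: Callable[[List[str]], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> List[str]:
    """Emergent control flow: the next step is chosen while the system runs."""
    history = [goal]
    for _ in range(max_steps):
        action = choose_action(history)   # the model decides what happens next
        if action == "stop":
            break
        history.append(tools[action](history[-1]))
    return history
```

In run_workflow the model-like functions do real work but never choose the sequence. In run_agent the sequence is whatever choose_action returns, which is exactly the property the rest of the argument turns on.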

It helps to treat autonomy as a spectrum rather than a binary. At one end sits the fully scripted pipeline. At the other sits a system handed a goal and left to determine every step itself. Most real systems are neither. They are hybrids, a deterministic skeleton with one or two places where genuine runtime judgment is delegated. Once you see that the choice is not workflow or agent but how much autonomy and exactly where, you can start designing systems you can actually operate. The architectural question, properly posed, is not whether something should be an agent. It is which specific decisions in the system genuinely cannot be made until runtime.

Why runtime judgment is a high bar

Because the whole decision turns on whether a task needs runtime judgment, the phrase deserves a precise definition, since most tasks that feel like they need it do not.

A task requires runtime judgment when the correct next step depends on information that does not exist until the system is already running and whose possible values cannot be enumerated in advance. The test is concrete. Given the inputs, could you in principle write down the decision rule, even a tedious one? If you could, because the branching is large but knowable, the inputs bounded, and the outputs specifiable, then you do not have a judgment problem. You have a complexity problem, and complexity is something authored control flow handles well.

The confusion arises because sophistication reads as autonomy. A task that involves many conditions, several data sources, and careful handling of edge cases feels like it needs a mind making decisions. But having many conditions is not the same as having conditions you cannot know ahead of time. A classification step that routes among twenty downstream behaviors is still a fixed routing decision, even when it is doing something subtle. A model exercising fine discrimination inside an authored structure is not a model deciding the structure.

Genuine runtime judgment shows up in a narrower set of situations. Open-ended exploration where the next move depends on what the last move revealed. Environments too large or too dynamic to map in advance. Tasks where the goal itself must be interpreted and refined as work proceeds. Those are real, and they are where autonomy earns its cost. They are also rarer than the prevailing enthusiasm assumes, and most systems labeled agentic are doing authored work wearing an autonomous costume.

Predictability versus adaptability

A deterministic workflow gives you a specific and valuable set of properties. The control flow is fixed, so behavior is predictable across runs. Steps are authored, so they are individually testable. Failures localize: when something breaks, it breaks at a named, identifiable point. Cost and latency are bounded and can be estimated before deployment, because the system does not invent new work for itself.

An agentic system trades all of that for a higher capability ceiling. Because the model can plan and re-plan against intermediate results, it can pursue goals that no fixed script could anticipate. But the same property that grants the capability, runtime decision-making, is what removes the predictability. Behavior varies from run to run. Cost and latency become variable, driven by planning loops you did not schedule. Failures no longer localize cleanly, and the root cause can sit anywhere in a long reasoning trace that is far harder to read than a structured execution log.

The mistake most teams make is to weigh these two profiles by their ceilings. They compare the most impressive thing each architecture could do and reach for the more powerful-sounding option. The better comparison is mundane. Which set of operational properties does this specific task actually require? An architecture is not better because its best case is more impressive. It is better because its everyday behavior matches what you need to run, debug, and trust in production.

This reframes a phrase that often gets used as a warning, that a workflow leaves capability unused. It does, and that is the point. Capability you never exercise is not a benefit held in reserve; it is surface area, a set of behaviors the system can produce that you did not ask for, cannot fully predict, and now have to defend against. Every degree of autonomy you grant expands the space of things the system might do, and you are accountable for all of it, not only the parts you intended. Unused capability is headroom in name and unmanaged risk in fact.

A test before reaching for autonomy

Before designing anything as an agent, run the task through three checks.

First, is the input shape known and bounded at the time you build the system? If you can characterize the space of inputs the system will face, you can author handling for them.

Second, is the output specifiable and testable against a clear contract? If you can state what a correct result looks like precisely enough to test it, you can build toward that target deterministically.

Third, can every processing step be defined at design time, without requiring the model to exercise judgment that depends on runtime information no one can know in advance?

When all three hold, a workflow is not merely adequate. It is the correct choice. Selecting it does not forfeit capability. It is a deliberate engineering decision that yields lower cost, more predictable execution, and a system the team can reason about and repair with confidence when something fails in production, as it eventually will.

The bar for autonomy is a no to at least one of these questions: an input you genuinely cannot constrain, an output you cannot fully specify, or a step whose resolution truly depends on information that only exists once the system is running. Even then, a single no does not necessarily make the whole system agentic. It usually localizes the autonomy to a single step, which is exactly where it should be contained.
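
As a rough illustration, the three checks can be written down as a small review aid. This is a sketch under stated assumptions: TaskProfile, failed_checks, and recommend are hypothetical names, and the booleans are answers a person supplies after examining the task, not something the code can infer.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class TaskProfile:
    """Answers to the three checks, filled in by a person (hypothetical type)."""
    input_bounded: bool      # check 1: input shape known and bounded at build time
    output_testable: bool    # check 2: output specifiable against a clear contract
    steps_definable: bool    # check 3: every step definable at design time

def failed_checks(task: TaskProfile) -> List[str]:
    """Return the names of the checks the task does not satisfy."""
    checks = {
        "input shape": task.input_bounded,
        "output contract": task.output_testable,
        "step definition": task.steps_definable,
    }
    return [name for name, ok in checks.items() if not ok]

def recommend(task: TaskProfile) -> str:
    """All three hold: workflow. Otherwise, localize autonomy to what failed."""
    failed = failed_checks(task)
    if not failed:
        return "workflow"
    return "workflow with contained autonomy for: " + ", ".join(failed)
```

The second return value reflects the point that a failed check usually argues for a contained pocket of autonomy, not for making the whole system agentic.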

Complexity is not autonomy

The standard objection is that a real task is too complex for a fixed structure. This conflates two different things. Complexity is about how much branching, coordination, and conditional handling a task involves. Autonomy is about whether the model, rather than the author, decides the path. A workflow can be highly complex while remaining fully authored.

A small toolkit of composable patterns absorbs a great deal of complexity without surrendering control of the flow. Chaining passes the output of one step as the input to the next, in a linear, testable sequence. Routing sends each input to a specialized downstream branch through a classifier, which may itself call a model while the topology stays fixed. Parallelization fans work out across concurrent calls and reconverges at a synchronization point before the system proceeds. Orchestrator-workers decomposes a goal, delegates the pieces to specialized workers, and aggregates their results. Evaluator-optimizer pairs a step that produces a candidate with a step that judges it against criteria, looping until the output clears a quality bar.

The decisive point is that none of these require autonomy. A routing step that calls a model is still a workflow, because a person decided that routing happens here. An evaluator loop is still a workflow, because the loop’s structure was authored, not discovered. The model is making local decisions while the author makes the structural ones. You can handle routing, fan-out concurrency, decomposition, and iterative quality control, most of what makes real systems feel complicated, long before genuine runtime judgment over the structure itself becomes necessary.
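
A minimal sketch of four of these patterns shows how little machinery they need. The function names are hypothetical, and each step is an ordinary function from string to string standing in for what might be a model call in a real system. In every case the topology is written by the author.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Step = Callable[[str], str]  # any step, model-backed or not (assumed shape)

def chain(steps: List[Step], text: str) -> str:
    """Chaining: each step's output is the next step's input."""
    for step in steps:
        text = step(text)
    return text

def route(classifier: Callable[[str], str], branches: Dict[str, Step], text: str) -> str:
    """Routing: the classifier picks a branch, but the set of branches is fixed."""
    return branches[classifier(text)](text)

def fan_out(workers: List[Step], text: str) -> List[str]:
    """Parallelization: run workers concurrently, reconverge in worker order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda worker: worker(text), workers))

def evaluate_optimize(produce: Step, accept: Callable[[str], bool],
                      text: str, max_rounds: int = 3) -> str:
    """Evaluator-optimizer: refine until accepted or the authored bound is hit."""
    candidate = text
    for _ in range(max_rounds):
        candidate = produce(candidate)
        if accept(candidate):
            break
    return candidate
```

Even the loop in evaluate_optimize is authored: its existence, its exit condition, and its upper bound were all decided at design time, which is why it remains a workflow.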

How the two architectures fail

The reliability gap is easy to treat as a matter of degree, where agents fail somewhat more often and workflows somewhat less. The real difference is structural, and it matters more.

In an authored workflow, a failure is localized by construction. Each step has defined inputs and outputs, so when something goes wrong, it surfaces at a specific, nameable boundary. You can reproduce it by replaying that step’s inputs, and the behavior will be the same each time. Diagnosis is a matter of reading a structured record of what ran.
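
A short sketch of what localized by construction can look like in practice. It assumes each step is a deterministic function and that the pipeline records every step's input. The names run_pipeline, replay, and STEPS are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Steps = List[Tuple[str, Callable[[Any], Any]]]

def run_pipeline(steps: Steps, payload: Any, log: List[Dict[str, Any]]) -> Any:
    """Run named steps in order, recording each step's input and outcome."""
    for name, fn in steps:
        entry: Dict[str, Any] = {"step": name, "input": payload}
        log.append(entry)
        try:
            payload = fn(payload)
        except Exception as exc:
            entry["error"] = repr(exc)   # the failure surfaces at a named boundary
            return None
        entry["output"] = payload
    return payload

def replay(entry: Dict[str, Any], steps: Steps) -> Any:
    """Re-run one recorded step in isolation with exactly the input it saw."""
    return dict(steps)[entry["step"]](entry["input"])

# Hypothetical two-step pipeline: trim whitespace, then parse a number.
STEPS: Steps = [("strip", str.strip), ("parse", float)]
```

Running the pipeline on an input such as " 12,50 " stops at the parse step, and the last log entry holds the exact input that step received. Passing that entry to replay raises the same ValueError again, with no need to re-run anything upstream.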

In an agentic system, a failure can originate anywhere in a chain of decisions the system made for itself, and the consequences can compound before anyone notices. An early misjudgment shapes the context the next decision is made in, which shapes the one after that. By the time the output is visibly wrong, the cause may be several decisions upstream, buried in reasoning that depended on runtime state you no longer have. Worse, the behavior may not reproduce, because it was contingent on a particular sequence of intermediate observations. Debugging shifts from reading a log to reconstructing a path, a categorically harder activity, and one that does not get easier with scale.

This is why the reliability question cannot be settled by looking at how often each architecture succeeds in a demo. The question is what happens on the bad runs, how quickly you can understand them, and whether you can prevent their recurrence. On all three, authored control flow has a structural advantage that no amount of model quality erases.

The real cost of autonomy

The cost conversation usually starts and stops at tokens, which understates the real bill. Token spend is real. An agent makes unplanned model calls, and planning and re-planning are themselves calls, so its per-run cost is higher and less predictable. But the larger costs sit downstream of that.

There is the engineering time spent diagnosing failures that do not reproduce, which is open-ended in a way that reading a structured log is not. There is the operational burden of a system whose latency and spend vary with runtime decisions, which makes capacity planning and performance guarantees harder to offer. And there is the cost of confidence itself. A system whose behavior you cannot fully predict demands more guardrails, more monitoring, and more human attention to run safely.

These costs recur. They are not a one-time integration tax that amortizes away. They are paid on every run and across every on-call rotation for as long as the system is in service. When autonomy is genuinely required, that ongoing cost is justified by capability that nothing else can provide. When it is not required, you are paying a permanent operational premium for flexibility the task never asked for.

Why agentic systems are hard to test

For many systems, the deciding factor is not cost or even reliability in the abstract. It is whether the system can be tested at all in a way the team can sustain.

An authored workflow is testable the way ordinary software is testable. Each step has a contract, so you can write cases against it, assert on outputs, and catch regressions deterministically. A change either passes the suite or it does not, and the suite means the same thing every time it runs.
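
For illustration, here is what a contract test for a single authored step might look like. The step and its order-ID format (ORD- followed by six digits) are invented for this example.

```python
import re
from typing import Optional

# Hypothetical contract: an order ID is "ORD-" followed by exactly six digits.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")

def extract_order_id(text: str) -> Optional[str]:
    """One authored step: return the first order ID in the text, or None."""
    match = ORDER_ID.search(text)
    return match.group(0) if match else None

def test_extract_order_id() -> None:
    """Exact assertions: the same inputs give the same verdict on every run."""
    assert extract_order_id("Where is ORD-123456?") == "ORD-123456"
    assert extract_order_id("no id here") is None
    assert extract_order_id("ORD-12345") is None      # too few digits
    assert extract_order_id("ORD-1234567") is None    # too many digits
```

A test runner such as pytest would collect test_extract_order_id as written; calling the function directly works as well.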

An agentic system resists this. Because behavior is not fixed, you cannot assert that a given input produces a given sequence of actions. You can only characterize behavior statistically, across a distribution of cases, with acceptance thresholds rather than exact expectations. That is a real and sometimes necessary discipline, but it is a heavier one. It requires building and maintaining an evaluation harness, curating representative datasets, and accepting that passing is a probabilistic statement rather than a guarantee. Teams that adopt autonomy without budgeting for that apparatus tend to discover, usually in production, that they have built a system they cannot confidently change. Whether a system survives years of modification often hinges on this asymmetry more than on any single run’s quality.
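
A minimal sketch of the statistical alternative, under loud assumptions: flaky_upper is a toy stand-in for a nondeterministic system, CASES is a two-item placeholder for a curated dataset, and the 0.8 threshold is arbitrary. A real harness would be considerably larger.

```python
import random
from typing import Callable, List, Tuple

Case = Tuple[str, str]                          # (input, expected output)
System = Callable[[str, random.Random], str]    # assumed shape of the system under test

def pass_rate(system: System, cases: List[Case], trials: int, seed: int = 0) -> float:
    """Fraction of (case, trial) pairs where the system's output matched."""
    rng = random.Random(seed)    # seeded so a given evaluation run can be repeated
    total = passed = 0
    for prompt, expected in cases:
        for _ in range(trials):
            total += 1
            passed += system(prompt, rng) == expected
    return passed / total

def flaky_upper(prompt: str, rng: random.Random) -> str:
    """Toy nondeterministic system: right most of the time, not always."""
    return prompt.upper() if rng.random() < 0.9 else prompt

CASES: List[Case] = [("abc", "ABC"), ("hello", "HELLO")]  # placeholder dataset

def release_gate(system: System, threshold: float = 0.8, trials: int = 50) -> bool:
    """Acceptance is a threshold on a rate, not an exact expectation."""
    return pass_rate(system, CASES, trials) >= threshold
```

The gate answers a different question from a unit test. It says the system was right often enough on this sample, which is why the dataset and the threshold become artifacts the team has to maintain.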

Autonomy and the cost of mistakes

The case for determinism sharpens as the consequences of a wrong action grow. Where mistakes are cheap and easily reversed, a degree of autonomy is tolerable, since an occasional wrong turn costs little and corrects easily. Where actions are irreversible or expensive to undo, the calculus changes completely. The same unpredictability that is a minor annoyance in a low-stakes setting becomes an unacceptable liability when a single autonomous misjudgment can cause durable harm.

This gives a useful design heuristic. Let the reversibility of an action govern how much autonomy you allow near it. Authored control flow and explicit human checkpoints belong wherever the cost of a mistake is high and hard to walk back. Autonomy, where you use it at all, belongs where the system can be wrong cheaply and recover on its own. Architecture should track consequence, not ambition.
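
One way to encode this heuristic is a lookup from an action's reversibility to the kind of control allowed near it. The sketch below is illustrative only: the three tiers, the action names in CATALOG, and the choice to treat unknown actions as irreversible are all assumptions, not a prescribed scheme.

```python
from enum import Enum
from typing import Dict

class Reversibility(Enum):
    CHEAP_TO_UNDO = "cheap_to_undo"
    COSTLY_TO_UNDO = "costly_to_undo"
    IRREVERSIBLE = "irreversible"

class Control(Enum):
    AUTONOMOUS = "model may decide"
    AUTHORED = "fixed control flow only"
    HUMAN_CHECKPOINT = "requires human approval"

# Assumed policy: the harder a mistake is to walk back, the tighter the control.
POLICY: Dict[Reversibility, Control] = {
    Reversibility.CHEAP_TO_UNDO: Control.AUTONOMOUS,
    Reversibility.COSTLY_TO_UNDO: Control.AUTHORED,
    Reversibility.IRREVERSIBLE: Control.HUMAN_CHECKPOINT,
}

# Hypothetical action catalog for illustration.
CATALOG: Dict[str, Reversibility] = {
    "draft_email": Reversibility.CHEAP_TO_UNDO,
    "send_email": Reversibility.COSTLY_TO_UNDO,
    "delete_account": Reversibility.IRREVERSIBLE,
}

def control_for(action: str) -> Control:
    """Unclassified actions get the most conservative treatment."""
    return POLICY[CATALOG.get(action, Reversibility.IRREVERSIBLE)]
```

Defaulting unknown actions to the strictest tier means a newly added tool cannot acquire autonomy by omission.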

When an agent is the right choice

None of this argues that autonomy is never warranted. Autonomy is warranted, and powerfully, when the path to the goal genuinely cannot be enumerated in advance. When the next step depends on what prior steps revealed. When the environment is too large or too dynamic to map. When the task is open-ended exploration whose shape is unknown until it is underway. In those cases a fixed workflow does not merely underperform. It cannot express the task at all, and reaching for autonomy is the correct engineering response.

The mature pattern, even then, is rarely a fully autonomous system. Instead, it is an authored skeleton with a contained agentic pocket. A deterministic structure delegates one bounded, well-scoped subtask to a model allowed to decide its own steps, inside limits the surrounding workflow enforces. The autonomy is real but local, observable at its boundary, constrained in what it can touch, and recoverable if it goes wrong. This is how to get the capability of runtime judgment without surrendering the predictability of the system as a whole. The goal is not to pick a side of the spectrum but to place each part of the system at the right point on it.
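
A contained agentic pocket might be sketched as follows. The names are hypothetical, and choose stands in for a model selecting its own next action. The surrounding workflow owns the limits: an allow-list of tools, a step budget, and an authored fallback.

```python
from typing import Callable, Dict, List, Optional, Set

Tool = Callable[[str], str]

def run_pocket(task: str, choose: Callable[[List[str]], str], tools: Dict[str, Tool],
               allowed: Set[str], max_steps: int) -> Optional[str]:
    """Let the model pick its own steps, inside limits the caller enforces."""
    trace = [task]
    for _ in range(max_steps):
        action = choose(trace)
        if action == "done":
            return trace[-1]
        if action not in allowed:      # constrained in what it can touch
            return None
        trace.append(tools[action](trace[-1]))
    return None                        # step budget exhausted

TOOLS: Dict[str, Tool] = {"lookup": lambda s: s + " (looked up)",
                          "delete": lambda s: "deleted"}

def handle_ticket(ticket: str, choose: Callable[[List[str]], str]) -> str:
    """Authored skeleton: fixed steps before and after the delegated subtask."""
    cleaned = ticket.strip()                                    # authored step
    result = run_pocket(cleaned, choose, TOOLS,
                        allowed={"lookup"}, max_steps=3)        # bounded pocket
    if result is None:
        return "escalated to a human"                           # authored recovery
    return "resolved: " + result
```

A policy that looks something up and then stops yields a resolved ticket. A policy that reaches for delete, or one that never finishes within three steps, falls through to the escalation path that the workflow, not the model, defined.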

Spending autonomy deliberately

Default to determinism. Introduce autonomy at demonstrated gaps, the specific places where an authored approach provably cannot express the task, and contain it when you do. Treat each increment of autonomy as something you justify against a requirement, not something you assume because it sounds advanced. Let evidence rather than aspiration move you up the spectrum, and keep the movement as local as the task allows.

The architecture question is narrower and more useful than the industry’s framing suggests. Does this specific decision require judgment at runtime? For most decisions, in most systems, it does not. Workflows are not the cautious option or the lesser option. For predictable, specifiable, design-time-authorable work, they are the option that is cheaper, faster, and easier to audit, test, and fix, the option that holds up over the years a system actually has to run. Autonomy is a real tool with a narrow mandate. The engineers who get the most from it are the ones who spend it deliberately, contain it carefully, and call the rest of their systems what they are.