Tool Use, Not Intelligence, Is What Makes a Model an Agent
A language model, on its own, cannot do anything. It can produce text, and that is the entire extent of its reach. It cannot read a file, query a database, send a message, or check whether the claim it just made is true. The most capable model ever built, handed a goal it understands perfectly, can still only respond by writing words about it. The leap from that to an agent, something that pursues a goal by acting in the world and adjusting to what comes back, is not a leap in intelligence. It is the addition of tools. Tool use is the mechanism that converts a text generator into an actor, and everything people mean when they call a system “agentic” follows from it.
This is worth stating plainly because the industry’s intuition runs the other way. The assumption is that agency scales with model capability: a smarter model is a more agentic one. Capability matters, but it is not what makes the difference in kind. A weaker model with tools can act, observe, and recover. A stronger model without them can only describe what it would do. The dividing line between a system that does things and a system that talks about things is not drawn through the model at all. It is drawn at the tool layer.
A model on its own is a closed system
Consider what a model actually has access to during inference. It has the text in its context window and the patterns encoded in its weights. From those, it produces more text. There is no other channel. It cannot observe the consequences of its own output, because by the time the output exists, the model’s turn is over. Whatever it generated is now someone else’s problem to act on. The model is, in the most literal sense, closed off: information flows in as context and out as text, and there the loop stops.
That closure is the constraint tool use exists to break. A tool gives the model a way to emit something other than a final answer: a structured request that a surrounding system will act on. The model then receives the result of that action back as new context on its next turn. The model still only produces text and still only reads text. What changes is that some of the text it produces is now interpreted as an instruction to the outside world, and some of the text it reads is now a report of what happened when that instruction ran. The model has not gained any new internal faculty. It has gained a membrane, a controlled boundary across which its words can cause effects and the effects can be reported back. Agency lives at that membrane, not inside the model.
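The round trip described above can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: the JSON shape with "tool" and "args" keys, the function names interpret and step, and the "observation:" prefix are all assumptions made for the example. Validation of the request is deliberately left out here.

```python
import json

def interpret(model_output: str):
    """Classify generated text as a tool request or a final answer.

    Assumed convention: a tool request is a JSON object with a "tool" key.
    """
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return ("answer", model_output)
    if isinstance(parsed, dict) and "tool" in parsed:
        return ("tool_call", parsed)
    return ("answer", model_output)

def step(context: list, model_output: str, tools: dict) -> list:
    """One crossing of the boundary: text out, effect, report back in as text."""
    kind, payload = interpret(model_output)
    if kind != "tool_call":
        return context + [model_output]
    fn = tools.get(payload["tool"])
    if fn is None:
        observation = f"error: unknown tool {payload['tool']!r}"
    else:
        # No argument checking here; a real runtime would validate first.
        observation = str(fn(**payload.get("args", {})))
    # The result re-enters the model's world only as more text in context.
    return context + [model_output, "observation: " + observation]
```

The model side of this exchange never changes. It writes a string and later reads a string, and everything that turns the string into an effect lives in step.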
This is why “a more capable model” is the wrong place to invest if what you want is a more capable agent. Raising the model’s reasoning ability sharpens the decisions it makes inside the membrane. It does nothing about what the membrane lets through. An agent that cannot reach the system it needs to act on is blocked no matter how well it reasons, and an agent whose only tool returns garbage will act on garbage no matter how smart it is. The ceiling on what an agent can accomplish is set jointly by the model and by what its tools let it touch and learn, and for most real systems the second factor is the one left underdesigned.
From the model’s side, a tool is only a description
The word “tool” invites a misconception that is worth clearing up, because it shapes how you design one. From the model’s side, a tool is not code. The model never executes anything. What reaches the model at inference time is purely descriptive: a label, a plain-language account of the tool’s purpose, and the shape of the arguments it will accept. That definition is text, and the model treats it the way it treats all text, as something to read and reason over. When the model decides a tool applies, it produces a tool call: a structured object naming the tool and supplying arguments. That object is still just generated output. Some other system has to recognize it, run the real function behind it, and hand back the result.
The consequence is that the model’s entire understanding of a tool comes from its description. It does not see the implementation, the database schema behind a query tool, the API the call ultimately hits, or the validation the runtime will apply. If the description says a tool “processes data,” the model has no reliable basis for deciding when to reach for it, because that phrase is true of almost anything. A description that pins down the tool’s effect, the inputs it requires, and the form of its result gives the model a real basis for judging when the tool fits. The description is not documentation that sits alongside the tool. For the purpose of selection, the description is the tool, the only version of it the model can perceive.
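To make the contrast concrete, here is a hedged sketch of two tool definitions as the model might receive them. The field names (name, description, parameters) follow a common JSON-Schema-like convention but are assumptions, as are the tool names process_data and lookup_order. Nothing here is tied to a particular provider's API.

```python
import json

# A vague definition: true of almost anything, so it gives no basis for selection.
VAGUE_TOOL = {
    "name": "process_data",
    "description": "Processes data.",
    "parameters": {"type": "object", "properties": {}},
}

# A precise definition: states the effect, the required input, and the result shape.
PRECISE_TOOL = {
    "name": "lookup_order",
    "description": (
        "Look up a single customer order by its ID. Read-only. "
        "Returns the order status and the list of item names."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, for example 'A-1001'.",
            },
        },
        "required": ["order_id"],
    },
}

def render_for_model(tool: dict) -> str:
    """Serialize a definition to text; this text is all the model perceives."""
    return json.dumps(tool, indent=2)
```

The function behind lookup_order does not appear anywhere in what gets rendered. Whatever the implementation does, the model can select on only the words in the definition.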
This also means a tool call is a proposal, not an action. The model emits its best guess at a well-formed request, and that guess can be wrong: a missing required argument, a value of the wrong type, a tool chosen that does not fit the situation. Nothing about the model generating a tool call guarantees the call is valid or sensible. The system on the other side of the membrane is responsible for checking it before anything irreversible happens. Treating the model’s output as an intention to be validated, rather than a command to be obeyed, is the posture the rest of an agent’s reliability depends on.
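One way to treat a call as a proposal is to check it against a declared specification before anything runs. The sketch below assumes a hypothetical registry, TOOL_SPECS, that maps each tool name to its required arguments and their types. A production system would more likely use a full schema validator, so read this as an illustration of the posture rather than a recommended implementation.

```python
from typing import Any

# Hypothetical registry: tool name -> required argument names and their types.
TOOL_SPECS: dict[str, dict[str, type]] = {
    "lookup_order": {"order_id": str},
}

def validate_call(call: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the proposal is well formed."""
    name = call.get("name")
    if name not in TOOL_SPECS:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    problems = []
    for arg, expected in TOOL_SPECS[name].items():
        if arg not in args:
            problems.append(f"missing required argument: {arg}")
        elif not isinstance(args[arg], expected):
            problems.append(f"{arg} must be {expected.__name__}")
    return problems
```

Returning the problems as strings, rather than raising, lets the runtime hand them back to the model as an observation it can use to correct its next proposal.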
The model proposes and the runtime disposes
Every tool call crosses a boundary, and the boundary is the most important structural fact about tool use. On one side is the model, which decides what it wants to do and expresses that as a structured request. On the other side is the runtime, the surrounding program that receives the request, decides whether to honor it, executes the underlying function or API call, captures the result, and formats that result as an observation to inject back into the model’s context. The model proposes. The runtime disposes. These are two separate parties with two separate responsibilities, and conflating them is the source of a whole class of design mistakes.
Keeping them separate is what makes an agent governable. Because execution happens in the runtime and not in the model, the runtime is where every guarantee an agent needs has to be enforced. Whether a destructive tool call requires confirmation, whether arguments are validated against their schema before the function runs, whether a tool is even available to this agent in this context, whether the call is logged for later audit, all of that is the runtime’s job, and none of it can be delegated to the model’s good intentions. The model can be instructed to be careful, and it will usually comply, but instruction is not enforcement. The reason the proposal-and-disposal split matters so much is that it is the only place where control is mechanical rather than probabilistic. Put the safety check in the prompt and you have a strong suggestion. Put it in the runtime, at the moment of disposal, and you have a guarantee.
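A small sketch of enforcement at the moment of disposal follows. The Runtime class, its register and dispose methods, the destructive flag, and the confirm callback are all hypothetical names chosen for illustration. Argument validation is omitted to keep the example short.

```python
from typing import Any, Callable

class Runtime:
    """Decides whether a proposed call takes effect; the model never executes anything."""

    def __init__(self, confirm: Callable[[str, dict], bool]):
        self._tools: dict[str, tuple[Callable[..., Any], bool]] = {}
        self._confirm = confirm  # e.g. a human approval prompt
        self.audit_log: list[tuple[str, dict, str]] = []

    def register(self, name: str, fn: Callable[..., Any], destructive: bool = False) -> None:
        self._tools[name] = (fn, destructive)

    def dispose(self, name: str, arguments: dict) -> dict:
        # Availability: a tool that is not registered cannot run, whatever the model asks.
        if name not in self._tools:
            self.audit_log.append((name, arguments, "rejected"))
            return {"ok": False, "error": f"tool {name!r} is not available"}
        fn, destructive = self._tools[name]
        # Confirmation: destructive calls run only if the callback approves.
        if destructive and not self._confirm(name, arguments):
            self.audit_log.append((name, arguments, "denied"))
            return {"ok": False, "error": "confirmation denied"}
        result = fn(**arguments)
        self.audit_log.append((name, arguments, "executed"))
        return {"ok": True, "result": result}
```

The checks in dispose hold regardless of what the prompt says or how the model behaves, which is the sense in which this control is mechanical rather than probabilistic.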
This boundary is also where an agent’s trust model lives. The model is a nondeterministic component that will, eventually, propose something you did not anticipate. The runtime is the deterministic component that decides what proposals are allowed to take effect. Designing an agent responsibly means assuming the model will sometimes ask for the wrong thing and ensuring the runtime is positioned to catch it. The membrane is not just a channel for action. It is the control point.
Who chooses the tool is the whole distinction
Tools alone do not make a system an agent. A program can call functions in a fixed order and involve a model at every step without being agentic in any meaningful sense. The property that matters is not whether tools are used but who decides which tool runs and when. That single question separates a workflow from an agent, and it separates them cleanly.
Where the control flow originates is the dividing line. When an author has written the order of operations into the program, the model is a participant inside a structure it cannot alter: it can supply a parameter, reshape an intermediate value, or produce a span of text, but what runs next was settled before the first request arrived, and the route is identical on every pass. When instead the model picks each tool from the situation in front of it, reading what it has done and what remains, the order is assembled live and differs from one run to the next. Handing that choice to the model is what earns a system the name agent. Count the model’s calls if you like; the number settles nothing. A program that calls a model at every node and still dictates the order of those calls is a workflow. A program that lets the model set the order is an agent, however few tools it holds.
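The distinction can be shown with the same two tools driven two ways. In this sketch the tools, the trace format, and the choose callback are illustrative assumptions. The callback stands in for the model, and stub_choose is a deterministic placeholder for what would in practice be a model's decision.

```python
from typing import Callable, Optional

TOOLS: dict[str, Callable[[str], str]] = {
    "fetch": lambda text: text + " [fetched]",
    "summarize": lambda text: text + " [summarized]",
}

def run_workflow(task: str) -> list[str]:
    """Workflow: the author fixed the order, so the route is the same on every run."""
    trace = []
    for name in ("fetch", "summarize"):
        task = TOOLS[name](task)
        trace.append(name)
    return trace

def run_agent(task: str,
              choose: Callable[[str, list[str]], Optional[str]],
              max_steps: int = 5) -> list[str]:
    """Agent: the chooser picks each tool from the current state, or None to stop."""
    trace = []
    for _ in range(max_steps):
        name = choose(task, trace)
        if name is None or name not in TOOLS:
            break
        task = TOOLS[name](task)
        trace.append(name)
    return trace

def stub_choose(task: str, trace: list[str]) -> Optional[str]:
    """Placeholder policy: do only what the current state still needs."""
    if "[fetched]" not in task:
        return "fetch"
    if "[summarized]" not in task:
        return "summarize"
    return None
```

Given a task that has already been fetched, run_workflow still runs both tools in the authored order, while run_agent with stub_choose skips straight to summarize. The max_steps cap is one example of bounding behavior when the paths can no longer be listed in advance.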
Naming this precisely has a practical payoff, because the two designs have opposite risk profiles. When the order of tool calls is fixed, you can walk every path the system can take and account for each one, because you authored them. When the model sets the order, the reachable paths multiply past the point of walking them, and the discipline shifts from listing behaviors to bounding them. That loss of a fixed structure is the real price of runtime tool selection. What you buy is an agent that can meet a situation no one scripted for, reaching for a tool in an order no one specified because no one foresaw the need. Whether the trade is worth it is a design decision rather than a default, and the honest way to weigh it is to admit that what you are deciding is whether to hand tool selection to the model at all.
Reading, computing, and writing are three different kinds of risk
Once a model can call tools, the tools it can call are not interchangeable, and lumping them together hides the part of the design that matters most. It helps to sort them by what they let the model do to the world. Some tools let the model read: look up a row in a datastore, retrieve a document, search a knowledge base. These bring information across the membrane into context, and their characteristic failure is returning the wrong information, stale, irrelevant, or misleading, which the agent then reasons over as if it were sound. Some tools let the model compute: run a calculation the model cannot do reliably on its own, parse a structured payload, execute code in an isolated environment. These extend the model’s capability beyond text generation, and their failure modes are the failure modes of the computation itself. And some tools let the model write: post a message, update a record, kick off a downstream job. These change the state of the world, and they are categorically different from the other two, because their effects do not stay inside the agent.
The asymmetry between reading and writing is the one to design around. A read that goes wrong pollutes the agent’s context, and a well-built agent can often notice the bad data and recover, because the damage is contained in its own working state. A write that goes wrong changes something outside the agent that may not be reversible, and no amount of subsequent reasoning undoes a message already sent or a record already deleted. This is why the functional category of a tool is the first thing to establish about it, before its description or its schema. The category tells you how much the runtime needs to scrutinize the call before honoring it. Read tools can often be permitted freely. Write tools, especially destructive or externally visible ones, are exactly the proposals the runtime should be slowest to honor without a check. An agent’s blast radius is not set by how smart its model is. It is set by which write tools you put within its reach.
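Sorting tools by effect can be as simple as tagging each one and letting the tag decide how much scrutiny a call gets. The Effect enum, the TOOL_EFFECTS table, and the tool names below are hypothetical. Treating an untagged tool as a write is one cautious default, assumed here for illustration.

```python
from enum import Enum

class Effect(Enum):
    READ = "read"        # brings information into context
    COMPUTE = "compute"  # extends what the model can work out
    WRITE = "write"      # changes state outside the agent

# Hypothetical tagging of tools by what they let the model do to the world.
TOOL_EFFECTS = {
    "search_docs": Effect.READ,
    "run_calc": Effect.COMPUTE,
    "send_message": Effect.WRITE,
}

def needs_check(tool_name: str) -> bool:
    """Writes need a check before execution; untagged tools are treated as writes."""
    return TOOL_EFFECTS.get(tool_name, Effect.WRITE) is Effect.WRITE
```

A runtime could consult needs_check before honoring any proposal, so that adding a new write tool widens the blast radius only through an explicit entry in the table.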
The return value is where an action becomes information
A tool call that produced no return value would be a strange thing to give an agent, because the return is what makes tool use part of a loop rather than a one-way emission. After the runtime executes a call, it captures the result and injects it into the context the model will read on its next turn. That injected result, the observation, is what turns an action into information the agent can use. The agent does something, sees what came of it, and decides its next move in light of that. Acting without observing would be acting blind.
This is the part of tool use that most distinguishes an agent from a function that happens to be called by a model. A successful result confirms progress, and the agent moves on. A failure is not a dead stop but a fact to absorb: the agent sees the call did not work and can adjust its arguments, abandon this tool for another, or judge the obstacle out of its reach and hand the task up. A result that comes back incomplete invites a narrower follow-up to fill the gap before the agent commits to a next step. What the agent can do next is constrained by how legible the last result was. A tool can finish its job and still hand back a sprawling, unstructured dump the agent has to dig through, or it can hand back a tight, readable result it can act on directly. That difference does not change what the tool did. It changes what the agent can do afterward. The full mechanics of how perception, reasoning, action, and observation chain into a repeating cycle are a subject of their own. What belongs here is the narrower point that an action is only as useful as the observation it produces, and that the return path across the membrane deserves the same design attention as the request.
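The difference between a sprawling dump and a legible result can be sketched as a formatting step on the return path. The function name to_observation, its field names, and the max_items cutoff are assumptions for illustration. The right shape for an observation depends on the tool and on how much context the agent can spare.

```python
import json

def to_observation(tool: str, ok: bool, data=None, error: str = "",
                   max_items: int = 5) -> str:
    """Format a tool result as a compact, explicit observation for the model's context."""
    obs = {"tool": tool, "ok": ok}
    if not ok:
        # A failure is reported as a fact the agent can act on, not swallowed.
        obs["error"] = error
    elif isinstance(data, list) and len(data) > max_items:
        # Long lists are cut down, and the cut is stated so the agent can ask for more.
        obs["items"] = data[:max_items]
        obs["truncated"] = True
        obs["total"] = len(data)
    else:
        obs["result"] = data
    return json.dumps(obs)
```

Marking a truncated result as truncated is what lets the agent recognize an incomplete answer and issue a narrower follow-up instead of treating the partial list as the whole.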
Designing for agency means designing the tool layer
If tool use is what constitutes agency, then the tool layer is where the design effort belongs, and that is a reallocation of attention from where it usually goes. The instinct is to treat the model as the system and the tools as plumbing attached to it. The relationship is closer to the reverse. The model supplies judgment, but the tools determine what that judgment can perceive, what it can affect, and how cleanly the results of its actions come back. An agent is only as capable as its tools let it be, only as safe as its runtime makes their execution, and only as reliable as the observations those tools return.
That reframing redirects the practical questions. The leverage is not in finding a more capable model, though a more capable model helps at the margins. It is in writing tool descriptions precise enough that the model selects correctly, defining schemas strict enough that the runtime can reject malformed calls, sorting tools by what they let the agent do to the world and gating the dangerous ones at the point of execution, and designing return values that land in context as signal rather than noise. None of that is model work. All of it is the work of building the membrane through which a closed text generator reaches the world and reads back what it found. Build that membrane well, and a capable model becomes a capable agent. Leave it underbuilt, and the most capable model in the world stays exactly what it was on its own: something that can only tell you what it would have done.
