The Description, Not the Name, Decides Which Tool an Agent Calls

When an agent reaches for the wrong tool, the instinct is to inspect the code: a broken function, a malformed schema, a name that should have been clearer. Almost none of that touched the decision. By the time any tool executes, the model has already settled on it, and it settled by reading natural language. The thing that actually drove the choice is the description, the prose attached to each tool that explains its purpose and the situations it is meant for. A name only nudges. A schema constrains the arguments after the tool is already chosen. The implementation is invisible to the model entirely. Whether the right tool gets picked is, more than anything else, downstream of how well its description reads, and teams that treat it as anything else spend their time debugging a layer where the decision was never made.

This is worth stating bluntly because it inverts a reasonable engineering intuition. We are used to dispatch being mechanical: a name maps to a function, a type signature constrains the arguments, and the compiler or the router enforces the match. Tool selection in an agent works nothing like that. There is no dispatch table. There is a model reading a set of descriptions and deciding, in language, which one fits the request in front of it. Once you internalize that the routing layer is prose and the matching is semantic, the entire problem becomes tractable, because prose is something you can write deliberately and test.

A tool call begins as a matching problem

Consider what the model is actually working with when a request arrives. Each tool available to it sits in context as a block of prose, and the model takes in all of those blocks at once. It is not scanning for matching keywords. It is judging which description, read in full, best fits what is being asked, and it calls the tool behind that description. Nothing else in the loop has a vote. The model never runs a few candidates to compare their results, and it has no access to the logic underneath any description. The selection is made on the reading and nothing more.

Two properties of this process shape everything that follows. The first: there is no magic word. A short description padded with the right nouns will not beat a fuller one, because the model is weighing each description as a whole rather than scanning for a trigger token. Loading a description with keywords buys nothing if the prose around them does not actually account for what the tool is and when it applies. The second: the judgment is made anew on each turn. The model carries no memory that it preferred a given tool a moment ago and no bias toward repeating that choice. Every request is scored against the current set of descriptions from scratch, which has an unsettling implication: a tool that wins its requests today can quietly start losing them the day a similar tool joins the set, even though its own definition never changed.

A request can end up in one of four places, and each points back to the same lever. It routes cleanly, because one description was sharp enough to make its tool the obvious fit. It misroutes, because two descriptions blurred together or one was too thin to stand apart. It lands nowhere, because nothing on offer captured what was being asked. Or it stalls into a question back to the user, because the model could not commit and chose to ask instead of gambling. That last outcome is the only failure that fails gracefully, and it is still a failure: the descriptions did not hand the model enough to decide. All four trace to how the prose was written, which is the whole reason the description deserves the attention.
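The four landing places can be made concrete as labels on a logged turn. What follows is a minimal sketch, not a prescribed taxonomy: the `Outcome` names and the three arguments to `classify` are hypothetical fields you would have to record yourself.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ROUTED = "routed"        # the expected tool was called
    MISROUTED = "misrouted"  # some other tool was called
    NO_MATCH = "no_match"    # no tool was called and no question was asked
    CLARIFIED = "clarified"  # the model asked the user instead of committing


def classify(expected_tool: str, called_tool: Optional[str], asked_user: bool) -> Outcome:
    """Label one turn. All three arguments are hypothetical log fields."""
    if called_tool == expected_tool:
        return Outcome.ROUTED
    if called_tool is not None:
        return Outcome.MISROUTED
    if asked_user:
        return Outcome.CLARIFIED
    return Outcome.NO_MATCH
```

Tallying these labels per tool gives a first reading of which descriptions need work, since every label other than `ROUTED` points at prose rather than code.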

The name is for humans, the description is for the model

The most common form of this mistake is trusting a clear name to carry the routing. It cannot, for a structural reason. A name is a handle built for a person scanning source, compressed precisely to the point where it stops being useful to a model that has to separate one tool from several believable rivals. Picture a tool whose name is some single verb like “lookup.” Behind that name could sit a query against a search index, a fetch from a relational store, a walk over the filesystem, or a nearest-neighbor query over embeddings. The name commits to none of them, and the model will not reverse-engineer the intent from a naming habit. Give two tools names that read alike but do unlike work and they will contend for the same requests, won by one tool today and the other tomorrow, until the descriptions finally tell them apart.

It also explains a result that surprises people: an immaculately named tool wrapped around a spotless schema will still be passed over when its description is thin. The name and the schema earn their place in a tool definition, for human readability and for assembling a valid call once the choice is locked in. Neither one is what tips the selection, and neither one redeems a description that never says what the tool is for. Where the tool falls in the list will not rescue it either. The moment you catch yourself explaining a misroute in terms of names or argument types, you have wandered into the wrong layer. The decision happened in prose, so the repair happens in prose.

What a description has to contain to route

A description that routes well carries three things, and teams reliably ship the first two and forget the third. One: what the tool does and what it hands back, which fixes its identity. Two: the circumstances under which it is the right call, which is what pries it apart from neighbors covering adjacent ground. Three: the circumstances under which it is the wrong call, the stated edge that sends the model elsewhere. That third part looks like dead weight while you are writing a tool on its own, since on its own there is nothing for it to be confused with. It stops looking like dead weight the instant a second tool in a nearby domain joins the set, because then the boundary is the one thing letting the model keep the two straight. Omit it and you have planted a routing gap that hides until the day those two tools start fighting over live traffic.

Past those three, the description should also hint at what goes in and what comes out, because both shape the decision. Hinting at the inputs is not transcribing the schema into the prose. It is describing the arguments in words: that the tool wants a key of a certain kind, or a bounded time window, or a value drawn from a small fixed set. Done well, this does double duty, steering the model toward the right tool and toward a well-formed call, which are not the same win. The output hint completes the picture. Selecting a tool is, underneath, a prediction about what the call will return, and the model cannot make that prediction if the description stays silent on it. Spell out the shape of the return: one record or many, a flag or a status, something usable as-is or something that still needs work, and how a failure shows up. A description that pins down its effect, its triggers, its limits, its inputs, and its outputs leaves almost no slack for the wrong choice.
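One way to keep all five parts in view is to write them as separate fields and join them into prose afterward. This is a sketch under assumptions: the `ToolDescription` class, its field names, and the invoice tool are invented for illustration and are not tied to any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDescription:
    does: str      # what the tool does and hands back
    use_when: str  # circumstances where it is the right call
    not_when: str  # circumstances where it is the wrong call
    inputs: str    # the arguments described in words, not a schema dump
    returns: str   # shape of the result and how a failure shows up

    def render(self) -> str:
        """Join the five parts into the prose the model would read."""
        parts = [
            self.does,
            f"Use when {self.use_when}",
            f"Do not use when {self.not_when}",
            f"Inputs: {self.inputs}",
            f"Returns: {self.returns}",
        ]
        return " ".join(p.rstrip(".") + "." for p in parts)


# Hypothetical tool, for illustration only.
invoice_fetch = ToolDescription(
    does="Reads a single invoice from the billing store and changes nothing",
    use_when="the user asks about one invoice they can identify by number",
    not_when="the user wants to search across invoices or edit one",
    inputs="an invoice number, and optionally a bounded date window",
    returns="one invoice record, or a not-found status if the number is unknown",
)
```

Calling `invoice_fetch.render()` yields one paragraph that states the effect, the trigger, the limit, the inputs, and the outputs in that order. The structure exists for the author's benefit; what the model reads is still plain prose.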

Stated as failures, three gaps are what you actually find in the field, and naming them converts a vague complaint that routing is flaky into a specific edit. A description that announces what a tool is but never when it applies cannot be told apart from anything in its neighborhood. One that never marks what the tool refuses to handle gives the model no edge to route around. One that says nothing about the return leaves the model unable to check that the call even answers the request. Three concrete holes, each closable with a sentence.
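If descriptions are drafted as named parts before being joined into prose, the three holes can be flagged mechanically. The sketch below assumes a plain dictionary with hypothetical keys `use_when`, `not_when`, and `returns`. It only detects a missing part and says nothing about whether a part that is present is any good.

```python
from typing import Dict, List

# Hypothetical field names mapped to the gap each one closes.
GAP_MESSAGES = {
    "use_when": "never says when the tool applies",
    "not_when": "never marks what the tool refuses to handle",
    "returns": "says nothing about what comes back",
}


def find_gaps(description: Dict[str, str]) -> List[str]:
    """Return a message for each of the three parts that is missing or blank."""
    return [
        message
        for field, message in GAP_MESSAGES.items()
        if not description.get(field, "").strip()
    ]


# A draft that states identity and trigger but nothing else.
draft = {
    "does": "Reads a single invoice from the billing store.",
    "use_when": "The user asks about one invoice by number.",
}
```

Here `find_gaps(draft)` reports the missing boundary and the missing return, which are the two sentences still to be written.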

Two shapes of selection failure

When routing breaks it tends to break in one of two shapes, and they leave different fingerprints. In the first, nothing fits. The model finds no description that genuinely answers the request and has to do something anyway. The benign versions are loud: it names a tool that was never defined, or it declines outright. The version that should worry you is quiet. It picks the closest miss, runs it, and returns something that looks right and was wrong from the first token, surfacing only later as a result nobody can account for. The cure is seldom a new tool. It is usually an existing description rewritten to actually cover the request, in the words people use when they ask for it.

In the second shape, two tools fit about equally. With two strong contenders and nothing decisive to separate them, the model splits its choice across requests that look the same in any log. This one is harder to pin down precisely because the inputs do not visibly differ: the wobble lives in how close the two descriptions are, not in anything about the request. To a user it reads as the same action behaving differently from one day to the next, and that inconsistency corrodes trust faster than a clean error would. The fix is to make one tool the clear winner for the requests it owns, and that is writing, not coding.

Disambiguation is something you write

When two tools live in the same domain, you pull them apart on purpose, and a few techniques do most of the work. The strongest is to make each description open on an action word that none of its neighbors use. When one tool’s first word is unmistakably about reading and another’s is unmistakably about removing or rewriting, the model has a sharp signal of intent before it reaches the second clause. Reinforce that by stating scope as a fence, not just a capability. A line that says a tool reads from a given store and changes nothing in it does more at the moment of a tie than any amount of describing what it does do, because the part the model needs to break the tie is precisely the part about what it will not touch. Then tie each tool to one slice of the data, so this tool answers for billing and that one for accounts and a third for stock levels, giving the model a clean domain marker to aim at and keeping adjacent tools from melting into one another.
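The first technique, a distinct opening action word per tool, is easy to check before a tool set ships. The following is a rough sketch: it compares only the first word of each description, and the three invoice tools are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def shared_opening_words(descriptions: Dict[str, str]) -> Dict[str, List[str]]:
    """Group tool names by the lowercased first word of their description
    and return only the groups that contain more than one tool."""
    by_word = defaultdict(list)
    for name, text in descriptions.items():
        words = text.split()
        first = words[0].lower().strip(".,:;") if words else ""
        by_word[first].append(name)
    return {word: names for word, names in by_word.items() if len(names) > 1}


# Hypothetical tool set in one domain.
tools = {
    "invoice_fetch": "Reads one invoice from the billing store and changes nothing in it.",
    "invoice_void": "Voids one invoice in the billing store; it does not return invoice contents.",
    "invoice_search": "Reads invoices matching a filter from the billing store.",
}

collisions = shared_opening_words(tools)
```

In this set `collisions` flags `invoice_fetch` and `invoice_search`, which both open on "Reads". Rewording the second to open on "Searches" clears the collision. A flagged pair is a prompt to reread the two descriptions together, not proof that they misroute.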

Descriptions do the bulk of the routing, but the broader instructions can back them up where two genuinely close tools refuse to come apart on description alone. You can set aside a stretch of the system instructions to lay out the tools side by side and say, in one place, which to favor over which. For two tools that are close by their nature, where no edit to either description fully settles it, a stated tie-breaker decides on a basis you pick instead of leaving it to the toss. Spelling out when a given tool is the wrong call hands the model footing for exactly the ambiguous cases the descriptions leave open. All of this reinforces; none of it substitutes. If the instructions are carrying the routing on their own, the descriptions are underbuilt, and the scaffolding will collapse the next time you add a tool.
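A stated tie-breaker can be kept as data and rendered into the system instructions, which keeps the rules in one place and easy to review. This is a sketch only: the `TieBreaker` type, the heading text, and the sentence template are assumptions, and the wording that works best will depend on the model and on the rest of the instructions.

```python
from typing import List, NamedTuple


class TieBreaker(NamedTuple):
    prefer: str  # tool to favor
    over: str    # tool it is favored over
    when: str    # the ambiguous situation this rule settles


def render_tool_guidance(rules: List[TieBreaker]) -> str:
    """Render tie-breaker rules as a short block for the system instructions."""
    lines = ["Tool selection guidance:"]
    for rule in rules:
        lines.append(f"- When {rule.when}, prefer {rule.prefer} over {rule.over}.")
    return "\n".join(lines)


# Hypothetical rule for two tools that stay close after their descriptions are tightened.
guidance = render_tool_guidance([
    TieBreaker(
        prefer="invoice_search",
        over="invoice_fetch",
        when="the user names a customer but no invoice number",
    ),
])
```

If this list keeps growing, that is a sign the descriptions themselves are underbuilt and the rules are carrying the routing on their own.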

Descriptions are artifacts you evaluate

A well-written description gives you a router that ought to work. Proving it works across the spread of ways real requests show up is a separate job, and skipping it is how a team learns about its routing from a support queue rather than a test run. Three kinds of probe cover the ground. Throw the same intent at it in several different wordings and check that the same correct tool wins each time, which measures how much the routing leans on exact phrasing. Throw it requests this tool was never meant to serve and check that the model steps aside or asks rather than forcing the call through, which measures the boundary. Throw it requests that could honestly go to either of two tools and check that the right one wins on what its description says, not on where it sits in the list or how its name reads, which measures disambiguation head-on. Each probe catches what the other two let pass, and running all three before a tool set ships turns a hope that routing holds into evidence that it does.
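The three probes fit in a small harness. The sketch below assumes a `select_tool` callable that takes a request string and returns the chosen tool name, or `None` when the model declines or asks the user. In practice that callable would wrap a model call. Here a keyword stub stands in so the harness runs on its own, and the stub is deliberately naive so that it fails some probes.

```python
from typing import Callable, Dict, List, Optional

# Stand-in signature for whatever wraps the model call (an assumption).
Selector = Callable[[str], Optional[str]]


def run_probes(
    select_tool: Selector,
    paraphrases: Dict[str, List[str]],  # expected tool -> wordings of one intent
    out_of_scope: List[str],            # requests no tool should take
    ambiguous: Dict[str, str],          # close-call request -> tool that should win
) -> Dict[str, List[str]]:
    """Return the requests that failed each of the three probes."""
    failures: Dict[str, List[str]] = {"paraphrase": [], "boundary": [], "disambiguation": []}
    for expected, wordings in paraphrases.items():
        for request in wordings:
            if select_tool(request) != expected:
                failures["paraphrase"].append(request)
    for request in out_of_scope:
        if select_tool(request) is not None:
            failures["boundary"].append(request)
    for request, expected in ambiguous.items():
        if select_tool(request) != expected:
            failures["disambiguation"].append(request)
    return failures


def stub_selector(request: str) -> Optional[str]:
    """Naive stand-in for a model: routes on one keyword."""
    return "invoice_fetch" if "invoice" in request else None


report = run_probes(
    stub_selector,
    paraphrases={"invoice_fetch": ["show invoice 42", "what did we bill on 42"]},
    out_of_scope=["book a flight to Oslo"],
    ambiguous={"find the invoice for Acme": "invoice_search"},
)
```

With the stub, `report` lists one paraphrase failure and one disambiguation failure and no boundary failures. To test the point about list position, one option is to rerun the ambiguous set with the tools presented in a different order and confirm the winners do not change.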

Routing quality will not stay good on its own. Each tool you add, rename, or rewrite resets the selection, since the model always weighs every description against every other. How people word their requests moves over time as well. Put together, a tool can slip from winning a kind of request to losing it without a sound, and with no instrumentation you will never watch it happen. The cheap insurance is to record, for every call, which tool ran and the request that triggered it. When one tool’s share of calls drops off a cliff while what users want has not moved, something in the routing has shifted under you, usually a freshly added neighbor poaching the requests or an edit that muddied a boundary. Counting that log as part of the system rather than a nice-to-have is what lets you catch the slide before it turns into an incident.
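A call log of that kind can be compared across two periods to spot a tool whose share has collapsed. This is a minimal sketch: the log shape (pairs of tool name and request text), the function names, and the 50 percent relative threshold are all assumptions. It cannot tell whether user demand itself moved, so a flagged tool still needs a human look at the logged requests.

```python
from collections import Counter
from typing import Dict, List, Tuple

CallLog = List[Tuple[str, str]]  # (tool_name, request_text), a hypothetical log shape


def tool_shares(log: CallLog) -> Dict[str, float]:
    """Fraction of calls that went to each tool."""
    counts = Counter(tool for tool, _ in log)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


def share_drops(before: CallLog, after: CallLog, threshold: float = 0.5) -> List[str]:
    """Tools whose share of calls fell by more than `threshold`, relative to before."""
    old, new = tool_shares(before), tool_shares(after)
    return sorted(
        tool for tool, share in old.items()
        if new.get(tool, 0.0) < share * (1 - threshold)
    )


# Toy data: a new neighbor, invoice_search, appears in the second period.
before = [("invoice_fetch", "show invoice 42")] * 6 + [("invoice_void", "cancel invoice 7")] * 4
after = (
    [("invoice_fetch", "show invoice 42")] * 2
    + [("invoice_search", "find the Acme invoice")] * 4
    + [("invoice_void", "cancel invoice 7")] * 4
)

dropped = share_drops(before, after)
```

In the toy data `invoice_fetch` falls from 60 percent of calls to 20 percent once `invoice_search` joins the set, so it is the one tool flagged. Because the request text is logged next to the tool name, the next step is to read what the newcomer is now catching.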

Refine the description before you add a tool

The practice that falls out of all this is simple to say and routinely skipped. When a tool is routing badly, the first impulse is to build another tool to patch the hole. That impulse is almost always wrong. Another tool means another description that has to be read on every single turn, so it taxes every request, and it opens fresh seams of overlap with everything already in place. The question to ask first is whether the tool you already have would catch the case if its description were tighter. Most of the time it would, and sharpening the prose fixes the routing without growing the surface at all. Save a new tool for a need that truly sits outside everything you have, and when that day comes, write the newcomer’s description and rework its neighbors’ in the same sitting, so the lines between them are clean on arrival instead of patched up after the first time they collide.

The throughline is that tool selection is a language problem wearing the costume of a systems problem. The model is not inspecting your code, comparing type signatures, or honoring a dispatch table. It is reading prose and deciding what fits. That makes the description the highest-leverage artifact in the tool layer and the first place to look when routing goes wrong. Names help humans. Schemas shape the call once it is chosen. Implementations do the work after the decision is already made. But the decision itself is made on the description, every turn, by a model doing nothing more exotic than reading. Write the descriptions as if the reliability of the whole system depends on them, because it does.