Designing Extraction Schemas the Model Can Satisfy

A schema attached to an extraction task looks like a description of the output you expect back. It is really a set of instructions the model reads while it generates, and it shapes what comes out. That reframing is the whole subject. A schema is not a passive record of the shape you hope to receive and inspect afterward. It is an active input to generation, and every decision encoded in it either helps the model produce the right value or pressures it toward a wrong one.

Holding shape apart from correctness sharpens the point. Constraining generation to a schema can guarantee the shape: the declared fields present, correctly named, correctly typed. It says nothing about whether the values in those fields are true. A guarantee of shape laid over a poorly designed schema is worse than a parse failure, because a parse failure at least announces itself. What you get instead is output that validates cleanly, carries every field in the right type, and is wrong in ways nothing downstream will catch. Schema design is the work of closing the distance between a shape the mechanism can enforce and a shape the model can actually fill correctly from the inputs you will really see. The enforcement layer is settled elsewhere. What remains, and what determines whether extraction survives real documents, is the design of the schema itself.
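The gap between shape and correctness can be made concrete with a small sketch. The shape check, the field names, and the invoice values below are all hypothetical, chosen only for illustration; a real pipeline would more likely rely on a schema validator than on this hand-rolled function.

```python
# Hypothetical schema: field name -> expected Python type.
INVOICE_SHAPE = {"invoice_number": str, "total_cents": int}


def shape_errors(record: dict, shape: dict) -> list:
    """Report shape problems only: missing fields and wrongly typed fields."""
    errors = []
    for name, expected in shape.items():
        if name not in record:
            errors.append(f"missing: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type: {name}")
    return errors


# Assume the source document really says invoice INV-1042 for 129.00.
# This record has the right shape and the wrong values.
fabricated = {"invoice_number": "INV-0001", "total_cents": 10000}

# This record is incomplete, and the check says so.
incomplete = {"invoice_number": "INV-1042"}

fabricated_errors = shape_errors(fabricated, INVOICE_SHAPE)
incomplete_errors = shape_errors(incomplete, INVOICE_SHAPE)

print(fabricated_errors)  # nothing reported, although both values are wrong
print(incomplete_errors)  # the failure that announces itself
```

The fabricated record passes without complaint, while the incomplete one is flagged. Nothing in a shape check can tell the first record apart from a correct one.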

The required and optional line carries the most weight

A required field encodes a specific promise: whatever the input, this field will come back filled. The promise holds only when the source material actually carries that value on every single document. Where it does not, the schema asserts a presence the input cannot supply, and the model is left to reconcile two things that cannot both be true. It reconciles them by filling the field regardless. What it fills it with tends to look reasonable, and that is precisely the danger, because a reasonable-looking value survives the review a garbled one would fail and detonates later. Name the mechanism plainly. A required field the input cannot support does not protect you from missing data. It directs the model to manufacture it. This is not a lapse to be corrected with better prompting; the schema itself is demanding a value where the source holds none.

The opposite error is quieter and just as real. Mark too little as required and you have declared almost nothing about the output, which means every downstream consumer has to treat every field as possibly absent and guard accordingly. You have not removed the uncertainty about structure. You have relocated it from the schema, where you could decide it once, into runtime code that now handles structural variation on every call. Optional is not a way to avoid a decision. It is a decision, and it should be made for the fields where absence is a genuine, expected outcome rather than as a reflex to reduce risk.

The rule that falls out of this is deliberately narrow. A field belongs in the required set only when the source is certain to carry it on every input you will ever process, never merely because a value would be convenient to have. Everything short of that certainty is optional, and marking it so declares to the model and to every later reader of the output that an empty result here is legitimate, not a defect to be smoothed over. No other field-level choice carries as much weight, because this is the single point in the design where a decision made at the desk can directly produce false data at runtime.
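One way to apply the rule is to audit the required set against hand-labelled ground truth before trusting it. The sketch below assumes a JSON-Schema-style dictionary; the purchase-order fields, the sample values, and the helper names are hypothetical.

```python
# Hypothetical JSON-Schema-style definition for a purchase order.
PURCHASE_ORDER = {
    "type": "object",
    "properties": {
        "order_number": {"type": "string"},
        "buyer_name": {"type": "string"},
        "delivery_date": {"type": ["string", "null"]},
    },
    # Drawn too generously: delivery_date is demanded on every document.
    "required": ["order_number", "buyer_name", "delivery_date"],
}

# Hand-labelled ground truth for a few documents (illustrative values).
LABELLED_SAMPLES = [
    {"order_number": "PO-1", "buyer_name": "Acme Ltd", "delivery_date": "2024-05-01"},
    {"order_number": "PO-2", "buyer_name": "Birch LLC", "delivery_date": None},
    {"order_number": "PO-3", "buyer_name": "Cobalt Inc"},
]


def unsupported_required(schema: dict, samples: list) -> list:
    """Required fields that at least one labelled document does not carry."""
    return sorted(
        name
        for name in schema["required"]
        if any(sample.get(name) is None for sample in samples)
    )


def demote(schema: dict, names: list) -> dict:
    """Return a copy of the schema with the named fields made optional."""
    kept = [name for name in schema["required"] if name not in names]
    return {**schema, "required": kept}


to_demote = unsupported_required(PURCHASE_ORDER, LABELLED_SAMPLES)
honest_schema = demote(PURCHASE_ORDER, to_demote)
```

The audit is one-sided: a sample can show that a field is sometimes missing, but a clean sample cannot show that it is always present. That judgment still has to come from knowing the source.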

A field description is extraction logic, not documentation

The field name is a label. The description is where the extraction is actually specified, because at generation time the model treats it as the working definition of what it is hunting for. Written off as a note to future maintainers, the description squanders the most direct control you have over what the model pulls out.

Consider what a bare name leaves open. A field labeled something like company name seems perfectly clear right up until a single document mentions the parent corporation in the header, one of its subsidiaries in a clause halfway down, and a product brand in the fine print. Each is a company and each is a name, and the label alone gives the model no way to know which the field wants. So it decides for itself, and it decides differently as the wording drifts from one document to the next. That ambiguity was never the model’s to settle. It was a call the schema’s author chose not to make, deferred to generation time, where it resurfaces as output that will not hold still.

A description that earns its keep settles that question before it can turn into a wrong value. Its first job is to say which candidate the field wants when several present themselves, so that selection is fixed by the author rather than improvised by the model under the pressure of whatever a given document emphasizes. From there it pins down the exact rendering expected, so that dates, identifiers, and status codes arrive in one canonical form instead of a handful that all merely happen to parse. And it draws the edges explicitly, spelling out what qualifies and what falls outside, so the model is not left inventing a boundary every time an awkward case appears. Each of those is a decision the description makes once, in place of a decision the model would otherwise remake, unevenly, on every input.
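As an illustration of those three jobs, here is the same field written twice in a JSON-Schema-style fragment. The wording of the description, including the choice of the signature-block entity, is a hypothetical authoring decision and not a recommendation; the small helper simply lists fields that were left with no description at all.

```python
# The bare version: a label and a type, nothing more.
BARE = {"company_name": {"type": "string"}}

# The same field with the selection, the rendering, and the edges spelled out.
PRECISE = {
    "company_name": {
        "type": "string",
        "description": (
            # Which candidate wins when several appear.
            "Legal name of the entity that signs the agreement as supplier. "
            "If a parent corporation, a subsidiary, and a product brand all "
            "appear, use the entity named in the signature block. "
            # The exact rendering expected.
            "Copy the name exactly as written, including suffixes such as "
            "'Ltd' or 'Inc'. "
            # What falls outside the field.
            "Trading names and product brands do not qualify."
        ),
    }
}


def fields_without_description(properties: dict) -> list:
    """List fields whose description is missing or blank."""
    return sorted(
        name
        for name, spec in properties.items()
        if not spec.get("description", "").strip()
    )


bare_gaps = fields_without_description(BARE)
precise_gaps = fields_without_description(PRECISE)
```

A check like this catches only the crudest failure, a description that is absent. Whether a description that exists is actually precise enough still has to be tested against documents with competing candidates.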

The failure modes are just these virtues withheld. Leave the description vague and the model is forced back into the guess the description was supposed to remove. Say nothing about format and the values will look right while quietly disagreeing with each other, one document’s date rendered one way and the next document’s another, both parsing cleanly, the divergence surfacing only when something downstream tries to compare them. That is the schema’s failing, not the model’s. Leave the edges undrawn and every borderline case becomes the model’s to adjudicate, freshly and differently each time. In each case a decision that belonged to the author at design time has been handed back to the model at generation time. Precision here is not polish. It is the point at which extraction accuracy is actually determined.
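The format failure in particular is easy to detect once a canonical form has been chosen. The sketch below assumes the description asked for ISO 8601 calendar dates (YYYY-MM-DD); that choice and the sample values are hypothetical.

```python
import re
from datetime import date

# Strict pattern first, because date.fromisoformat accepts additional
# ISO 8601 variants on newer Python versions.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_canonical_date(value: str) -> bool:
    """True only for a real calendar date written exactly as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Values that might come back when the description says nothing about format.
extracted = ["2024-04-03", "03/04/2024", "April 3, 2024", "2024-02-30"]
non_canonical = [value for value in extracted if not is_canonical_date(value)]

print(non_canonical)
```

A lenient date parser would accept most of these strings without complaint, which is why the divergence only shows up when values are compared. The strict check surfaces it at extraction time.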

Deciding what missing means

When the data for a field is genuinely absent, the schema has to say what absent looks like, and this is a design decision rather than an implementation detail to settle later. There are two coherent strategies, and they trade against each other.

The first is to let absence show. When nothing is found, the field comes back empty or drops out of the object altogether, and that emptiness is itself a signal: a reader can tell, unambiguously, the difference between a value that was located and one that never existed. Any code that keys its behavior on whether the field is populated can act on that signal directly. The price is that every such reader has to cope gracefully with the empty or absent case, and that discipline has to be enforced rather than hoped for.

The second is to fill the gap with a stand-in, an empty string or a fixed placeholder that means nothing was here. The output then has one consistent shape, which is easier to consume, but the shape has swallowed the very distinction the first strategy preserved. A placeholder and a genuinely extracted value are now indistinguishable, and nothing downstream can pull them back apart.

Neither is the right answer on its own. Which one fits turns on two properties of the system consuming the output: whether it needs to keep telling “found nothing” apart from “found something,” and whether it can weather an empty or absent field without falling over. What cannot be justified is declining to choose, because a question the schema leaves open does not stay open. The model closes it, call by call and not the same way twice, at exactly the point in the pipeline where you have the least leverage.
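The trade between the two strategies can be shown in a few lines. The field name, the placeholder value, and the two example documents below are hypothetical.

```python
from typing import Optional

PLACEHOLDER = "N/A"  # hypothetical stand-in value


def show_absence(found: Optional[str]) -> dict:
    """Strategy one: a missing value stays visibly missing."""
    return {"reference_code": found}


def fill_placeholder(found: Optional[str]) -> dict:
    """Strategy two: a missing value is replaced by a fixed stand-in."""
    return {"reference_code": PLACEHOLDER if found is None else found}


def was_found(record: dict) -> bool:
    """What a downstream reader can still tell from the record alone."""
    return record["reference_code"] is not None


# Two hypothetical documents: one prints the literal text "N/A" in the field,
# the other has no such field anywhere.
printed_na = "N/A"
not_present = None

visible = (show_absence(printed_na), show_absence(not_present))
filled = (fill_placeholder(printed_na), fill_placeholder(not_present))

print(visible[0] == visible[1])  # False: the two documents stay distinct
print(filled[0] == filled[1])    # True: the distinction is gone
```

Under the first strategy every consumer must handle `None`. Under the second no consumer has to, but `was_found` now answers yes for a document that contained nothing.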

Nesting reflects the document, only as far as the task needs

Real documents have structure, and a schema is free to follow it. When some part of the document is a self-contained thing with several attributes of its own, that thing earns a nested object of its own: a contracting party, carrying its legal name, the capacity in which it signs, and a registered address; a single line on an invoice, carrying what was sold, how many, and at what unit price. When the document holds a run of such things whose number is not fixed, an array of those objects is the right container. Nesting used this way is simply an accurate rendering of what is actually on the page.

Two disciplines keep that rendering honest. The first: every nested object deserves the same deliberate required-versus-optional judgment as the top level. A field does not become safe to require merely because the object enclosing it is present. The pressure to fabricate reaches one level down intact, and it hides better there, which is exactly why it gets skipped. The second: recognize the point past which more structure stops paying for itself. Depth has a price on both sides. It taxes maintenance and testing, because each additional level multiplies the situations a test has to reach. And it taxes reliability, because the deeper and more tangled the structure the model must keep straight, the less dependably it fills the structure in. Go as deep as the extraction truly demands and no deeper, and collapse any layer that exists for tidiness rather than because the data genuinely relates that way. Every unnecessary level is one more chance for the model to merge two things that should stay separate, or to file a nested block under the wrong parent.
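Both disciplines can be supported by small inspection helpers. The sketch below assumes a JSON-Schema-style dictionary; the invoice schema and both helper functions are hypothetical. One lists every required field at every level so that nested requirements get the same scrutiny as top-level ones, and the other reports how deep the structure goes.

```python
# Hypothetical invoice schema with one nested object and one array of objects.
INVOICE = {
    "type": "object",
    "properties": {
        "seller": {
            "type": "object",
            "properties": {
                "legal_name": {"type": "string"},
                "registered_address": {"type": "string"},
            },
            "required": ["legal_name"],
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price_cents": {"type": "integer"},
                },
                "required": ["description", "quantity"],
            },
        },
    },
    "required": ["seller", "line_items"],
}


def required_paths(schema: dict, prefix: str = "") -> list:
    """List every required field at every nesting level as a dotted path."""
    if schema.get("type") == "array":
        return required_paths(schema.get("items", {}), prefix + "[]")
    paths = [f"{prefix}.{name}" if prefix else name
             for name in schema.get("required", [])]
    for name, sub in schema.get("properties", {}).items():
        if sub.get("type") in ("object", "array"):
            child = f"{prefix}.{name}" if prefix else name
            paths.extend(required_paths(sub, child))
    return paths


def nesting_depth(schema: dict) -> int:
    """Count object levels; arrays pass through to their items."""
    if schema.get("type") == "array":
        return nesting_depth(schema.get("items", {}))
    if schema.get("type") != "object":
        return 0
    children = schema.get("properties", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


all_required = required_paths(INVOICE)
depth = nesting_depth(INVOICE)
```

Reading the flattened list makes a nested requirement such as `line_items[].quantity` as visible as a top-level one, which is the point at which it can be questioned.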

Test the schema against the inputs that break it

A schema is code, and the parts of it worth testing are precisely the parts a clean development set will never exercise. Extraction that works on tidy, complete examples tells you almost nothing about how the schema behaves on the varied, partial, awkward inputs production delivers, and those are where the design decisions above are actually load-bearing.

Three tests aim straight at the failure modes. The first withholds a field’s data entirely and checks that what returns is an honest blank rather than something the model invented to avoid leaving the slot open. It is the direct probe of the required-versus-optional line, and the surest way to catch a field you marked required that the source cannot always furnish. The second builds a document in which two values each have a legitimate claim on the same field, then checks whether the description supplies enough guidance to land on the intended one. Run over a handful of such documents, it turns description quality into a pass rate you can watch, where it would otherwise stay hidden until the wrong candidate reaches production. The third drives every typed field out to its extremes, the empty string and the absurdly long one, zero and a negative, and confirms the schema still behaves rather than quietly assuming everything lands in the easy middle of the range.

None of these conditions shows up dependably in whatever inputs you happened to build against, which is the whole reason the failures they expose slip through to production untested. A modest collection of these adversarial cases, kept alongside each schema and each one aimed at a single failure mode, earns its keep the first time it catches the model quietly fabricating a value for a required field on the one document shape no one thought to try.
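The three tests might be kept as small functions that take any extractor as an argument. Everything in the sketch below is hypothetical: `toy_extract` is a regex stand-in for a real model call, `fabricating_extract` imitates a model that fills a required slot regardless, and the documents are invented. With a real model, which may answer differently from call to call, each check would presumably be repeated and the pass rate recorded.

```python
import re


def toy_extract(document: str) -> dict:
    """Hypothetical stand-in for a model call: regex rules, not a real extractor."""
    supplier = re.search(r"Supplier: (.+)", document)
    po = re.search(r"PO: (\S+)", document)
    qty = re.search(r"Quantity: (-?\d+)", document)
    return {
        "supplier_name": supplier.group(1).strip() if supplier else None,
        "po_number": po.group(1) if po else None,
        "quantity": int(qty.group(1)) if qty else None,
    }


def fabricating_extract(document: str) -> dict:
    """Imitates an extractor that invents a PO number when none exists."""
    record = toy_extract(document)
    record["po_number"] = record["po_number"] or "PO-0000"
    return record


def returns_blank_when_absent(extract, document: str, field: str) -> bool:
    """Test one: withheld data must come back as an honest blank."""
    return extract(document).get(field) is None


def picks_intended_candidate(extract, document: str, field: str, intended) -> bool:
    """Test two: with competing candidates, the intended one must win."""
    return extract(document).get(field) == intended


def boundary_failures(extract, cases: dict) -> list:
    """Test three: return labels of the extreme inputs on which extraction raised."""
    failed = []
    for label, document in cases.items():
        try:
            extract(document)
        except Exception:
            failed.append(label)
    return failed


NO_PO = "Supplier: Birch LLC\nQuantity: 3"
COMPETING = "Parent: Acme Holdings\nSupplier: Acme Logistics Ltd\nBrand: QuickShip"
BOUNDARIES = {
    "empty": "",
    "very_long": "Supplier: " + "x" * 100_000,
    "zero": "Quantity: 0",
    "negative": "Quantity: -5",
}

honest_passes = returns_blank_when_absent(toy_extract, NO_PO, "po_number")
fabricator_passes = returns_blank_when_absent(fabricating_extract, NO_PO, "po_number")
candidate_passes = picks_intended_candidate(
    toy_extract, COMPETING, "supplier_name", "Acme Logistics Ltd"
)
failed_boundaries = boundary_failures(toy_extract, BOUNDARIES)
```

Each function targets a single failure mode, so a failing check points directly at the schema decision that needs revisiting.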

The most controllable factor you have

Extraction reliability rests on three things, and exactly one of them is yours. The model you can influence but never fully govern. The inputs come as they come, uneven, incomplete, now and then malformed. The schema alone is entirely within your control, which is the precise reason it, and not the other two, is where reliability is won or lost. Every design choice is a lever inside that one controllable factor. A description written with care removes the ambiguity that shifting phrasing would otherwise leave for the model to resolve. A required set drawn honestly refuses to demand values the input was never going to contain. Holding a field to a fixed set of permitted values keeps free text out of the places where later logic makes decisions. Nesting kept to what the data actually warrants stops distinct things from bleeding into one another. None of these announces itself as important, and every one of them tilts the odds of correct, repeatable output.
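The fixed-set lever is the simplest of these to show. The sketch below assumes a JSON-Schema-style `enum` keyword and a hypothetical payment-status field; the permitted values are invented for illustration.

```python
from enum import Enum


class PaymentStatus(Enum):
    """Hypothetical permitted values for a status field."""
    PAID = "paid"
    UNPAID = "unpaid"
    PARTIAL = "partial"


# The schema fragment draws its permitted values from the same definition
# the downstream code branches on, so the two cannot drift apart.
STATUS_FIELD = {"type": "string", "enum": [status.value for status in PaymentStatus]}


def rejected(values: list) -> list:
    """Values that fall outside the permitted set."""
    return [value for value in values if value not in STATUS_FIELD["enum"]]


# Free-text variants that an unconstrained string field would let through.
free_text = ["paid", "Paid in full", "settled"]
outside_set = rejected(free_text)

# Downstream logic can branch on a closed set instead of parsing prose.
is_paid = PaymentStatus("paid") is PaymentStatus.PAID
```

A closed set only helps where the source really does fall into those categories. If documents regularly carry a status outside the list, the set needs another member or the field needs to allow absence.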

The trap in all of it is that a bad schema gives no sign of being bad at the moment you write it. It compiles, it passes validation, it handles every example you had in front of you. The bill comes due later and somewhere else, paid by code downstream that was never designed to shoulder it, or hidden inside fabricated values that never once tripped an exception. Because the model consults the schema as it writes, getting it right is not tidying-up done after the real work. It is the real work of governing what the model produces, and the only honest measure of it is not how it fares on the clean case but whether it holds against the messy, partial, uncooperative inputs production is certain to deliver.