
Structured outputs: the unsexy LLM skill that actually ships products
Article Summary
Everyone obsesses over prompts and agents. The skill that separates a demo from a product is getting reliable, schema-conforming output. Here's the whole craft.
A student showed me a project last month that did everything right, except the one thing that mattered. He had a clever prompt, a retrieval step, even a little agent loop. The demo was genuinely impressive. Then he tried to wire the model's answer into a downstream function and the whole thing fell apart, because the model returned this:
```text
Sure! Here's the data you asked for:
{ "name": "Acme Corp", "tier": "enterprise" }
Let me know if you'd like me to adjust anything!
```

His parser was doing json.loads(response) and choking on "Sure! Here's the data you asked for:". So he'd bolted on a regex to find the curly braces. Then the model started wrapping the JSON in a ```json fence, so he patched the regex again. Then one day it returned "tier": "Enterprise" with a capital E and his downstream if tier == "enterprise" silently did the wrong thing for a week.
This is the part of LLM engineering nobody makes YouTube thumbnails about. Prompts are fun. Agents are exciting. But the skill that actually moves a project from "cool demo" to "thing real users depend on" is boring and unglamorous: getting the model to return output that conforms to a schema, every single time, so the next piece of code can trust it. That's the whole post. Let's do it properly.
Why free-text parsing breaks (and keeps breaking)
The failure isn't that the model is dumb. It's that natural language is a terrible API contract. A language model's default job is to produce plausible text, and "here's a friendly sentence before the JSON" is extremely plausible text. You're fighting the model's training every time you ask it to suppress that instinct with a prompt like "ONLY return JSON, no preamble, no markdown."
That prompt works most of the time. "Most of the time" is the problem. If your extraction step is 97% reliable, and a workflow chains three of them, you're at roughly 0.97³ — about 91% end-to-end. Run that a thousand times a day and you're cleaning up dozens of failures daily, by hand, forever. Reliability that's almost there is the most expensive kind, because it's good enough to ship and bad enough to haunt you.
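The compounding is easy to check for yourself. Here is a minimal sketch, assuming each step fails independently and using the illustrative figures from above (97% per step, three steps, a thousand runs a day). The function name is hypothetical:

```python
def end_to_end_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent failures."""
    return per_step ** steps


reliability = end_to_end_reliability(0.97, 3)
daily_runs = 1_000
expected_failures = daily_runs * (1 - reliability)

print(f"end-to-end reliability: {reliability:.1%}")         # end-to-end reliability: 91.3%
print(f"expected failures per day: {expected_failures:.0f}")  # expected failures per day: 87
```

Changing the step count or per-step figure shows how quickly "almost reliable" erodes as a chain gets longer.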
The fix is to stop asking and start constraining. Modern APIs let you hand the model a schema and have the platform guarantee the output matches it. Two ways exist, and they're genuinely different mechanisms.
JSON mode vs. structured outputs: not the same thing
People use "JSON mode" and "structured outputs" interchangeably and then get burned, so let's separate them clearly. On OpenAI's API:
- JSON mode (response_format: { "type": "json_object" }) guarantees the output is syntactically valid JSON. That's it. It does not guarantee any particular keys, types, or shape. The model could return {} or {"foo": 1} and JSON mode is satisfied. OpenAI's own docs now treat this as the legacy path.
- Structured Outputs (response_format: { "type": "json_schema", ... } with strict: true) guarantees the output conforms to your specific JSON Schema: the right keys, the right types, the enum values you defined. This is the one you want for anything that feeds another function.
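To see why syntax alone is not enough, here is a small offline sketch with no API call. The payloads and the check_shape helper are hypothetical stand-ins for model output and downstream code. Both payloads would satisfy JSON mode, but only one has the shape the downstream code expects:

```python
import json

ALLOWED_TIERS = {"free", "pro", "enterprise"}


def check_shape(raw: str) -> list[str]:
    """Return a list of shape problems; an empty list means the payload is usable."""
    data = json.loads(raw)  # succeeds for any syntactically valid JSON
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    if not isinstance(data.get("name"), str):
        problems.append("missing or non-string 'name'")
    tier = data.get("tier")
    if not isinstance(tier, str) or tier not in ALLOWED_TIERS:
        problems.append("missing or unknown 'tier'")
    return problems


good = '{"name": "Acme Corp", "tier": "enterprise"}'
wrong_shape = '{"foo": 1}'

print(check_shape(good))         # no problems reported
print(check_shape(wrong_shape))  # both fields reported as problems
```

A schema-enforcing response format moves this kind of check out of your code and into the generation step.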
The difference matters because valid-JSON-but-wrong-shape is the kind of bug that passes every smoke test and breaks in production. Here's the structured-outputs call in raw form, which I think is worth seeing once before the SDK hides it from you:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # or a newer snapshot; 4o-2024-08-06+ supports this
    messages=[
        {"role": "system", "content": "Extract the company and tier from the user's message."},
        {"role": "user", "content": "We just closed Acme Corp on the enterprise plan."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "company_record",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "tier": {"type": "string", "enum": ["free", "pro", "enterprise"]},
                },
                "required": ["name", "tier"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)
# {"name":"Acme Corp","tier":"enterprise"}
```

Notice "tier" is an enum. The model now cannot return "Enterprise" with a capital E; that whole category of bug is gone, not mitigated. That's the difference between asking and constraining.
How the three real mechanisms compare
There are three distinct tools in the modern toolbox, and they get conflated constantly. Here's how I keep them straight:
| Mechanism | What it guarantees | When I reach for it |
|---|---|---|
| JSON mode (json_object) | Valid JSON syntax only | Almost never now; superseded |
| Structured Outputs (json_schema + strict) | Output matches your exact schema | Data extraction, classification, anything feeding code |
| Tool / function calling (strict: true) | The arguments the model generates for a function call match a schema | When the model should do something, not just return something |
The third one trips people up. Function calling and structured outputs feel like rivals; they're not. Structured outputs shape the model's reply to you. Tool calling shapes the arguments the model hands to a function it wants you to run. You can enable strict: true on a function definition to get the same schema guarantee on those arguments. Use response-format when you want data back; use tool-calling when you want an action triggered with validated parameters.
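As a sketch of the tool-calling side, here is what a strict function definition and a local dispatcher might look like. The tool dict follows the shape OpenAI's Chat Completions API uses for function tools. The upgrade_account function, its parameters, and the dispatcher are hypothetical, and the arguments string stands in for what the model would return in a tool call, so nothing here touches the network:

```python
import json

# Tool definition in the Chat Completions "function tool" shape, with strict enabled.
UPGRADE_TOOL = {
    "type": "function",
    "function": {
        "name": "upgrade_account",
        "description": "Move a customer account to a new pricing tier.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "account_name": {"type": "string"},
                "tier": {"type": "string", "enum": ["free", "pro", "enterprise"]},
            },
            "required": ["account_name", "tier"],
            "additionalProperties": False,
        },
    },
}


def upgrade_account(account_name: str, tier: str) -> str:
    """Hypothetical business function the model is allowed to trigger."""
    return f"{account_name} -> {tier}"


HANDLERS = {"upgrade_account": upgrade_account}


def dispatch(tool_name: str, arguments_json: str) -> str:
    """Run the handler for a tool call; the arguments arrive as a JSON string."""
    handler = HANDLERS[tool_name]
    return handler(**json.loads(arguments_json))


# Stand-in for tool_call.function.name and tool_call.function.arguments.
print(dispatch("upgrade_account", '{"account_name": "Acme Corp", "tier": "enterprise"}'))
# Acme Corp -> enterprise
```

In a real call you would pass tools=[UPGRADE_TOOL] to the API and feed each returned tool call through a dispatcher like this one.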
Anthropic does the same thing, with different parameter names
If you're on Claude rather than OpenAI, the capability exists and is no longer experimental. Anthropic launched Structured Outputs in public beta on November 14, 2025, and it's since gone generally available on the Claude API. The mechanism is the genuinely interesting part: rather than asking nicely, it compiles your JSON Schema into a grammar and constrains token generation during inference, so the model literally cannot emit a token that would violate the schema.
The parameter names differ from OpenAI's, so don't copy-paste blindly. On Claude the format lives under output_config.format (an earlier beta used a top-level output_format, which still works during a transition window):
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "We just closed Acme Corp on the enterprise plan."}
    ],
    output_config={
        "format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "tier": {"type": "string", "enum": ["free", "pro", "enterprise"]},
                },
                "required": ["name", "tier"],
                "additionalProperties": False,
            },
        }
    },
)

print(response.content[0].text)
```

Same idea as OpenAI's strict mode (a hard schema guarantee), wearing different clothes. Claude also supports strict tool use (strict: true on a tool's input_schema) for the function-calling case. Both Anthropic SDKs accept Pydantic (Python) and Zod (TypeScript) schema definitions, which is the bridge to the next section.
Pydantic: define the schema once, validate on the way back
Writing JSON Schema by hand gets old fast, and it's a second source of truth you have to keep in sync with whatever class your code actually uses. AI engineering in practice leans hard on Pydantic precisely because it lets you define the shape once, as a normal Python class, and reuse it as both the schema you send and the validator you check against.
The OpenAI Python SDK has a .parse() helper that takes a Pydantic model directly, generates the strict schema for you, and hands back a parsed object:
```python
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class Tier(str, Enum):
    free = "free"
    pro = "pro"
    enterprise = "enterprise"


class CompanyRecord(BaseModel):
    name: str
    tier: Tier


completion = client.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the company and tier."},
        {"role": "user", "content": "We just closed Acme Corp on the enterprise plan."},
    ],
    response_format=CompanyRecord,
)

record = completion.choices[0].message.parsed  # a CompanyRecord instance
print(record.tier)  # Tier.enterprise
```

record isn't a dict you hope is shaped right; it's a validated CompanyRecord with a real enum. Pydantic v2 (current as of the 2.13.x line) does this validation through a Rust core, so the cost is negligible. Your editor autocompletes record.tier. Your type checker catches it if you typo the field. This is the moment the LLM output stops being a string and becomes data.
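The same class also works offline, which makes the "define once" idea easy to verify. This is a minimal sketch using Pydantic v2's model_json_schema() and model_validate_json(); the raw JSON strings are made-up stand-ins for model output. The schema Pydantic emits here is the plain one, and the SDK's .parse() helper may adjust it for strict mode:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Tier(str, Enum):
    free = "free"
    pro = "pro"
    enterprise = "enterprise"


class CompanyRecord(BaseModel):
    name: str
    tier: Tier


# 1. The class doubles as the schema you send.
schema = CompanyRecord.model_json_schema()
print(schema["required"])  # the fields that have no default

# 2. It is also the validator you check against.
record = CompanyRecord.model_validate_json('{"name": "Acme Corp", "tier": "enterprise"}')
print(record.tier is Tier.enterprise)

# A casing slip that a plain dict would let through is rejected here.
try:
    CompanyRecord.model_validate_json('{"name": "Acme Corp", "tier": "Enterprise"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Keeping one class as the single source of truth means a renamed field or a new tier changes the contract and the validator in the same commit.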
The retry loop you still need
Here's the honest part nobody likes: even with strict mode, you are not 100% home free. Strict mode guarantees structure, not semantics. The model can return a perfectly schema-valid object that's still wrong: a date in the wrong format inside a string field, a number that's out of the range you care about, a required field it filled with a confident hallucination. Strict mode also doesn't necessarily enforce every JSON Schema keyword. Support for constraints such as pattern, format, minLength, and minimum varies by provider and model version, so check the current docs before relying on any of them. Until you've confirmed a keyword is enforced, treat it as documentation, not a constraint.
So you add your own semantic validation in Pydantic, and you wrap the call in a bounded retry that feeds the error back to the model:
```python
from openai import OpenAI
from pydantic import BaseModel, ValidationError, field_validator

client = OpenAI()


class Invoice(BaseModel):
    total_cents: int
    currency: str

    @field_validator("total_cents")
    @classmethod
    def non_negative(cls, v: int) -> int:
        if v < 0:
            raise ValueError("total_cents must be >= 0")
        return v

    @field_validator("currency")
    @classmethod
    def iso_4217(cls, v: str) -> str:
        # isalpha() rules out digits, which isupper() alone would let through.
        if len(v) != 3 or not v.isalpha() or not v.isupper():
            raise ValueError("currency must be a 3-letter uppercase ISO code")
        return v


def extract_invoice(text: str, max_attempts: int = 3) -> Invoice:
    messages = [
        {"role": "system", "content": "Extract the invoice total in cents and ISO currency."},
        {"role": "user", "content": text},
    ]
    last_error = None
    for _ in range(max_attempts):
        try:
            # .parse() validates the reply against Invoice itself, so a failed
            # field_validator raises ValidationError from this call.
            completion = client.chat.completions.parse(
                model="gpt-4o-2024-08-06",
                messages=messages,
                response_format=Invoice,
            )
        except ValidationError as e:
            last_error = e
            # Feed the failure back so the next attempt can correct itself.
            messages.append({
                "role": "user",
                "content": f"Your last reply failed validation: {e}. Fix it and return valid output.",
            })
            continue
        msg = completion.choices[0].message
        # Structured outputs can also refuse for safety; check before using the result.
        if getattr(msg, "refusal", None):
            raise RuntimeError(f"Model refused: {msg.refusal}")
        if msg.parsed is not None:
            return msg.parsed
        last_error = "response contained no parsed content"
    raise RuntimeError(f"Gave up after {max_attempts} attempts: {last_error}")
```

Two details I want to flag, because students miss both. First, that refusal check is real: when you send user-generated content, the model may decline for safety, and a refusal doesn't follow your schema. OpenAI surfaces it in a dedicated refusal field so you can detect it programmatically instead of crashing. Second, the retry is bounded. An unbounded "keep trying until it works" loop is how you turn one bad input into a $400 API bill. Three attempts, then fail loudly and let a human look.
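Because the loop's behaviour matters more than the API call inside it, it helps to be able to exercise it without a network. Below is a sketch of the same bounded-retry idea using only the standard library, with the model call injected as a plain function. The names bounded_retry, validate_invoice, and fake_model are hypothetical, and the stub simply returns a bad answer first and a good one second:

```python
import json
from typing import Callable


def validate_invoice(raw: str) -> dict:
    """Parse and semantically check one reply; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    total = data.get("total_cents")
    if not isinstance(total, int) or total < 0:
        raise ValueError("total_cents must be an integer >= 0")
    currency = data.get("currency")
    if not (isinstance(currency, str) and len(currency) == 3
            and currency.isalpha() and currency.isupper()):
        raise ValueError("currency must be a 3-letter uppercase ISO code")
    return data


def bounded_retry(call_model: Callable[[list[str]], str], max_attempts: int = 3) -> dict:
    """Call the model at most max_attempts times, passing earlier errors back as feedback."""
    feedback: list[str] = []
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_model(feedback)
        try:
            return validate_invoice(raw)
        except ValueError as e:
            last_error = str(e)
            feedback.append(f"That failed validation: {e}. Fix it and return valid output.")
    raise RuntimeError(f"Gave up after {max_attempts} attempts: {last_error}")


# Stub model: the first reply has a lowercase currency, the second is correct.
replies = iter([
    '{"total_cents": 1250, "currency": "usd"}',
    '{"total_cents": 1250, "currency": "USD"}',
])
feedback_sizes = []


def fake_model(feedback: list[str]) -> str:
    feedback_sizes.append(len(feedback))  # record how much feedback each call saw
    return next(replies)


result = bounded_retry(fake_model)
print(result)
print(feedback_sizes)
```

Injecting the model call this way lets you test both the happy path and the "give up loudly" path in an ordinary unit test.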
Failure modes I see again and again
Strict mode kills the structural bugs. These semantic ones survive it, so learn to spot them:
- Enum drift. Before you constrain the field, the model invents categories ("premium", "Enterprise", "ENTERPRISE_TIER") that look right and break your equality checks. The fix is literally to use an enum in the schema. This is the single highest-leverage change you can make.
- Optional-field hallucination. Make a field optional and the model will often fill it with a plausible-sounding value rather than leaving it null, because "fill in the blank" is what it was trained to do. If a field genuinely might not be present, model it explicitly as nullable and tell the model in the prompt when to use null; don't just hope it abstains.
- Confident wrong values in valid slots. A date is a string; strict mode is happy with "next Tuesday" in a date field. If the value has to be machine-usable, validate it in Pydantic and retry. The schema guarantees the box; you guarantee what's in the box.
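The last two of these can be handled directly in the model definition. Here is a sketch, assuming Pydantic v2 and a hypothetical Renewal record: the optional field is declared as explicitly nullable but still required, and the date is typed as a real date so that free-text values fail validation instead of slipping through as strings.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError


class Renewal(BaseModel):
    customer: str
    # Nullable but required: the reply must contain the key, with a value or null.
    discount_code: Optional[str]
    # Typed as a date, so free text such as "next Tuesday" is rejected.
    renewal_date: date


ok = Renewal.model_validate_json(
    '{"customer": "Acme Corp", "discount_code": null, "renewal_date": "2025-03-04"}'
)
print(ok.discount_code, ok.renewal_date.year)

try:
    Renewal.model_validate_json(
        '{"customer": "Acme Corp", "discount_code": null, "renewal_date": "next Tuesday"}'
    )
    caught = False
except ValidationError as e:
    caught = True
    print(e.errors()[0]["loc"])  # points at the offending field
```

The prompt still has to say when null is the right answer; the type only makes sure a null is representable and a vague date is not.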
When structured output beats reaching for an agent
This is the framing I most want a learner to take away. The instinct, when a task feels hard, is to reach for an agent — a loop that plans, calls tools, reflects, retries. Agents are powerful and they are also slow, expensive, non-deterministic, and a nightmare to debug.
A huge fraction of the tasks people build agents for are not multi-step reasoning problems at all. They're one-shot structured-extraction problems wearing a trench coat: classify this ticket, pull these five fields out of this email, turn this paragraph into a row. For those, a single strict structured-output call is faster, cheaper, fully testable, and you can actually unit-test the schema. Reach for the agent when the task genuinely requires the model to decide what to do next based on intermediate results. If the shape of the output is known in advance, you don't need an agent — you need a schema. (And if your task is "answer questions from my documents," that's retrieval, which I've written about separately — different tool, same discipline of not over-engineering.)
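As an illustration of the "testable" claim, a ticket classifier's contract can be pinned down and unit-tested before any model is involved. The TicketTriage model, its categories, and parse_triage below are hypothetical, and the tests use hand-written JSON in place of model replies:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class TicketTriage(BaseModel):
    category: Literal["billing", "bug", "feature_request", "other"]
    urgent: bool
    summary: str


def parse_triage(raw: str) -> TicketTriage:
    """Single entry point the rest of the code uses to turn a reply into data."""
    return TicketTriage.model_validate_json(raw)


def test_accepts_known_category() -> None:
    t = parse_triage('{"category": "billing", "urgent": true, "summary": "Charged twice"}')
    assert t.category == "billing" and t.urgent is True


def test_rejects_invented_category() -> None:
    try:
        parse_triage('{"category": "payments", "urgent": false, "summary": "Charged twice"}')
    except ValidationError:
        return
    raise AssertionError("invented category should have been rejected")


test_accepts_known_category()
test_rejects_invented_category()
print("schema tests passed")
```

Nothing comparable exists for a multi-step agent loop, where the thing under test is a sequence of decisions, not a shape.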
The recap
- JSON mode guarantees syntax; structured outputs guarantee your schema. Use the latter (json_schema + strict: true on OpenAI, output_config.format on Claude) for anything that feeds code.
- Define the schema once in Pydantic and reuse it as both the contract you send and the validator you check, via .parse().
- Strict mode guarantees structure, not meaning. Add semantic validators and a bounded retry loop that feeds the error back. Check for refusals.
- Watch for enum drift, optional-field hallucination, and valid-but-wrong values; these survive strict mode.
- A known output shape means you need a schema, not an agent. Save the agent for tasks that genuinely branch.
None of this is flashy. That's the point — the reliable, schema-conforming plumbing is exactly what separates the projects that ship from the demos that don't.
If you're building something real and the gap between "works in my notebook" and "I trust it in production" is where you keep getting stuck, this is the kind of thing I work through with people one-on-one — your actual schema, your actual failure modes, on your actual code. The first session's free if you'd like to bring a project to AI engineering and pick it apart together. No pressure either way.
Written by Ali Jabbary
M.Sc., P.Eng. • Expert Data Scientist & ML Engineer with 10+ years of experience. 500+ students helped worldwide. Specializing in Python, AI/ML, and turning complex problems into simple solutions.


