
How to Force a Local LLM to Return Clean JSON with Ollama Structured Outputs

Stop parsing messy model text by hand. Ollama's structured outputs constrain any local model to a JSON schema you define.

By Sam Carter

If you have ever tried to extract structured data from a local model, you know the pain: you ask for JSON, and the model hands back JSON wrapped in an apology, a code fence, and a closing paragraph.

Quick answer

Pass a JSON schema in the format parameter of an Ollama /api/chat or /api/generate request, and Ollama compiles it into a grammar that constrains decoding so the model can only emit tokens that keep the output valid. In Python, let Pydantic generate the schema with model_json_schema() and validate the reply with model_validate_json(); in JS, use Zod. Set temperature to 0, still ask for JSON in the prompt, and validate every reply; a completed response should then parse as JSON that matches your schema. Requires Ollama v0.5.0 or newer.

Ollama's structured outputs feature fixes this at the source. Instead of nudging the model with prompts and praying, it constrains generation itself so the model can only produce text that matches the shape you asked for. The result is parseable JSON from nearly any model in the library, with a handful of exceptions covered under the known sharp edges further down.

Key takeaways

  • Structured outputs work by constrained decoding, not prompting: Ollama converts your JSON schema into a grammar and zeroes out the probability of any token that would break it.
  • You enable it by passing a JSON schema (or the string "json") in the format parameter of a /api/chat or /api/generate request.
  • In Python, let Pydantic generate the schema with model_json_schema() and validate the reply with model_validate_json(); in JavaScript, use Zod.
  • Set temperature to 0 and still ask for JSON in the prompt. The grammar controls shape; the prompt controls intent and field quality.
  • Watch the sharp edges in 2026, these two above all: deeply nested or recursive schemas can degrade output, and disabling thinking on some thinking-capable models can silently drop the format constraint.

What "structured outputs" actually does

A structured output is not a prompt trick. When you pass a JSON schema to Ollama, the runtime compiles it into a grammar and constrains token sampling so that every generated token keeps the output valid against that schema. Under the hood Ollama builds on llama.cpp's GBNF (a grammar format that defines exactly which tokens are legal next), and since the v0.5 release it generates that grammar automatically from whatever JSON schema you send. Invalid tokens are effectively given zero probability, so the model physically cannot wander off-format.
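
To make the "zero probability" idea concrete, here is a toy sketch of token masking. It is not Ollama's or llama.cpp's actual code: the three candidate tokens, their scores, and the function name constrained_softmax are invented for illustration, and a real grammar tracks far more state than a single set of allowed tokens.

```python
import math

def constrained_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Return next-token probabilities with grammar-disallowed tokens forced to zero."""
    # Mask: any token the grammar rejects gets a score of negative infinity.
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    peak = max(masked.values())
    if peak == -math.inf:
        raise ValueError("the grammar allows none of the candidate tokens")
    # Softmax over the masked scores; exp(-inf) is exactly 0.0.
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for the first token of a reply. Unconstrained, the model
# prefers to open with chatty prose or a code fence.
logits = {"Sure": 3.0, "```": 2.0, "{": 1.0}

# At the start of a JSON object, the grammar only permits an opening brace.
probs = constrained_softmax(logits, allowed={"{"})
print(probs)
```

Even though "{" had the lowest raw score, it is the only token left with any probability after masking, which is why the prose preamble can never be sampled.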

That distinction matters. A prompt that says "respond only with JSON" is a request the model may ignore; constrained decoding is a hard rule enforced during sampling. It can also have a pleasant side effect: a constrained reply carries no preamble, code fence, or sign-off, so it is usually shorter and can finish sooner than free-form text for the same task, even though checking the grammar adds a little work per token.

You enable it by sending a JSON schema in the format parameter of a chat or generate request. That is the whole mechanism: define the shape, pass it in, parse the response.

Note

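Structured outputs require Ollama v0.5.0 or newer. Update the server, confirm the build with ollama --version, then run ollama pull for your model before testing. It is the server build, not the model, that compiles JSON schema to a grammar, so pulling a model alone will not add the feature to an older install.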
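
If you want to check the version from code rather than by eye, a sketch along these lines may help. It assumes the server exposes GET /api/version returning a JSON object with a "version" field and that it listens on the default port 11434; the helper names are hypothetical, and pre-release suffixes such as "-rc1" are simply ignored.

```python
import json
import re
import urllib.request
from typing import Tuple

MIN_VERSION = (0, 5, 0)  # first release with schema-constrained output, per the note above

def parse_version(text: str) -> Tuple[int, ...]:
    """Pull the first major.minor.patch triple out of a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())

def supports_structured_outputs(version_text: str) -> bool:
    """True when the reported version is at least MIN_VERSION."""
    return parse_version(version_text) >= MIN_VERSION

def fetch_server_version(base_url: str = "http://localhost:11434") -> str:
    """Ask a running server for its version (assumed endpoint: /api/version)."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return json.load(resp)["version"]

# Usage against a running server:
# print(supports_structured_outputs(fetch_server_version()))
```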

The fastest path: Python with Pydantic

The recommended approach in Python is to describe your data with a Pydantic model and let it generate the schema for you. The workflow is three steps: define the model, hand its schema to format, then validate the response back into a typed object.

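from ollama import chat
from pydantic import BaseModel

# The Pydantic model is the single source of truth for the output shape.
class Book(BaseModel):
    title: str
    author: str
    year: int
    genres: list[str]

response = chat(
    model="qwen3",
    messages=[
        {"role": "user", "content": "Describe the novel Dune. Return as JSON."}
    ],
    # Pass the generated JSON schema so Ollama constrains decoding to it.
    format=Book.model_json_schema(),
    # Temperature 0 keeps extraction deterministic.
    options={"temperature": 0},
)

# Parse and validate the reply into a typed Book; raises on a bad payload.
book = Book.model_validate_json(response.message.content)
print(book.title, book.year)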

Two details carry the weight here. First, Book.model_json_schema() produces the exact JSON schema Ollama needs, so you never hand-write or maintain it. Second, model_validate_json() parses and validates the response into a real typed object, which means a wrong type or missing field fails loudly instead of slipping through into the rest of your program.
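
To see what "fails loudly" looks like in practice, the sketch below runs the same Book model against hand-written reply strings instead of a live model, so it works without a server. The parse_book helper and the sample payloads are made up for illustration; it assumes Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class Book(BaseModel):
    title: str
    author: str
    year: int
    genres: list[str]

def parse_book(raw: str) -> Optional[Book]:
    """Validate a raw model reply; report each problem and return None on failure."""
    try:
        return Book.model_validate_json(raw)
    except ValidationError as err:
        # Each entry names the offending field (loc) and the reason (msg).
        for problem in err.errors():
            print(problem["loc"], problem["msg"])
        return None

# A well-formed reply becomes a typed object.
good = parse_book(
    '{"title": "Dune", "author": "Frank Herbert", "year": 1965, "genres": ["science fiction"]}'
)

# Missing fields and a non-numeric year are rejected rather than passed along.
bad = parse_book('{"title": "Dune", "year": "unknown"}')

# Plain prose, as returned when the constraint was not applied, fails the same way.
prose = parse_book("Sure! Here is some information about Dune.")
```

Returning None is only one option; in a pipeline you might prefer to re-raise, log, or retry the request.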

Calling the raw HTTP API

If you are not in Python, the pattern is identical: the format field accepts any valid JSON schema object. Here is the same idea against the HTTP endpoint with curl, constraining the model to return an array of strings under a required key.

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "List two planets as JSON"}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "planets": {
        "type": "array",
        "items": { "type": "string" }
      }
    },
    "required": ["planets"]
  }
}'

Because format takes a standard JSON schema, you can constrain nested objects, arrays, enums, and required fields exactly as you would for any API contract. In Node, the same flow works with Zod: define a schema, serialize it with a JSON-schema helper, and pass the result to format. If you are weighing which runtime to standardize on for this kind of work, our comparison of local inference engines covers where Ollama fits against vLLM and raw llama.cpp.
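
If you would rather not add a client library, the curl request translates to the Python standard library fairly directly. This is a sketch under a few assumptions: the server is at the default http://localhost:11434, the non-streaming /api/chat response carries the text under message.content, and the helper names build_chat_payload and send_chat are invented here. The format_spec argument accepts either a schema object or the string "json".

```python
import json
import urllib.request
from typing import Union

PLANETS_SCHEMA = {
    "type": "object",
    "properties": {
        "planets": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["planets"],
}

def build_chat_payload(model: str, prompt: str, format_spec: Union[dict, str]) -> dict:
    """Assemble the request body; format_spec is a JSON schema or the string "json"."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "format": format_spec,
        "options": {"temperature": 0},
    }

def send_chat(payload: dict, base_url: str = "http://localhost:11434") -> dict:
    """POST the payload to /api/chat and decode the JSON the model produced."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    # The model's text is itself a JSON document, so it needs a second decode.
    return json.loads(body["message"]["content"])

payload = build_chat_payload("qwen3", "List two planets as JSON", PLANETS_SCHEMA)

# Usage against a running server:
# print(send_chat(payload)["planets"])
```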

Constraining the values, not just the shape

Constrained decoding guarantees the output is valid JSON. It does not guarantee the values are correct. Three habits close that gap.

Keep the temperature at zero

Set temperature to 0 for extraction and classification. Deterministic sampling reduces the chance the model invents a plausible-looking value purely to satisfy a required field. Ollama's own guidance leans on determinism for exactly this reason.

Still ask for JSON in the prompt

Even with the grammar enforced, a short instruction like "Return as JSON" helps the model understand the task and tends to produce better-populated fields rather than empty placeholders. The schema controls the structure; the prompt controls the intent.

Use enums for classification

For labeling tasks, define an enum in your schema so the model can only return one of your allowed values. This is far more robust than parsing a free-text label out of a sentence. One caveat: enum honoring has been imperfect in some builds, so always validate the returned label against your allowed set rather than assuming the grammar caught it.
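
A minimal sketch of that enum-plus-validation pattern, using only the standard library. The ticket labels, TICKET_SCHEMA, and parse_label are hypothetical names chosen for the example; the schema is what you would pass as format, and the function is the check you run on whatever comes back.

```python
import json

# Hypothetical label set for a support-ticket classifier.
ALLOWED_LABELS = ("billing", "bug", "feature_request")

# Schema to pass in the format parameter: one required string field with an enum.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": list(ALLOWED_LABELS)}
    },
    "required": ["label"],
}

def parse_label(raw: str) -> str:
    """Decode the reply and confirm the label is one we actually allow."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on prose
    label = data.get("label") if isinstance(data, dict) else None
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label

print(parse_label('{"label": "bug"}'))
```

Because json.JSONDecodeError subclasses ValueError, a caller can catch ValueError once to handle both a non-JSON reply and an out-of-set label.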

Tip

Always run the response through a validator (model_validate_json in Pydantic, a Zod parse in JS). The grammar enforces structure; validation catches the semantic edge cases, such as an empty array where you expected items, before bad data reaches the rest of your app.

Schema versus prompt: who controls what

A frequent source of confusion is which lever does which job. The grammar and your prompt are not interchangeable; they handle different failure modes:

| Concern | Controlled by | What it guarantees |
| --- | --- | --- |
| Output is valid JSON | The compiled grammar | Brackets, types, required keys are well-formed |
| Output matches your shape | The JSON schema you pass | Fields, nesting, arrays match your contract |
| Values are sensible | The prompt + temperature 0 | Better-populated, less-invented field values |
| Label is from an allowed set | Enum + your own validation | One of your values (verify, enums can slip) |
| Bad data never propagates | Pydantic / Zod validation | Loud failure on wrong type or empty array |

Known sharp edges in 2026

Structured outputs are reliable, but a few rough spots are worth knowing before they bite you in production.

  • Deep nesting and recursion. Ollama's docs explicitly warn that deeply nested or recursive JSON structures may produce degraded or incomplete results. Flatten your schema where you can, and split very large extractions into smaller calls.
  • Thinking models and think=false. On some thinking-capable models, disabling the reasoning step has been reported to silently drop the format constraint, so the model returns plain prose instead of JSON. If a model ignores your schema, check whether thinking is involved before blaming the schema itself.
  • Enum adherence. Community reports show enum values occasionally not being honored. Treat enums as a strong hint plus a hard validation check, not an absolute guarantee.
  • No direct GBNF passthrough. Ollama generates the grammar for you from JSON schema and does not expose raw GBNF, so anything you cannot express in JSON schema is currently out of reach.
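
One way to act on the "split very large extractions into smaller calls" advice is to carve a wide schema into several narrow ones and merge the replies. The sketch below handles only flat, top-level properties and does not carry over $defs or nested references; split_schema and the receipt fields are hypothetical.

```python
from typing import Iterator

def split_schema(schema: dict, max_fields: int) -> Iterator[dict]:
    """Yield smaller object schemas, each holding at most max_fields properties."""
    if max_fields < 1:
        raise ValueError("max_fields must be at least 1")
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    names = list(props)
    for start in range(0, len(names), max_fields):
        chunk = names[start:start + max_fields]
        yield {
            "type": "object",
            "properties": {name: props[name] for name in chunk},
            # Keep a field required only in the chunk that contains it.
            "required": [name for name in chunk if name in required],
        }

# Hypothetical receipt-extraction schema with five top-level fields.
RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "merchant": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["merchant", "total"],
}

parts = list(split_schema(RECEIPT_SCHEMA, max_fields=2))
for part in parts:
    print(list(part["properties"]))
```

Each part can be sent as its own format value, and the decoded replies combined with dict.update. The trade-off is more requests and no cross-field context within a single call.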

When to reach for it

Structured outputs shine anywhere you need machine-readable data from a model: pulling fields out of receipts or emails, classifying support tickets, building a tool-calling layer, or generating config another program will consume. It pairs especially well with small language models, which are cheap to run locally and benefit most from having their formatting decisions taken off the table. It is also a natural fit in retrieval pipelines, where you can use it to produce clean, typed metadata alongside your RAG chunking strategies.

Because it all runs on your local Ollama install, you get this reliability with no API bill and no data leaving your machine, which is exactly why people run models locally in the first place. The shift is small but meaningful: instead of writing brittle regex to claw structure out of prose, you declare the shape once and let the runtime enforce it.

Frequently asked questions

Does the format parameter accept anything besides a full JSON schema?

Yes. You can pass the string "json" to force generic valid-JSON output without specifying a shape, or pass a complete JSON schema object to constrain the exact structure. The schema route is almost always what you want, since it gives you typed, predictable fields instead of arbitrary JSON.

Will structured outputs slow my requests down?

Often the opposite. A constrained reply carries no preamble, code fence, or closing commentary, so it is usually shorter than free-form text for the same prompt and can finish sooner. The grammar-checking overhead is typically small compared to the tokens you save.

Do I have to use Pydantic or Zod?

No. They are conveniences that generate the schema and validate the response for you. You can hand-write a JSON schema and pass it directly to format, then parse the reply with any JSON parser. Pydantic and Zod simply give you type safety and loud failures for free.
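
For comparison, here is roughly what the library-free route looks like: a hand-written schema (the same one as the curl example) and a parser built on the json module. The shape checks that Pydantic or Zod would do for you have to be written by hand; parse_planets is a hypothetical helper.

```python
import json

# Hand-written schema, passed as-is in the format parameter.
PLANETS_SCHEMA = {
    "type": "object",
    "properties": {
        "planets": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["planets"],
}

def parse_planets(raw: str) -> list[str]:
    """Decode the reply and check by hand what a validator would check for us."""
    data = json.loads(raw)
    if not isinstance(data, dict) or "planets" not in data:
        raise ValueError("reply is missing the required 'planets' key")
    planets = data["planets"]
    if not isinstance(planets, list) or not all(isinstance(p, str) for p in planets):
        raise ValueError("'planets' must be a list of strings")
    return planets

print(parse_planets('{"planets": ["Mars", "Venus"]}'))
```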

Why is the model returning plain text even though I set a schema?

The usual culprits are running an Ollama build older than v0.5.0, or hitting the thinking-model interaction where disabling thinking drops the constraint. Confirm your version, keep thinking enabled on reasoning models, and simplify deeply nested schemas. For agent workflows that mix this with persistent state, see our notes on agent memory.
