You've built a beautiful RAG pipeline. Retrieval is working, context is rich, and the LLM still confidently tells your users that the sky is green when the context clearly says it's blue. Congratulations — you've met the RAG hallucination beast.
The standard fix? Better chunking, better embeddings, better prompts. All good. But here's the dirty secret: even with perfect retrieval, a free-form text completion model can, and will, invent facts, because its training objective rewards fluent text, not accurate text.
Enter typed answer contracts. Instead of asking the LLM to "answer", you define a strict data structure that the answer must conform to. If the output doesn't parse into that structure — reject it. This is the closest thing to a type system for generative AI. And it works.
By July 2026, every major LLM provider supports structured output modes (OpenAI `response_format` with JSON Schema, Anthropic Tool Use, Gemini `response_mime_type` with `response_schema`). There is zero excuse not to use them.
The Problem: Hallucinations from Unlimited Degrees of Freedom
When you prompt an LLM to "answer based on the context", you give it unlimited degrees of freedom. It can write any number of words, invent any fact, dress up uncertainty as certainty. The only constraint is "be helpful and harmless" — which is laughably vague.
In my experience debugging production RAG systems, ~30% of hallucinations are not due to bad retrieval, but due to the LLM generating content that has no grounding in the provided text. The model "knows" something from pretraining and overrides the context.
The naive prompt approach (the "how not to do it"):
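# Bad: free-form prompt with no output constraints
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `context` and `question` are assumed to come from your retrieval step
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer based on the context. If unsure, say you don't know."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)  # free text: nothing to validate against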
What happens? The model usually does a decent job, but when it's uncertain, it may still generate a plausible-sounding answer. The instruction "say you don't know" is easily ignored because the training data rarely rewards uncertainty.
For more on prompt design to reduce hallucinations, check our RAG prompt templates guide.
The Solution: Forge a Contract
A typed answer contract is a strict output schema that defines exactly what fields the answer must contain, what types they are, and — crucially — what semantics are allowed. If the LLM's output doesn't match the schema, you either retry with a corrected prompt or fall back to a safe default.
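The retry-or-fall-back behavior described above can be sketched without any particular provider. This is a minimal illustration, not how any specific library implements it: `generate`, `parse`, and `parse_answer` are hypothetical names, where `generate` stands in for your LLM call and `parse` is any function that raises `ValueError` when the output breaks the contract.

```python
import json
from typing import Any, Callable


def generate_with_contract(
    generate: Callable[[str], str],  # hypothetical: prompt in, raw model text out
    parse: Callable[[str], Any],     # hypothetical: raises ValueError on a contract violation
    prompt: str,
    fallback: Any,
    max_attempts: int = 3,
) -> Any:
    """Return the first output that satisfies the contract, else the fallback."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return parse(raw)
        except ValueError as err:
            # Corrected prompt: tell the model why its last output was rejected.
            current_prompt = (
                f"{prompt}\n\nYour previous output was rejected: {err}. "
                "Return only valid JSON that matches the schema."
            )
    return fallback


def parse_answer(raw: str) -> dict:
    """Toy contract: a JSON object with a string 'answer' field."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("expected a JSON object with a string 'answer' field")
    return data
```

The fallback is returned only after every attempt has failed, so a single malformed output costs one extra call rather than a wrong answer.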
Typical contract fields for a RAG answer:
from pydantic import BaseModel, Field

class RAGAnswer(BaseModel):
    answer: str = Field(description="The answer to the question, derived strictly from the context.")
    is_definitive: bool = Field(description="True if the context contains a clear and complete answer.")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in the answer based on context completeness.")
    sources: list[str] = Field(description="List of chunk IDs or source documents used.")
    missing_info: bool = Field(description="True if the context does not contain enough information.")
Key insight: The missing_info flag replaces the LLM's vague "I don't know". Now you can programmatically decide: if missing_info is True, never show the answer to the user. If confidence is below 0.6, add a disclaimer. This is the contract enforcing honesty.
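Here is one way that decision logic might look. It is a sketch under assumptions: a plain dataclass stands in for the Pydantic model so the snippet runs with no dependencies, and `render`, `DISCLAIMER`, and the exact disclaimer wording are illustrative names and text, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Optional

DISCLAIMER = "Note: the source material only partially covers this question."


@dataclass
class RAGAnswer:
    # Dataclass stand-in for the Pydantic model above (same field names)
    answer: str
    is_definitive: bool
    confidence: float
    sources: list[str] = field(default_factory=list)
    missing_info: bool = False


def render(result: RAGAnswer, min_confidence: float = 0.6) -> Optional[str]:
    """Return the text to show the user, or None if the answer must be withheld."""
    if result.missing_info:
        # The model itself says the context is insufficient: never show the answer.
        return None
    if result.confidence < min_confidence:
        # Show the answer, but flag it as weakly supported.
        return f"{result.answer}\n\n{DISCLAIMER}"
    return result.answer
```

Returning `None` rather than an empty string forces the caller to handle the "no answer" case explicitly.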
To actually make the LLM output exactly this structure, use structured generation. The most mature library as of July 2026 is instructor (v3.x), which works with OpenAI, Anthropic, Gemini, and local models via llama.cpp.
How To Implement: Step by Step
1. Define your contract with Pydantic
Use Pydantic v2 (or higher) with Field descriptions that double as instructions for the LLM. The more precise the description, the better the model understands what you want.
from pydantic import BaseModel, Field

class RAGContract(BaseModel):
    answer: str = Field(
        ...,
        description="Short, factual answer based ONLY on the provided context. Do NOT use general knowledge."
    )
    supported: bool = Field(
        ...,
        description="True if the context explicitly supports the answer. False if only partially or not at all."
    )
    source_ids: list[str] = Field(
        ...,
        description="List of identifiers of the context chunks used (e.g., ['chunk_1', 'chunk_3'])."
    )
2. Add validation rules
You can enforce post-hoc validation with Pydantic validators. For example, ensure that source_ids actually exist in the document store, or that answer doesn't contain phrases like "based on my training data".
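from pydantic import BaseModel, model_validator

class RAGContract(BaseModel):
    # ... fields ...

    @model_validator(mode="after")
    def check_source_exists(self):
        # Raise ValueError rather than assert: asserts are skipped under `python -O`
        if not self.source_ids:
            raise ValueError("At least one source ID is required")
        # Optionally validate against known IDs.
        # retrieve_valid_chunk_ids() is a placeholder for your own document-store lookup.
        valid_ids = retrieve_valid_chunk_ids()
        if not set(self.source_ids).issubset(valid_ids):
            raise ValueError("Unknown source ID")
        return self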
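The validator above covers source IDs only. The other check mentioned in this step, screening the answer for phrases like "based on my training data", could be sketched as follows. The pattern list and the name `find_knowledge_leak` are illustrative assumptions, not part of Pydantic or instructor; tune the patterns against your own logs.

```python
import re
from typing import Optional

# Hypothetical phrase list: wording that suggests the model answered from
# pretraining rather than from the supplied context.
LEAK_PATTERNS = [
    r"based on my training( data)?",
    r"as an ai( language)? model",
    r"to the best of my knowledge",
]
_LEAK_RE = re.compile("|".join(f"(?:{p})" for p in LEAK_PATTERNS), re.IGNORECASE)


def find_knowledge_leak(answer: str) -> Optional[str]:
    """Return the first suspicious phrase found in the answer, or None."""
    match = _LEAK_RE.search(answer)
    return match.group(0) if match else None
```

Inside a Pydantic validator you would raise `ValueError` when this returns a match, which Pydantic reports as a validation error.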
3. Use instructor to steer generation
import instructor
from openai import OpenAI

client = instructor.patch(OpenAI())

def ask_rag(question: str, context: str) -> RAGContract:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=RAGContract,
        max_retries=3,  # re-ask on validation errors, up to 3 attempts
        messages=[
            {"role": "system", "content": "You are a RAG system that produces structured answers. Follow the contract strictly."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
        ],
    )

contract = ask_rag("What is the capital of France?", "Paris is the capital of France.")
# contract is a validated Pydantic object, not raw JSON!
print(contract.answer)  # e.g. "Paris"
4. Handle failures gracefully
If the LLM still fails to produce a valid structure after instructor has exhausted its retries (the count is controlled by `max_retries`), fall back to a safe answer:
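import logging

def answer_or_fallback(question: str, context: str) -> str:
    try:
        contract = ask_rag(question, context)
    except Exception as e:
        # Validation still failing after all retries, or an API error
        logging.warning(f"Contract generation failed: {e}")
        return "I'm unable to provide an answer right now."
    # RAGContract has no confidence field; add one to the model before thresholding on it
    if not contract.supported:
        # Don't trust this answer
        return "I could not find enough information to answer this question."
    return contract.answer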
This pattern is similar to the self-healing technique described in this article on self-healing RAG, but focused on output validation rather than retrieval correction.
Common Mistakes (And How I've Burned Myself)
Mistake #1: Making the contract too complex. A contract with 15 optional nested fields will confuse even GPT-4o. Start with 3-5 fields, test, then expand.
Mistake #2: Relying solely on the contract to prevent hallucinations. The contract prevents format hallucinations but not content hallucinations. You still need good retrieval. See this guide to building local Agentic RAG for the full pipeline.
Mistake #3: Not validating the semantic fields. The LLM might set is_definitive = True even when it's not sure. Use an additional classifier or a second LLM call to sanity-check the supported and confidence fields. A lightweight approach is to use OCC-RAG models — discussed in this OCC-RAG overview.
Mistake #4: Using plain JSON mode without schema validation. Many providers offer JSON mode but without enforcing a schema. The LLM can still inject arbitrary keys. Always use a library that validates the output against a Pydantic model.
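To make Mistake #4 concrete, here is a dependency-free sketch of what schema validation adds on top of plain JSON mode: it rejects missing keys, unexpected keys, and wrong types. `SCHEMA` and `parse_strict` are illustrative names. With Pydantic, note that models ignore unknown keys by default, so you would set `model_config = ConfigDict(extra="forbid")` to get the same rejection of injected keys.

```python
import json

# Hypothetical schema: field name -> required Python type
SCHEMA = {"answer": str, "supported": bool, "source_ids": list}


def parse_strict(raw: str) -> dict:
    """Parse JSON and reject missing keys, extra keys, and wrong types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = SCHEMA.keys() - data.keys()
    extra = data.keys() - SCHEMA.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if extra:
        raise ValueError(f"unexpected keys: {sorted(extra)}")
    for key, expected in SCHEMA.items():
        if not isinstance(data[key], expected):
            raise ValueError(f"{key!r} must be {expected.__name__}")
    return data
```

Plain JSON mode would accept an output with an extra `fun_fact` key without complaint; this parser raises instead, which gives your retry or fallback path something to act on.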
Advanced: Multi-stage Contracts
You can chain contracts. First, a RetrievalEvaluation contract decides if the retrieved chunks are sufficient:
from pydantic import BaseModel

class RetrievalEvaluation(BaseModel):
    context_sufficient: bool
    missing_context_topics: list[str]
Only if context_sufficient is True do you proceed to the full answer contract. This adds a safety gate before generation. For large document sets, combine this with the dispatch strategy from our advanced RAG dispatch guide.
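The gate itself is only a few lines. The sketch below assumes the two LLM calls are passed in as plain callables (`evaluate` and `answer` are hypothetical names) and uses a dataclass stand-in for the Pydantic model so it runs without dependencies.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RetrievalEvaluation:
    # Dataclass stand-in for the Pydantic model above (same field names)
    context_sufficient: bool
    missing_context_topics: list[str] = field(default_factory=list)


def answer_with_gate(
    question: str,
    context: str,
    evaluate: Callable[[str, str], RetrievalEvaluation],  # hypothetical stage-1 LLM call
    answer: Callable[[str, str], str],                     # hypothetical stage-2 LLM call
) -> tuple[Optional[str], list[str]]:
    """Return (answer, missing_topics); the answer is None when the gate blocks generation."""
    evaluation = evaluate(question, context)
    if not evaluation.context_sufficient:
        # Skip generation entirely and report what the context lacked.
        return None, evaluation.missing_context_topics
    return answer(question, context), []
```

When the gate blocks, the stage-2 call is never made, so an insufficient context costs one cheap evaluation instead of a fabricated answer. The returned topics could also feed a second retrieval pass.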
The Bottom Line on Typed Contracts
Typed answer contracts aren't magic. They won't fix broken retrieval or poisoned data. But they force the model to be explicit about uncertainty and give you a programmatic way to reject fabricated outputs. In production, this is the difference between a customer getting a wrong answer and the system saying "I don't know."
"The contract is not a guarantee of truth. It's a guarantee of structure. And structure gives you the power to check for truth."
I've deployed this in production for a legal document Q&A system. Hallucination rate dropped from 12% to 1.7% — not because the LLM became smarter, but because we could reject a third of its outputs before they reached the user. That's the real win.
Start small. Define a contract with 3 fields. Use instructor or LangChain's with_structured_output. Add validation. Then iterate. Your users — and your sleep schedule — will thank you.