AI Agents · LLM · Production · LangChain

How I Build Production AI Agents (Not Demos)

April 10, 2026·6 min read·By Waqas Raza

Most AI agent demos work great in a notebook. They fail in production because the same shortcuts that make demos fast — skipping validation, ignoring cost, assuming tools always succeed — are the exact things production punishes.

Here is how I approach every agent I ship.

The failure modes that kill agents in production

Before designing anything, I map the ways the system can go wrong:

  1. Tool failure — an external API is down, rate-limited, or returns unexpected data
  2. Cost runaway — a loop adds tokens on every step; a $0.10 request becomes $80
  3. Hallucinated tool calls — the model invents arguments or calls tools that don't exist
  4. Context explosion — conversation history grows until you hit the context window
  5. Silent wrong answers — the agent confidently returns plausible but incorrect output

Every design decision I make targets one of these.

Tool design: minimal, typed, and idempotent

Each tool should do exactly one thing. A tool called search_knowledge_base should search. Not search, then summarize, then format. Compound tools are harder to validate and easier for the model to hallucinate.

Every tool gets:

  • A typed input schema — validated with Pydantic or Zod before the model sees it
  • Idempotency — calling it twice with the same input is safe
  • A defined error contract — tools return structured errors, not exceptions that bubble into the agent loop
class SearchInput(BaseModel):
    query: str = Field(..., min_length=3, max_length=500)
    top_k: int = Field(default=5, ge=1, le=20)

@tool
def search_knowledge_base(input: SearchInput) -> SearchResult:
    """Search the knowledge base. Returns up to top_k relevant chunks."""
    try:
        results = vector_db.query(input.query, k=input.top_k)
        return SearchResult(chunks=results, query=input.query)
    except VectorDBError as e:
        return SearchResult(chunks=[], error=str(e))

The model sees the schema, not the implementation. Good schema descriptions cut hallucinated arguments by a large margin.

Cost control: caps, not hope

Every agent I build has explicit cost caps at three levels:

Per-step cap: max tokens per LLM call. Set via max_tokens on the model call — not as a prompt instruction the model can ignore.

Per-run cap: max number of iterations. In LangGraph, this is recursion_limit. In LangChain, it is max_iterations. Set it to something that makes sense for the task, not a large default.
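The per-run cap amounts to a bounded loop with an explicit exit status. A minimal sketch of the idea, independent of any framework (`run_agent` and the `done` flag are illustrative names, not a real API):

```python
MAX_TOKENS_PER_STEP = 1024  # per-step cap, passed as max_tokens on each model call
MAX_ITERATIONS = 8          # per-run cap (recursion_limit / max_iterations equivalent)

def run_agent(step_fn, state: dict) -> dict:
    """Minimal agent loop with a hard iteration cap."""
    for _ in range(MAX_ITERATIONS):
        state = step_fn(state)
        if state.get("done"):
            return state
    # Cap hit: stop with an explicit status instead of looping forever.
    state["status"] = "iteration_limit_reached"
    return state
```

The point is that the cap produces a structured outcome the caller can handle, not an exception from deep inside the loop.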

Per-user/per-day cap: tracked in Redis. Each agent run records its token usage. If a user hits their budget, the run is declined before it starts — not halfway through.

def check_budget(user_id: str, estimated_tokens: int) -> bool:
    key = f"budget:{user_id}:{today()}"
    # INCRBY is atomic, so two concurrent runs cannot both slip under the cap
    new_total = redis.incrby(key, estimated_tokens)
    if new_total == estimated_tokens:
        # first write today: start the 24-hour window once, not on every call
        redis.expire(key, 86400)
    if new_total > DAILY_TOKEN_LIMIT:
        # refund the reservation and decline the run
        redis.decrby(key, estimated_tokens)
        return False
    return True

Cost surprises kill trust. Hard caps prevent them.

Guardrails: validate the output, not just the input

Input validation catches bad tool calls. Output validation catches bad answers.

For every agent I build, I define what a valid output looks like — as a schema, not prose. Then I validate it.

For structured outputs, this is straightforward: use with_structured_output and a Pydantic model. For text outputs, I validate against a set of rules: minimum length, absence of certain patterns (model apologies, hedging phrases that signal the model is guessing), presence of required fields.

If output validation fails, I retry once with a corrective prompt. If it fails again, I return a structured error rather than a bad answer.
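The retry-once loop looks roughly like this. A minimal sketch for text outputs, assuming a hypothetical `call_model` function; the specific rules and hedging patterns are illustrative, not a fixed list:

```python
import re

# Patterns that signal the model is apologizing or guessing (illustrative)
FORBIDDEN_PATTERNS = [r"(?i)\bas an ai\b", r"(?i)\bi'm sorry\b", r"(?i)\bi cannot\b"]
MIN_LENGTH = 50

def validate_output(text: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output is valid."""
    errors = []
    if len(text) < MIN_LENGTH:
        errors.append(f"answer shorter than {MIN_LENGTH} characters")
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, text):
            errors.append(f"matched forbidden pattern: {pattern}")
    return errors

def run_with_validation(call_model, prompt: str) -> dict:
    answer = call_model(prompt)
    errors = validate_output(answer)
    if errors:
        # One corrective retry that tells the model exactly what was wrong.
        corrective = (
            f"{prompt}\n\nYour previous answer was rejected: "
            f"{'; '.join(errors)}. Answer again, fixing these issues."
        )
        answer = call_model(corrective)
        errors = validate_output(answer)
    if errors:
        # Still invalid: return a structured error, never a bad answer.
        return {"status": "error", "errors": errors}
    return {"status": "ok", "answer": answer}
```

The corrective prompt names the violated rules, which gives the retry a much better chance than simply asking again.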

Observability: log everything

Every tool call gets a log entry: timestamp, tool name, input, output, latency, token count, cost estimate. I store these in a runs table with a thread_id.

@contextmanager
def trace_tool_call(tool_name: str, run_id: str, payload: dict):
    """Record one tool call; the caller fills `payload` with input/output/tokens."""
    start = time.monotonic()
    try:
        yield payload
    finally:
        latency_ms = (time.monotonic() - start) * 1000
        db.insert("tool_calls", {
            "run_id": run_id,
            "tool": tool_name,
            "input": payload.get("input"),
            "output": payload.get("output"),
            "tokens": payload.get("tokens"),
            "latency_ms": latency_ms,
            "timestamp": utcnow(),
        })

When something breaks in production, this is the difference between a 20-minute debug session and a 3-day investigation.

Failure handling: graceful, not silent

Agents fail. The question is whether they fail gracefully.

My rule: never let a tool exception propagate into the agent loop unhandled. Exceptions become structured error objects that the model can reason about — "the search returned an error: rate limited. Try again in 30 seconds." — rather than stack traces that crash the run.

For retriable failures (rate limits, transient network errors), I wrap tools with exponential backoff. For non-retriable failures (bad credentials, invalid input), I return immediately with a clear error.
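The backoff wrapper is a few lines. A sketch, assuming tools raise a dedicated exception type for transient failures (`RetriableToolError` is an illustrative name); non-retriable exceptions pass straight through:

```python
import random
import time

class RetriableToolError(Exception):
    """Transient failure: rate limit, network blip. Safe to retry."""

def with_backoff(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """Wrap a tool call with jittered exponential backoff on retriable errors."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except RetriableToolError:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface as a structured error upstream
                # Jittered exponential delay: ~0.5s, ~1s, ~2s...
                time.sleep(base_delay * 2 ** attempt * (1 + random.random() * 0.1))
    return wrapped
```

Jitter matters when many runs hit the same rate limit at once: without it, all the retries land together and trip the limit again.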

For the overall run, I set a timeout. If an agent run takes longer than its SLA, it is cancelled and the user gets a partial result with a clear status — not a hanging request.
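One way to enforce the run-level SLA, assuming the agent exposes an async entry point; `run_with_sla` and `partial_state` are hypothetical names for the pattern, not a framework API:

```python
import asyncio

async def run_with_sla(agent_coro, sla_seconds: float, partial_state: dict) -> dict:
    """Cancel the run at the SLA and return whatever state was accumulated."""
    try:
        result = await asyncio.wait_for(agent_coro, timeout=sla_seconds)
        return {"status": "complete", "result": result}
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task; report partial progress, not a hang
        return {"status": "timed_out", "partial": partial_state}
```

The caller gets a definite status either way, which is what lets the UI show "here's what we have so far" instead of spinning.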


This is a set of patterns, not a checklist. The right approach depends on the use case. But the decision to take cost control, observability, and failure handling seriously — rather than treating them as polish to add later — is what separates agents you can ship from agents you demo once.

If you are building a production AI agent and want someone who has shipped these patterns in real systems, reach out on Upwork.

About the author

Waqas Raza

AI-Native Full-Stack Engineer. Top Rated on Upwork · $180K+ earned · 93% job success. I build production AI agents, LLM systems, Web3 platforms, and full-stack applications.

Hire me on Upwork