Tool-calling in production: where LLM agents actually break


The happy path for tool-calling agents looks clean. The model receives a user request, selects the right tool, constructs valid parameters, receives the result, and returns a useful response. In a demo environment, with three tools and a benign input, it's straightforward.

In production, with 20+ tools, ambiguous inputs, real data full of edge cases, and users who don't follow instructions, agent behavior is considerably less predictable. This article covers the failure modes we've observed running tool-calling agents on real business workflows, and what's worth doing about them.

The failure taxonomy we ended up with

After reviewing logs from several production deployments, we found most failures cluster into four categories. Not all are equally common or equally consequential, but they all appear with enough regularity that they're worth planning for.

1. Tool selection errors

The model calls the wrong tool. Not because the tools are poorly described, but because at scale the signal-to-noise ratio in tool definitions degrades. When you have a tool called get_customer_order and another called fetch_order_details, and a user asks "what did they order?", the model will sometimes pick based on name similarity rather than functional match.

This gets worse as you add tools. We've seen the reliability of tool selection degrade noticeably once 15–20 tools are available in a single context. The model doesn't fail obviously; it quietly picks a plausible-sounding option that returns incomplete results.

Mitigation approaches that actually help: grouping tools into namespaced categories, adding disambiguation examples to tool descriptions, and in more complex cases, implementing a two-stage routing step where a lightweight classifier narrows the candidate set before the agent decides.
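As a rough illustration of the two-stage routing idea, here is a minimal sketch. Everything in it is an assumption made for the example: the namespaced tool names, the keyword sets, and the keyword-overlap scoring, which stands in for whatever lightweight classifier you would actually use (an embedding lookup or a small model, for instance). The point is only the shape: stage one narrows the candidate set, and the agent then chooses among far fewer tools.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str            # namespaced as "<namespace>.<tool>" (hypothetical convention)
    description: str
    keywords: frozenset  # illustrative routing hints, not part of any real API


# Hypothetical registry; names and keywords are for illustration only.
TOOLS = [
    Tool("orders.get_customer_order", "List the orders placed by a customer.",
         frozenset({"order", "ordered", "purchase", "bought"})),
    Tool("orders.fetch_order_details", "Line items and status for one order ID.",
         frozenset({"order", "status", "items", "tracking"})),
    Tool("billing.get_invoice", "Fetch an invoice by invoice number.",
         frozenset({"invoice", "bill", "charge"})),
    Tool("support.open_ticket", "Open a support ticket.",
         frozenset({"ticket", "complaint", "issue"})),
]


def namespace_of(tool: Tool) -> str:
    return tool.name.split(".", 1)[0]


def narrow_candidates(query: str, tools=TOOLS, max_namespaces: int = 1) -> list:
    """Stage one: score namespaces by keyword overlap and keep the best ones."""
    words = set(re.findall(r"[a-z0-9_]+", query.lower()))
    scores = {}
    for tool in tools:
        ns = namespace_of(tool)
        scores[ns] = scores.get(ns, 0) + len(words & tool.keywords)

    matched = [ns for ns, score in scores.items() if score > 0]
    matched.sort(key=lambda ns: scores[ns], reverse=True)
    keep = set(matched[:max_namespaces])

    # If nothing matched, pass the full set through rather than guessing.
    if not keep:
        return list(tools)
    return [t for t in tools if namespace_of(t) in keep]
```

Note that routing alone does not resolve the get_customer_order versus fetch_order_details ambiguity: both tools survive stage one. It only shrinks the field so that disambiguation examples in the descriptions have less noise to compete with.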

2. Parameter hallucination

The model calls the right tool with invented or partially invented parameters. This is most common when required parameters aren't present in the user's input and the model fills them in rather than asking for clarification.

If a function requires a date_range parameter and the user didn't specify dates, some models will infer "last 30 days" and proceed. That might be fine. It also might return data that confidently answers a question no one asked.

The reliable fix is strict parameter validation with explicit error return types the model can interpret — and a system prompt that instructs the model to ask for missing required inputs rather than infer them. The permissive default is wrong for most business contexts.
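A minimal sketch of what strict validation with a model-readable error might look like. The schema format, the field names, and the error dictionary shape are all assumptions for illustration, not the interface of any particular framework. Note what it does and does not catch: it rejects absent or malformed parameters, but it cannot tell a user-supplied date range from a plausible one the model invented, which is why the system prompt instruction still matters.

```python
from typing import Any

# Hypothetical schema: parameter name -> (expected type, required?)
SALES_REPORT_SCHEMA = {
    "customer_id": (str, True),
    "date_range": (dict, True),
    "currency": (str, False),
}


def validate_params(params: dict, schema: dict) -> dict:
    """Return {"ok": True} or a structured error the model can act on."""
    missing = [name for name, (_, required) in schema.items()
               if required and name not in params]
    if missing:
        return {
            "ok": False,
            "error_type": "missing_required_parameter",
            "fields": missing,
            # Tell the model what to do next instead of leaving it to infer.
            "action": "ask_user",
        }

    unknown = [name for name in params if name not in schema]
    if unknown:
        return {"ok": False, "error_type": "unknown_parameter",
                "fields": unknown, "action": "fix_call"}

    wrong_type = [name for name, (expected, _) in schema.items()
                  if name in params and not isinstance(params[name], expected)]
    if wrong_type:
        return {"ok": False, "error_type": "invalid_parameter_type",
                "fields": wrong_type, "action": "fix_call"}

    return {"ok": True}
```

Returning the error as data, with an explicit next action, gives the model something more specific to respond to than a stack trace or a generic failure string.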

3. Retry loops

Tool returns an error. Model tries again with identical parameters. Tool returns same error. Repeat until context window is exhausted or token limit is hit.

This happens because most models have been trained to be persistent. An error response is treated as a recoverable obstacle rather than a stop condition. Without explicit retry logic and maximum attempt guardrails in your orchestration layer, loops are common.

The solution isn't complicated — implement retry caps at the orchestration level, classify error types (retryable vs. non-retryable), and include fallback paths for non-retryable failures. But "isn't complicated" doesn't mean teams do it. We've seen systems with no retry handling at all.
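The three pieces above (a retry cap, error classification, and a fallback path) fit in a few lines of orchestration code. This is a sketch under stated assumptions: `ToolError` and its `retryable` flag are hypothetical, standing in for however your tools report failures, and backoff between attempts is omitted to keep the example short.

```python
class ToolError(Exception):
    """Hypothetical error type; tools mark whether retrying could help."""

    def __init__(self, message: str, retryable: bool):
        super().__init__(message)
        self.retryable = retryable


def call_with_retry_cap(tool, params: dict, max_attempts: int = 3, fallback=None) -> dict:
    """Call a tool, retrying only retryable errors, up to max_attempts."""
    attempts = 0
    last_error = None
    while attempts < max_attempts:
        attempts += 1
        try:
            result = tool(**params)
            return {"status": "ok", "result": result, "attempts": attempts}
        except ToolError as err:
            last_error = err
            if not err.retryable:
                break  # Same input will fail the same way; stop immediately.

    # Either the cap was hit or the error was non-retryable.
    if fallback is not None:
        return {"status": "fallback",
                "result": fallback(params, last_error),
                "attempts": attempts}
    return {"status": "failed", "error": str(last_error), "attempts": attempts}
```

The cap lives in the orchestration layer on purpose: the model never gets the chance to decide whether a fourth identical attempt is worth trying.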

4. Context window pressure on multi-step tasks

Multi-step tool-calling agents accumulate context quickly. Each tool call and result takes tokens. By step six or seven of a complex task, you may be working with a degraded context where early instructions are no longer reliably influencing the model's behavior.

The symptoms are subtle: the model starts making different assumptions, stops following formatting instructions it was following earlier, or forgets constraints established at the start of the conversation. It's not a crash — it's a drift that's hard to detect without systematic output monitoring.

Context pressure doesn't look like failure. It looks like a slightly different agent than the one you designed, making slightly different decisions. That's worse, in some ways.
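One way to make this kind of drift visible is to run the same cheap checks against the output of every step and record where each one first fails. The sketch below assumes the constraints can be expressed as simple predicates over a step's text output; the check names and the JSON-only formatting rule are hypothetical examples, and real constraints are often harder to test mechanically.

```python
import json
from typing import Callable


def is_json_object(output: str) -> bool:
    """Example constraint: every step must emit a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def max_length(limit: int) -> Callable[[str], bool]:
    """Example constraint: output stays under a character limit."""
    return lambda output: len(output) <= limit


def find_drift(step_outputs: list, checks: dict) -> dict:
    """Map each failed check to the 1-based step where it first failed."""
    first_failure = {}
    for step, output in enumerate(step_outputs, start=1):
        for name, check in checks.items():
            if name not in first_failure and not check(output):
                first_failure[name] = step
    return first_failure


# Usage: the agent followed the format for two steps, then stopped.
outputs = ['{"status": "ok"}', '{"status": "ok"}', "Sure! The status is ok."]
drift = find_drift(outputs, {"json_object": is_json_object,
                             "short": max_length(200)})
# drift maps "json_object" to step 3; "short" never failed, so it is absent.
```

A check that passes early and fails late in the same run is the signature to look for, since it separates drift from an instruction the model never followed at all.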

What actually helps in practice

Some things that sound like solutions don't do much. Longer system prompts with more detailed instructions tend to help less than expected once you're under context pressure — the instructions are there but their weight decreases. Rephrasing tool descriptions rarely fixes persistent selection errors if the real problem is tool count.

What tends to work:

Reducing tool count per agent by splitting responsibilities across specialized agents with narrow tool sets
Explicit orchestration logic rather than relying on the model to manage multi-step flow
Structured output validation on every tool call, not just at final response time
Step-level logging that captures tool selections and parameters, not just final outputs
A separate evaluation harness that runs representative failure scenarios on every deployment
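To make the last two items concrete, here is a minimal sketch of a harness that replays failure scenarios against an agent and inspects the step-level trace rather than the final answer. It is not the evaluation framework mentioned later in this article. The `StepRecord` and `Scenario` shapes, and the assumption that the agent is a callable returning a list of step records, are invented for the example.

```python
from dataclasses import dataclass, asdict


@dataclass
class StepRecord:
    step: int
    tool: str
    params: dict


@dataclass
class Scenario:
    name: str
    user_input: str
    expected_first_tool: str
    forbidden_params: tuple = ()  # parameters the agent must not invent


def run_scenarios(agent, scenarios: list) -> list:
    """Run each scenario and report failures with the full step trace."""
    results = []
    for scenario in scenarios:
        trace = agent(scenario.user_input)  # assumed to return list[StepRecord]
        failures = []

        if not trace or trace[0].tool != scenario.expected_first_tool:
            failures.append("wrong_tool")

        invented = [p for record in trace for p in record.params
                    if p in scenario.forbidden_params]
        if invented:
            failures.append("invented_params")

        results.append({
            "scenario": scenario.name,
            "passed": not failures,
            "failures": failures,
            # Keep the trace so a failure can be diagnosed without a rerun.
            "trace": [asdict(record) for record in trace],
        })
    return results
```

Running a harness like this in the deployment pipeline turns "the agent seems fine" into a list of named scenarios that either still pass or no longer do.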

On benchmarks and real conditions

Most published benchmarks for tool-calling agents test narrow, well-specified scenarios. They're useful for comparing models but not reliable predictors of production behavior. Real business data is messier, user queries are more ambiguous, and the specific combination of tools your system uses won't match whatever was tested.

The only evaluation that tells you how your system behaves in your context is testing your system in your context. That sounds obvious, but the number of teams that ship based primarily on published benchmark scores is non-trivial.

We've published our internal evaluation framework for tool-calling agents separately — it's available on request and not behind a paywall. If you're designing a test harness from scratch, it may save you some iteration time.

What we don't know

The field is moving fast enough that some of what we've written here will be outdated within a year. Model providers are actively improving tool-calling reliability. Context window sizes are increasing. Some failure modes we've described may be significantly mitigated by the time you're reading this.

We'll update this article when our testing shows meaningfully different results. The version date is in the header.
