Executive Summary
Current research into AI agent performance suggests that “tool design” has replaced “tool use” as the primary bottleneck for agentic systems. Failures are often attributed to the underlying model’s reasoning capabilities, yet evidence suggests that a significant portion of production errors (approximately 36.9%) stem from API, integration, or configuration issues. Case studies, including Claude 3.5 Sonnet’s state-of-the-art performance on SWE-bench Verified, indicate that precise refinements to tool descriptions and interfaces can improve benchmark results more effectively than increasing model size or chain-of-thought length.
This document outlines the six dimensions of agent-ready tool quality, analyzes empirical data from 2026 research papers, and establishes a prioritized diagnostic framework for debugging agent failures. The core takeaway is that a failure often looks like a model reasoning error when its root cause is actually a flawed tool contract.
The Interface as the Bottleneck
The effectiveness of an AI agent is fundamentally constrained by the tools it is provided. Industry leaders from Anthropic and OpenAI have converged on a set of design rules focused on “agent ergonomics” rather than vendor-specific architectures.
Anthropic Perspective: Engineering guidance suggests that detailed descriptions are the “most important factor” in tool performance. They caution that increasing the quantity of tools does not inherently improve outcomes.
OpenAI Perspective: Recommendations emphasize the use of JSON Schema with strict constraints, clearly named keys, and keeping the number of initially available functions small to maintain accuracy.
The Convergence: The transition from “tool use” to “tool design” signifies that the model’s ability to call a tool is no longer the scarce capability; rather, the scarce capability is the human designer’s ability to create a clear “behavior contract” for the model.
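As a minimal sketch of what a strictly constrained tool schema can look like, the Python below defines one tool and validates a call against it. The tool name `search_tickets`, its parameters, and the `validate_call` helper are all hypothetical illustrations, and the dictionary layout is only loosely modeled on JSON Schema style function definitions, not on any vendor's exact wire format. The validator covers just the few keywords used here; a real system would more likely rely on a full JSON Schema library.

```python
from typing import Any

# Hypothetical tool definition; names, fields, and limits are illustrative only.
SEARCH_TICKETS_TOOL: dict[str, Any] = {
    "name": "search_tickets",
    "description": (
        "Search support tickets by keyword. Use this when the user asks about "
        "existing tickets; do not use it to create or modify tickets. "
        "Returns at most `limit` matches, newest first."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Keywords matched against ticket titles.",
            },
            "status": {
                "type": "string",
                "enum": ["open", "closed", "any"],
                "description": "Restrict results to tickets in this state.",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 50,
                "description": "Maximum number of tickets to return.",
            },
        },
        "required": ["query", "status", "limit"],
        "additionalProperties": False,
    },
}

_PY_TYPES = {"string": str, "integer": int}


def validate_call(tool: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems with the call; an empty list means it is valid."""
    schema = tool["parameters"]
    props = schema["properties"]
    problems: list[str] = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter '{name}'")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"unknown parameter '{name}'")
            continue
        spec = props[name]
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"'{name}' must be of type {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"'{name}' must be one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            problems.append(f"'{name}' must be >= {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            problems.append(f"'{name}' must be <= {spec['maximum']}")
    return problems


good_call = {"query": "refund", "status": "open", "limit": 10}
bad_call = {"query": "refund", "status": "pending", "limit": 500, "page": 2}
print(validate_call(SEARCH_TICKETS_TOOL, good_call))
print(validate_call(SEARCH_TICKETS_TOOL, bad_call))
```

The description states when to use the tool and when not to, and the schema rejects out-of-range values and unknown keys instead of silently accepting them, which is one plausible reading of a “behavior contract.”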
Six Dimensions of Agent-Ready Tool Quality
A high-quality tool for an agent is more than an API wrapper; it is a behavior contract optimized for a stochastic caller. The research identifies six critical dimensions:
| Dimension | Description | Requirements |
| --- | --- | --- |
| Invocation Clarity | Understanding the “what, when, and why.” | Descriptions must explain the tool’s purpose, usage conditions, parameter meanings, and caveats. |
| Schema Discipline | Structural integrity of inputs. | Use of typed, constrained, and documented parameters via JSON Schema with explicit validation. |
| Operational Semantics | Behavioral hints. | Tools should expose their profile (e.g., readOnlyHint, destructiveHint, idempotentHint) to inform the model of consequences. |
| Response Sufficiency | Quality of output. | Returns high-signal fields and semantically meaningful identifiers while avoiding context waste or bulky payloads. |
| Error Recoverability | Feedback loops. | Error messages must be actionable, explaining what failed and how to retry, rather than surfacing raw stack traces. |
| Surface-Area Discipline | Workflow alignment. | Tools should match natural workflows rather than mirroring raw backend endpoints. |
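Three of these dimensions (operational semantics, response sufficiency, and error recoverability) can be illustrated together in a short Python sketch. Everything here is hypothetical: the `close_ticket` tool, the in-memory `_TICKETS` store, the `ToolResult` shape, and the chosen hint values are assumptions made for illustration, with only the hint names taken from the table above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolResult:
    """Hypothetical result envelope returned to the model."""
    ok: bool
    data: dict[str, Any] = field(default_factory=dict)
    error: str = ""
    retry_hint: str = ""


# Hypothetical backing store; the bulky field stands in for backend noise.
_TICKETS: dict[str, dict[str, Any]] = {
    "TICKET-101": {
        "title": "Refund not received",
        "status": "open",
        "internal_blob": "x" * 5000,
    },
}

# Behavioral hints advertised alongside the tool (values are assumptions):
# it writes state, is not treated as destructive, and is safe to repeat.
CLOSE_TICKET_HINTS = {
    "readOnlyHint": False,
    "destructiveHint": False,
    "idempotentHint": True,
}


def close_ticket(ticket_id: str) -> ToolResult:
    """Close a ticket and return only the fields the model needs next."""
    ticket = _TICKETS.get(ticket_id)
    if ticket is None:
        # Actionable error: say what failed and how to recover, no stack trace.
        return ToolResult(
            ok=False,
            error=f"No ticket with id '{ticket_id}'.",
            retry_hint=(
                "Ticket ids look like 'TICKET-101'. Call search_tickets to "
                "find a valid id, then retry close_ticket with that id."
            ),
        )
    ticket["status"] = "closed"  # Repeating this call leaves the same state.
    # High-signal response: a meaningful identifier and status, no raw blob.
    return ToolResult(
        ok=True,
        data={"ticket_id": ticket_id, "title": ticket["title"], "status": "closed"},
    )


print(close_ticket("TICKET-101"))
print(close_ticket("101"))
```

The failure branch tells the caller which tool to use to recover, and the success branch omits the large internal field so the response does not consume context without adding signal.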
Empirical Evidence from 2026 Research
Three pivotal studies conducted in 2026 provide quantitative backing for the importance of tool descriptions:
From Docs to Descriptions: An evaluation of 10,831 MCP servers found that functional and accurate descriptions increased LLM tool selection probability by 11.6% and 8.8%, respectively. Standard-compliant descriptions reached a 72% selection probability compared to a 20% baseline.
Model Context Protocol Tool Descriptions Are Smelly!: This study of 856 tools found that 97.1% contained at least one “description smell.” Augmenting these descriptions improved task success by a median of 5.85 percentage points, though it increased execution steps by 67.46%, suggesting that “optimal signal density” is superior to maximal verbosity.
Don’t Believe Everything You Read: Analysis of 10,240 MCP servers revealed that 13% had substantial mismatches between descriptions and actual code, leading to hidden state mutations or undocumented privileged operations.
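One simple way to probe for the kind of description-versus-code mismatch described above is to run a tool that is documented as read-only against a copy of some state and check whether the copy changed. The Python sketch below is not the method used in the cited study; the helper `mutates_state`, the two example tools, and the `STATE` dictionary are hypothetical, and the check only catches mutations that are visible in the state object passed in.

```python
import copy
from typing import Any, Callable


def mutates_state(
    tool_fn: Callable[[dict[str, Any]], Any], state: dict[str, Any]
) -> bool:
    """Run tool_fn on a deep copy of state and report whether the copy changed."""
    working = copy.deepcopy(state)
    tool_fn(working)
    return working != state


# Two hypothetical tools, both described to the model as read-only.
def get_balance(state: dict[str, Any]) -> int:
    return state["balance"]


def get_balance_with_side_effect(state: dict[str, Any]) -> int:
    state["audit_log"].append("balance read")  # Undocumented write.
    return state["balance"]


STATE: dict[str, Any] = {"balance": 100, "audit_log": []}

for tool in (get_balance, get_balance_with_side_effect):
    verdict = "mutates state" if mutates_state(tool, STATE) else "read-only"
    print(f"{tool.__name__}: {verdict}")
```

A check like this could run in a test suite for every tool whose description or hints claim it is read-only; it does not detect writes to external systems such as databases or files.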
Case Studies in Tool Refinement
Several real-world examples illustrate that fixing the interface often resolves what appears to be a model reasoning error:
Anthropic Web-Search Tool: Early versions of the tool resulted in Claude needlessly appending “2025” to search queries, which degraded performance. The issue was resolved not through retraining, but by improving the tool description to steer the model correctly.
SWE-bench Verified: Claude 3.5 Sonnet achieved state-of-the-art results primarily through “precise refinements to tool descriptions,” which reduced error rates and improved task completion.
Slack and Asana Integrations: Internal tool optimization at Anthropic showed that iteratively evaluating and revising tool implementations, rather than changing the model, led to significant improvements on held-out test sets.
Identifying the Model Bottleneck
Despite the emphasis on tool design, the model remains the bottleneck in specific scenarios. The “Knowing-Doing Gap” describes cases where a model recognizes that a tool is needed but fails to issue the correct call.
Mismatch Rates: Research shows tool-necessity mismatches of 26.5% to 54.0% in arithmetic and 30.8% to 41.8% in factual QA.
Cognitive Limitations: Industry experts like Andrej Karpathy and Dario Amodei maintain that agents are still “cognitively lacking” and that making them safe and predictable remains the most significant challenge.
Model Selection: High-complexity or ambiguous tools still require frontier models (e.g., Opus) to navigate successfully.
Prioritized Diagnostic Framework
To efficiently debug agent failures, developers should follow a specific order of operations that prioritizes the interface over the model:
Audit the Tool Contract: Examine descriptions, schemas, parameter names, and examples. Ensure the model is explicitly told when and how to use the tool.
Analyze Tool Responses: Ensure the output provides the necessary information for the next step without burying it in “context waste.”
Evaluate Tool Errors: Determine if error responses teach the model how to recover or if they are formatted for humans.
Review Catalog Size: Manage the context budget. Load only the tools that are needed; consider mechanisms that always load tool descriptions but load tool bodies only on demand.
Modify Prompts and Model Choice: Only after the tool interface is optimized should developers resort to swapping models or rewriting core prompts.
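The ordering above can be sketched as a small triage function that walks the checks in priority order and returns the first step that fails. This is a rough illustration under stated assumptions: the `FailureReport` fields, the two thresholds, and the function `first_suspect` are hypothetical placeholders, and in practice each check would be a manual review or a richer evaluation rather than a single boolean.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureReport:
    """Hypothetical summary of one failed agent run."""
    description_says_when_to_use: bool
    response_chars: int
    error_has_retry_hint: bool
    tools_loaded: int


# Arbitrary placeholder thresholds; tune them for a real system.
MAX_RESPONSE_CHARS = 4000
MAX_TOOLS_LOADED = 20

# Checks listed in the same priority order as the framework above.
CHECKS: list[tuple[str, Callable[[FailureReport], bool]]] = [
    ("Audit the Tool Contract", lambda r: r.description_says_when_to_use),
    ("Analyze Tool Responses", lambda r: r.response_chars <= MAX_RESPONSE_CHARS),
    ("Evaluate Tool Errors", lambda r: r.error_has_retry_hint),
    ("Review Catalog Size", lambda r: r.tools_loaded <= MAX_TOOLS_LOADED),
]


def first_suspect(report: FailureReport) -> str:
    """Return the first framework step whose check fails for this report."""
    for step, passes in CHECKS:
        if not passes(report):
            return step
    # Only when every interface check passes do prompts and model come up.
    return "Modify Prompts and Model Choice"


noisy_run = FailureReport(
    description_says_when_to_use=True,
    response_chars=12000,
    error_has_retry_hint=False,
    tools_loaded=45,
)
clean_run = FailureReport(
    description_says_when_to_use=True,
    response_chars=500,
    error_has_retry_hint=True,
    tools_loaded=5,
)
print(first_suspect(noisy_run))
print(first_suspect(clean_run))
```

Because the checks are evaluated in order, a run with several interface problems is attributed to the earliest one, which mirrors the intent of fixing the tool interface before touching prompts or the model.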











