Everything you wanted to know about Tool / MCP Calling in Large Language Models
Tool calling is arguably the closest thing AI has yet produced to a killer-app experience, making LLMs genuinely useful for real-world applications, and its rise has only been accelerated by the popularity of MCP. Yet despite its ubiquity, the mechanisms behind it remain opaque to many developers. How does a text predictor learn to use APIs? What's actually happening when you see those "reading file" or "running analysis" messages? And why do some tool calls fail spectacularly while others work like magic?
In this deep dive, we'll demystify tool calling from the ground up. We'll peek behind the curtain to see the actual mechanisms that bridge natural language to API calls.
We will go from a high-level overview, through the nitty-gritty details of how it works under the hood, and finally look at some example implementations.
Finally, we will look at how DeepFabric can help you generate high-quality datasets to train and fine-tune your own tool-calling models, perfect for building your own custom AI agents.
Taxonomy (Tools, Function Calling, MCP)
Before we go any further, let's get some taxonomy in place. At present we have Tools, Function Calling, and the Model Context Protocol (MCP). Each of these terms refers to much the same underlying concept, but there are subtle differences around formatting, structure, and implementation.
The term "Function Calling" originated with OpenAI's implementation, where models could call specific functions defined by the developer. "Tools" is a more general term that followed, encompassing not just functions but any external capability an LLM can invoke. Model Context Protocol (MCP), introduced by Anthropic, represents a more structured and opinionated approach to Tools. MCP isn't just about calling functions; it's a standardized protocol for persistent connections between LLMs and external systems. It defines how servers expose resources, how clients discover capabilities, and how they maintain stateful connections. Think of MCP as the difference between making individual REST API calls versus maintaining a WebSocket connection with a service.
For the purposes of this article, we'll primarily use "Tools" as our catch-all term, but I'll highlight specific differences where they matter for implementation.
High Level Overview
At a high level, Tools are a way for an LLM to interact with external systems, but in a structured way. An LLM is made aware of Tools at its disposal, and how the Tool(s) should be called. In turn, it can call these Tools at its own discretion and get back structured data that it can then process and incorporate into a response or reasoning chain.
This all happens through an orchestration of prompts, special tokens, and structured outputs and involves an application (orchestrator), the inference system and of course the Large Language Model itself. When you send a message to an LLM with tools enabled, you're not just sending your prompt; you're also sending schemas that describe what tools are available and how to use them.
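As a rough sketch of what that request can look like on the wire (the field names here follow OpenAI's Chat Completions API, and get_weather is a hypothetical tool; other providers structure this differently):

```python
# A chat request with a tool attached. OpenAI-style field names are used here;
# other providers use different shapes. get_weather is a hypothetical tool.
request_payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "City name, e.g. 'Berlin'",
                        }
                    },
                    "required": ["city"],
                },
            },
        }
    ],
}
```

The model never executes anything itself; it only sees a rendered version of this payload and decides whether its next tokens should be plain text or a structured tool call.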
To summarise: Tools allow models to access up-to-date information and/or to perform actions or computations that go beyond text generation. This includes making API calls, querying databases, executing code, sending emails, or even controlling other software.
Why is Tool Calling Needed?
Tool calling transforms LLMs from static knowledge bases into dynamic agents that can interact with the world. Consider a customer service scenario: without tools, an LLM can only provide generic responses based on its training. With tools, it can look up specific account information, check real-time inventory, process returns, and update customer records. This shift from passive response to active engagement is what makes tool calling so powerful.
How Models Learn to Use Tools
Understanding how models actually learn to use tools helps explain both their capabilities and limitations. During training, models are now exposed to millions of examples of tool use through a process called instruction tuning. These examples teach the model to recognize patterns: for example, when a user asks about weather, the model learns to invoke weather tools; when asked to calculate, it learns to use calculator functions.
The training process involves several stages. First, models undergo pre-training on vast text corpora where they learn language patterns and world knowledge. Then, during instruction tuning, they're specifically trained on examples that include Tool calls. These examples teach the model how to parse user intent, select appropriate tools, format proper tool calls, and incorporate tool results into responses. Some models also undergo reinforcement learning from human feedback (RLHF) where human raters specifically evaluate the quality of tool use.
Modern approaches use synthetic data generation to create diverse tool-calling scenarios. This is where systems generate millions of examples of tool use across different domains, helping models generalize beyond their original training examples. The quality of this synthetic data significantly impacts the model's ability to use tools correctly in production and is where DeepFabric's dataset generation capabilities can be particularly valuable, as it provides datasets customized to leverage specific tools and APIs.
Functions and Schemas
So Tools are essentially just functions that the LLM can call, but they need to be described in a way the model can understand. Each tool has a defined schema that specifies its name, description, input parameters, and expected output format. This schema serves as a contract between the LLM and the external system.
The schema typically includes several key components. The function name should be clear and descriptive, like get_weather or search_database. The description is crucial as it helps the model understand when to use this tool versus others. It should include details about what the function does, when it should be used, and any important limitations. The parameters section defines what inputs the function expects, using JSON Schema for detailed specifications. This includes parameter types (string, number, boolean, array, object), constraints (minimum/maximum values, string patterns, enum values), whether parameters are required or optional, and descriptions for each parameter to guide the model.
Here's a more comprehensive example of a complex tool schema. This sketch uses OpenAI-style function definitions with a hypothetical search_products tool; the exact envelope varies by provider, but the JSON Schema ideas carry across:
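```json
{
  "type": "function",
  "function": {
    "name": "search_products",
    "description": "Search the product catalog. Use this when the user asks about product availability, pricing, or specifications. Returns at most 'limit' results; it does not place orders.",
    "strict": true,
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "Free-text search terms, e.g. 'wireless noise-cancelling headphones'"
        },
        "category": {
          "type": "string",
          "enum": ["electronics", "clothing", "home", "sports"],
          "description": "Restrict the search to one catalog category"
        },
        "max_price": {
          "type": "number",
          "minimum": 0,
          "description": "Only return products at or below this price, in USD"
        },
        "in_stock_only": {
          "type": "boolean",
          "description": "If true, exclude items that are out of stock"
        },
        "limit": {
          "type": "integer",
          "minimum": 1,
          "maximum": 50,
          "description": "Maximum number of results to return"
        }
      },
      "required": ["query"]
    }
  }
}
```

(One caveat: OpenAI's strict mode specifically also requires every property to appear in required, with optional fields expressed as nullable types, and additionalProperties set to false; the sketch above relaxes that for readability.)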
The "strict" field is particularly interesting. When set to true, it tells the model to strictly adhere to the schema without adding extra fields or deviating from the specified format. This helps reduce errors but can sometimes limit the model's flexibility in handling edge cases.
The Tool Calling Process
When a user sends a message to an LLM with tools enabled, a flow of orchestration begins. First, the system combines the user's message with the available tool schemas into a specially formatted prompt. This prompt uses model-specific formatting that helps the LLM understand the context and available options.
The model then processes this combined input and makes a decision. It might determine that no tools are needed and respond directly with text. It might identify that one or more tools would help answer the query. Or it might ask clarifying questions before proceeding with tool use. This decision-making process happens through the model's learned patterns from training, not through explicit programming.
When the model decides to use a tool, it generates a structured output indicating the tool call. This output needs to specify which tool to call, what arguments to pass, and sometimes includes the model's reasoning about why this tool is appropriate. The format varies by provider but typically resembles something like:
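```json
{
  "id": "call_abc123",
  "type": "function",
  "function": {
    "name": "get_weather",
    "arguments": "{\"city\": \"Berlin\", \"unit\": \"celsius\"}"
  }
}
```

Note that in OpenAI's API the arguments field arrives as a JSON-encoded string rather than a nested object, so the orchestrator has to parse it a second time; some other providers return a plain object, which is a common source of confusion when switching between them.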
The orchestrator, which is the system managing the interaction between the user, LLM, and tools, then takes over. It parses the model's output to extract the tool call information, validates the call against the tool's schema, executes the actual function with the provided parameters, captures the result, and formats it for the model to process.
This is where things can go wrong. The model might generate malformed JSON, use incorrect parameter names, wrap the JSON in unexpected delimiters, provide values that don't match the expected types, or hallucinate tool names that don't exist. Good orchestration systems include robust error handling to catch these issues.
When a tool call fails, the orchestrator typically sends an error message back to the model, describing what went wrong. The model can then attempt to correct its mistake and try again. This retry loop is crucial for reliability. Modern frameworks often implement sophisticated retry logic with exponential backoff, context enrichment (adding more details about the error), and fallback strategies.
API to Model Mapping
Understanding how tool calls work at the token level reveals the simplicity underlying this complex system. LLMs don't actually "understand" JSON or function calls in the way we might think. Instead, they predict tokens that happen to form valid JSON structures because they've been trained on millions of examples.
Each model family uses specific tokens and formatting conventions to handle tool calls. These special tokens act as signals to the model, switching it between different modes of operation. When the model sees a tool definition, special tokens tell it "this is a tool you can use." When it needs to call a tool, it generates different special tokens that mean "I'm about to make a tool call."
The chat template system is where this all happens. Each model has a template that defines how to format conversations, including system messages, user messages, assistant responses, and tool interactions. These templates transform the high-level conversation into the specific token sequences the model was trained on.
These templates can be discovered and explored using HuggingFace's transformers library, specifically the apply_chat_template method. This method is incredibly useful for understanding what's actually happening under the hood:
We can use this method to discover a model's chat format. For example, with a Qwen instruct checkpoint (the exact model name below is illustrative, and some Qwen checkpoints will also prepend a default system message):
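```python
from transformers import AutoTokenizer

# Any Qwen-family instruct checkpoint works here; the exact name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "user", "content": "What is 2+2?"}
]

# tokenize=False returns the formatted string instead of token IDs, and
# add_generation_prompt=True appends the assistant header the model expects
# before it starts generating.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```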
Stripping away any default system turn, the output looks like this:
```
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
```
From here we can see that Qwen expects messages to be wrapped in special tokens (<|im_start|> and <|im_end|>) along with the role of the message. This format, originally from OpenAI's ChatML, helps the model understand the context of the conversation. While OpenAI no longer uses ChatML (their current format is proprietary), many open-source models have adopted and extended it.
Now let's see what happens when we add tools to the mix. The apply_chat_template method also accepts a tools argument, and recent versions of transformers can build the JSON schema directly from a plain Python function's type hints and docstring (a sketch; get_weather is a stand-in that never gets executed here):
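```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def get_weather(city: str, unit: str = "celsius") -> dict:
    """
    Get the current weather for a city.

    Args:
        city: The city to look up, e.g. "Berlin".
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    ...

messages = [
    {"role": "user", "content": "What's the weather in Berlin?"}
]

# Passing tools= injects the JSON schema for get_weather into the prompt,
# using whatever tool-calling format this model's chat template defines.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```

For Qwen models, the rendered prompt places the schema inside the system turn and instructs the model to wrap its calls in <tool_call>...</tool_call> tags; other templates use their own conventions.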
What's particularly interesting is that different models use different special tokens for their reasoning and tool calling. Qwen's reasoning-oriented models, for example, have been trained with <think> tokens that allow them to reason through problems before making tool calls. Illustratively, the raw output (special tokens left in) looks something like this:
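```
<think>
The user wants the current weather in Berlin. I have a get_weather tool that
takes a city and an optional unit, so I should call it rather than guess.
</think>
<tool_call>
{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}
</tool_call>
```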
Different models handle this differently. Llama 3, for example, uses header tokens like <|start_header_id|> and <|end_header_id|> around each role. A sketch (the Meta Llama repositories on Hugging Face are gated, so you need to accept the license first):
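```python
from transformers import AutoTokenizer

# Requires accepting the Llama 3 license on Hugging Face before downloading.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [{"role": "user", "content": "What is 2+2?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
# Roughly:
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>
#
# What is 2+2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```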
The beauty of apply_chat_template is that it abstracts away these differences. You can write the same high-level code and the tokenizer handles the model-specific formatting. However, understanding what's happening under the hood helps explain why some models are better at tool calling than others—they've been trained with specific token patterns that make tool use more natural.
Here's a notebook-ready example that sketches the full cycle: render the prompt with a tool attached, generate a tool call, execute it, feed the result back, and generate the final answer. The model name and the get_weather stub are illustrative, and the parsing assumes Qwen's <tool_call> format, so adapt both to whatever you actually run:
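```python
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small instruct model that fits in a notebook; swap in whatever you prefer.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

def get_weather(city: str, unit: str = "celsius") -> dict:
    """
    Get the current weather for a city.

    Args:
        city: The city to look up, e.g. "Berlin".
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    # Stubbed result standing in for a real API call.
    return {"city": city, "temperature": 21, "unit": unit, "conditions": "partly cloudy"}

def generate(msgs):
    prompt = tokenizer.apply_chat_template(
        msgs, tools=[get_weather], tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

messages = [{"role": "user", "content": "What's the weather like in Berlin?"}]

# Step 1: with the tool attached, the model should answer with a <tool_call>
# block instead of plain text (small models won't always get this right).
first = generate(messages)
print(first)

# Step 2: parse the tool call. Qwen emits <tool_call>{...}</tool_call>;
# other model families wrap their calls differently.
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", first, re.DOTALL)
if match:
    call = json.loads(match.group(1))
    result = get_weather(**call["arguments"])

    # Step 3: feed the call and its result back. Qwen's chat template understands
    # assistant tool_calls and a "tool" role; other templates expect different shapes.
    messages.append({"role": "assistant", "content": "",
                     "tool_calls": [{"type": "function", "function": call}]})
    messages.append({"role": "tool", "content": json.dumps(result)})

    # Step 4: the model now writes a natural-language answer using the result.
    print(generate(messages))
```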
The reasoning tokens are particularly fascinating. Some models are trained to "think out loud" before making tool calls, generating internal reasoning that helps them make better decisions but isn't shown to the user. This chain-of-thought reasoning significantly improves tool selection accuracy. You can see this in action when you run the generation with skip_special_tokens=False, revealing the model's internal thought process.
To see those raw tokens for yourself, decode the output with skip_special_tokens=False. Here's a short snippet, reusing the model, tokenizer, and get_weather stub from the example above:
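```python
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather like in Berlin?"}],
    tools=[get_weather],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)

# Keeping the special tokens in the decode exposes the raw structure the model
# actually produced: role markers, any reasoning tags, and the tool-call wrapper.
print(tokenizer.decode(output[0], skip_special_tokens=False))
```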
Provider Implementations
The implementation details vary significantly across providers, and understanding these differences is crucial for building robust applications. Each provider has made different design decisions that reflect their philosophy and target use cases.
OpenAI's implementation is perhaps the most mature and widely adopted. They use a clean JSON-based API where tools are defined in a dedicated tools array, and the model's responses include a specific tool_calls field when functions need to be invoked. OpenAI supports parallel function calling, where the model can request multiple tool calls in a single response, significantly improving efficiency for complex queries. They've also introduced structured outputs with guaranteed JSON schema compliance when strict mode is enabled.
Anthropic takes a slightly different approach with their Claude models. Tool definitions are supplied through a tools parameter in the Messages API, but under the hood they are rendered into a special tool-use system prompt, which gives Anthropic flexibility in how tools are presented to the model. Their recent introduction of computer use capabilities extends tool calling to include screen interaction, marking a significant evolution in what "tools" can mean. Anthropic's implementation emphasizes safety and reliability, with careful attention to preventing harmful tool use.
Google's Gemini models support function calling through their Vertex AI platform. They use a similar JSON schema approach to OpenAI but with some unique features like automatic function call execution in certain modes. Gemini models can also ground their responses in Google Search results, blurring the line between traditional tool calling and retrieval augmented generation.
Open-source implementations vary widely in their sophistication. Models from Meta (Llama), Mistral, and Qwen each have their own conventions. The Hugging Face ecosystem has done tremendous work in standardizing these through the transformers library, but differences still exist. Some models require specific prompt formats to trigger tool use, while others have dedicated tokens. The quality of tool calling in open-source models has improved dramatically, with recent models approaching or matching proprietary performance.
For inference servers, the landscape is equally diverse. vLLM provides high-performance inference with support for various chat templates and tool calling formats. It relies heavily on the model's tokenizer configuration to handle tool calling correctly. Text Generation Inference (TGI) from Hugging Face offers server-side chat templating, making it easier to deploy models with tool support. Ollama provides a more lightweight approach, typically requiring manual prompt engineering for tool use but offering great flexibility.
The Model Context Protocol (MCP) represents Anthropic's attempt to standardize this chaos. Instead of each provider having their own format, MCP defines a common protocol for tool discovery, invocation, and result handling. It includes features like capability negotiation (servers advertise what they can do), stateful connections (maintaining context across multiple calls), standardized error handling, and progress reporting for long-running operations. While MCP is still gaining adoption, it points toward a future where tool calling might be more standardized across providers.
Error Handling and Best Practices
Robust error handling is essential for production tool calling systems. Errors can occur at multiple levels, and each requires different handling strategies. At the model level, the LLM might generate malformed JSON, use incorrect parameter names or types, hallucinate tool names that don't exist, or get stuck in retry loops. These errors require careful validation and clear error messages that help the model correct itself.
At the tool level, external APIs might be unavailable, rate limits might be exceeded, authentication might fail, or the tool might return unexpected results. These errors need graceful degradation strategies. Sometimes the model can work around a failed tool call by trying alternative approaches. Other times, it needs to inform the user that certain information is temporarily unavailable.
Here's a sketch of what that error handling can look like in practice. The tool registry, validation, and retry policy below are illustrative stand-ins you would adapt to your own stack:
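```python
import json
import time
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    func: Callable[..., Any]
    required_params: set

class ToolExecutor:
    def __init__(self, tools: list, max_retries: int = 3):
        self.tools = {t.name: t for t in tools}
        self.max_retries = max_retries

    def execute(self, raw_call: str) -> dict:
        """Parse, validate, and run a model-generated tool call.

        Returns a dict that is sent back to the model either way, so the
        model can try to recover from its own mistakes on the next turn.
        """
        # 1. Model-level errors: malformed JSON, unknown tool, bad parameters.
        try:
            call = json.loads(raw_call)
        except json.JSONDecodeError as e:
            return {"error": f"Malformed JSON in tool call: {e}. Please re-emit the call."}

        name, args = call.get("name"), call.get("arguments", {})
        tool = self.tools.get(name)
        if tool is None:
            return {"error": f"Unknown tool '{name}'. Available tools: {sorted(self.tools)}"}

        missing = tool.required_params - set(args)
        if missing:
            return {"error": f"Missing required parameters for {name}: {sorted(missing)}"}

        # 2. Tool-level errors: transient failures are retried with exponential backoff.
        for attempt in range(self.max_retries):
            try:
                return {"result": tool.func(**args)}
            except Exception as e:  # narrow this to your tools' real exception types
                if attempt == self.max_retries - 1:
                    return {"error": f"{name} failed after {self.max_retries} attempts: {e}"}
                time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s...
```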
Security is another critical consideration. Never execute arbitrary code from tool parameters, always sanitize inputs before passing them to external systems, implement proper authentication and authorization, use rate limiting to prevent abuse, and log all tool executions for audit purposes. Consider implementing sandboxing for code execution tools and careful prompt injection prevention for tools that interact with sensitive systems.
Performance optimization strategies can significantly improve the user experience. Implement caching for frequently called tools with predictable results, use parallel execution when multiple independent tools are needed, set appropriate timeouts to prevent hanging requests, and consider preemptively calling likely tools based on context. For example, if a user asks about weather, you might speculatively fetch weather data while the model is still processing.
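As a small sketch of two of those ideas, caching and parallel execution with per-call timeouts (the exchange-rate tool and the TOOLS registry are hypothetical):

```python
import asyncio
from functools import lru_cache

# Caching: identical lookups are answered from memory. Only safe for tools
# whose results don't change meaningfully between calls.
@lru_cache(maxsize=256)
def get_exchange_rate(base: str, quote: str) -> float:
    return 1.08  # stub; a real implementation would call an external API

TOOLS = {"get_exchange_rate": get_exchange_rate}

# Parallel execution: independent tool calls requested in one model turn run
# concurrently, each with its own timeout so one slow tool can't stall the rest.
async def run_tool_calls(calls: list, timeout: float = 10.0) -> list:
    async def run_one(call):
        # to_thread keeps blocking tool functions off the event loop.
        return await asyncio.wait_for(
            asyncio.to_thread(TOOLS[call["name"]], **call["arguments"]),
            timeout=timeout,
        )
    return await asyncio.gather(*(run_one(c) for c in calls), return_exceptions=True)

results = asyncio.run(run_tool_calls([
    {"name": "get_exchange_rate", "arguments": {"base": "EUR", "quote": "USD"}},
    {"name": "get_exchange_rate", "arguments": {"base": "GBP", "quote": "USD"}},
]))
```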
Practical Implementation Tips
When designing a tool calling system, thoughtful decisions about tool granularity and organization make a huge difference. Tools should be focused and do one thing well. Instead of a generic "database_query" tool, create specific tools like "get_customer_by_id" or "search_products". This reduces errors and makes the model's job easier.
Consider the cognitive load on the model when designing tool schemas. Clear, descriptive names and comprehensive descriptions are crucial. The model relies heavily on these descriptions to understand when and how to use each tool. Avoid abbreviations, be explicit about units and formats, and include examples in descriptions when the usage might be ambiguous.
Tool selection strategies can dramatically impact performance and cost. You don't always need to give the model access to all tools. Consider implementing tool routing based on the query type, dynamically selecting relevant tools based on context, and grouping related tools into toolsets that can be activated together. For example, customer service queries might activate a different set of tools than technical support queries.
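One lightweight way to implement that routing is to group tool schemas into named toolsets and choose a set per query before the model ever sees them. A sketch, with keyword matching standing in for a proper classifier or router model (all tool names here are hypothetical):

```python
# Hypothetical schemas; real ones would carry full descriptions and parameters.
get_invoice = {"type": "function", "function": {"name": "get_invoice", "parameters": {}}}
process_refund = {"type": "function", "function": {"name": "process_refund", "parameters": {}}}
search_docs = {"type": "function", "function": {"name": "search_docs", "parameters": {}}}
search_faq = {"type": "function", "function": {"name": "search_faq", "parameters": {}}}

TOOLSETS = {
    "billing": [get_invoice, process_refund],
    "technical": [search_docs],
    "general": [search_faq],
}

def select_toolset(user_message: str) -> list:
    """Pick which tools to expose for this query."""
    text = user_message.lower()
    if any(word in text for word in ("refund", "invoice", "charge", "payment")):
        return TOOLSETS["billing"]
    if any(word in text for word in ("error", "crash", "bug", "install")):
        return TOOLSETS["technical"]
    return TOOLSETS["general"]

# Only the selected subset is attached to the request, which keeps the prompt
# small and makes the model's choice easier.
tools = select_toolset("I was charged twice for my last invoice")
```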
Managing conversation state across tool calls requires careful design. Tools often need context from previous interactions, but passing entire conversation histories can be expensive and confusing. Implement a context management system that maintains relevant state between tool calls, summarizes long conversations to extract key information, and passes only necessary context to each tool.
Here's a sketch of a context-aware tool layer. The context store, the argument merging, and the summary format are all illustrative; in a real system the summary might be written by a smaller model:
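```python
import inspect
import json
from dataclasses import dataclass, field

@dataclass
class ConversationContext:
    """Keeps only the state tools actually need, rather than the full transcript."""
    facts: dict = field(default_factory=dict)    # e.g. {"customer_id": "C-1042"}
    history: list = field(default_factory=list)  # (tool_name, result) pairs

    def remember(self, key: str, value) -> None:
        self.facts[key] = value

    def summary(self, max_items: int = 5) -> str:
        """A compact summary to pass to tools or back to the model instead of
        the whole history. A real system might have a smaller LLM write this."""
        lines = [f"{k}={v}" for k, v in self.facts.items()]
        lines += [f"{name} -> {json.dumps(result)}" for name, result in self.history[-max_items:]]
        return "; ".join(lines)

class ContextAwareExecutor:
    def __init__(self, tools: dict, context: ConversationContext):
        self.tools = tools        # tool name -> callable
        self.context = context

    def execute(self, name: str, arguments: dict):
        # Fill in arguments the model omitted but the context already knows,
        # e.g. a customer_id established earlier in the conversation.
        accepted = set(inspect.signature(self.tools[name]).parameters)
        from_context = {k: v for k, v in self.context.facts.items() if k in accepted}
        result = self.tools[name](**{**from_context, **arguments})
        self.context.history.append((name, result))
        return result

# Usage: the model only supplies `status`; customer_id comes from context.
ctx = ConversationContext()
ctx.remember("customer_id", "C-1042")

def get_orders(customer_id: str, status: str = "open") -> list:
    return [{"order": "O-77", "customer": customer_id, "status": status}]  # stubbed

executor = ContextAwareExecutor({"get_orders": get_orders}, ctx)
print(executor.execute("get_orders", {"status": "shipped"}))
print(ctx.summary())
```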
Recent Developments and Future Directions
The field of tool calling is evolving rapidly. Recent developments have significantly expanded what's possible. Parallel function calling, now supported by several providers, allows models to request multiple tool calls simultaneously, dramatically improving efficiency for complex queries. Multi-modal tool calling extends beyond text, with models like GPT-4V and Gemini able to analyze images and trigger appropriate tools based on visual content.
Anthropic's computer use capability represents a paradigm shift in tool calling. Instead of pre-defined API calls, models can now interact with computer interfaces directly, clicking buttons, filling forms, and navigating applications. This opens up integration possibilities with legacy systems that lack APIs.
The trend toward autonomous agents is accelerating. Models are getting better at planning multi-step tool use, maintaining state across long interactions, and recovering from errors without human intervention. Projects like AutoGPT and BabyAGI demonstrate the potential for models to independently pursue complex goals using available tools.
Standardization efforts are gaining momentum. The Model Context Protocol aims to create a universal standard for tool interactions. OpenAPI and JSON Schema are becoming the de facto standards for tool definitions. There's growing interest in tool discovery mechanisms where models can automatically find and learn to use new tools.
Performance improvements continue to make tool calling more practical. Models are getting faster at generating tool calls, with some providers offering specialized routing models that quickly determine whether tools are needed. Inference optimization techniques like speculative decoding and guided generation ensure that tool calls are properly formatted on the first try.
Conclusion
Tool calling transforms Large Language Models from sophisticated text generators into capable agents that can interact with the world. Understanding the complete picture—from high-level concepts through implementation details to error handling—is essential for building robust applications.
The key takeaways from this deep dive include the importance of well-designed tool schemas that guide models effectively, robust error handling that gracefully manages the many failure modes, thoughtful context management that maintains state without overwhelming the model, and security considerations that prevent abuse while enabling functionality.
As the field continues to evolve, we can expect to see more sophisticated tool use patterns, better standardization across providers, and improved model capabilities for autonomous tool selection and error recovery. The distinction between different providers' implementations will likely blur as standards like MCP gain adoption.
For practitioners, the path forward involves starting with simple, well-defined tools and gradually increasing complexity, implementing comprehensive error handling from the beginning, monitoring tool usage to understand patterns and optimize performance, and staying current with rapidly evolving best practices and capabilities.
Tool calling is no longer an experimental feature but a production-ready capability that can transform how we build AI applications. By understanding both the theoretical foundations and practical implementation details, developers can create systems that leverage the full potential of modern LLMs while maintaining reliability, security, and performance.
Additional Resources
For those looking to dive deeper, explore the official documentation from OpenAI, Anthropic, and Google for provider-specific implementations. The Model Context Protocol specification provides insights into the future of standardized tool calling. Open-source projects like LangChain and LlamaIndex offer battle-tested implementations and patterns. The Hugging Face documentation on chat templates and tool use is invaluable for understanding open-source models.
Remember that tool calling is ultimately about bridging the gap between language understanding and real-world action. The best implementations are those that make this bridge invisible to users, providing seamless experiences that feel magical while being grounded in solid engineering practices.