Malecu | Custom AI Solutions for Business Growth

Tool Use for AI Agents: Actions, Retrievers, and Function Calling with OpenAI, Anthropic, and Google Models


Building capable AI agents isn’t only about choosing a powerful model. The real magic happens when models can reliably use tools: calling functions, retrieving the right information, and taking safe, auditable actions in the real world. In this definitive guide, you’ll learn how to design, implement, and scale tool-using agents across OpenAI, Anthropic, and Google models—plus the patterns, safeguards, and evaluation practices that separate demos from dependable production systems.

Whether you’re just starting or upgrading a mature AI stack, this article gives you a comprehensive blueprint with expert insights, practical recipes, and links to deeper dives on frameworks and multi‑agent orchestration.

Table of contents

  • What “tool use” means for AI agents
  • Architectural patterns for tool-using agents
  • Function calling in OpenAI, Anthropic, and Google
  • Retrieval tools and modern RAG for agents
  • Actions and external integrations: design for reliability
  • Provider capabilities compared: OpenAI vs Anthropic vs Google
  • Safety, governance, and reliability for tool use
  • Observability, evaluation, and cost control
  • Implementation recipes (OpenAI, Anthropic, Google)
  • Multi-agent orchestration and tooling in practice
  • Mini‑case study: Support agent with retrieval and actions
  • Actionable checklist and next steps

What “tool use” means for AI agents

Tool use allows an AI model to extend its built‑in reasoning with capabilities it doesn’t natively possess. Think of it as giving your model a Swiss Army knife:

  • Function calling is the contract that lets a model request a tool call by emitting structured arguments that your application executes. This is the bridge between language understanding and deterministic actions.
  • Actions are the actual operations carried out by tools—querying a database, sending an email, creating a support ticket, posting a transaction, or orchestrating a workflow.
  • Retrievers get the right information into the model’s context—documents, knowledge base entries, logs, CRM records—so the model answers accurately with up-to-date facts.

When people talk about “agents,” they usually mean a loop where the model reasons about a task, identifies which tools it needs, calls them, interprets the results, and keeps going until it reaches a goal. Tool use is what makes that loop productive, safe, and cost-efficient.

A strong agent implementation aligns three axes:

  1. Model prompting and instruction design;
  2. High‑quality tools and retrievers with clear contracts;
  3. Orchestration that governs sequencing, safety, and observability.

Architectural patterns for tool-using agents

Agents are not one-size-fits-all. Successful teams pick a pattern that matches problem complexity, available tools, and risk profile. Common patterns include:

  • ReAct (Reason + Act): The model alternates between thinking (reasoning tokens or hidden chain-of-thought) and acting (tool calls). Great for web research, data analysis, and multi-step tasks. Keep thoughts hidden from the end user for safety.
  • MRKL (Modular Reasoning, Knowledge, and Language): Tools are treated as specialized experts (math, search, code). The model routes subproblems to the right expert. Works well when tools have clear domains.
  • Plan-and-Execute: A planner model creates a plan; an executor follows it step by step using tools. Better for longer tasks and auditability; easier to checkpoint and recover.
  • Router/Selector: A lightweight model or rules engine chooses a single tool or policy, then a specialized sub-agent executes. Ideal for low-latency, high-volume workloads.
  • Supervisor + Sub‑agents: A top-level controller delegates tasks to sub‑agents (researcher, coder, summarizer) and integrates results. Useful for complex processes and team-like collaboration.

In practice, teams often hybridize these. For instance, a Plan-and-Execute superstructure can coordinate MRKL-like specialized tools, with a Router that short-circuits trivial queries to a FAQ retriever. The right choice depends on latency and reliability constraints, data sensitivity, and how deterministic your tools are.

Function calling in OpenAI, Anthropic, and Google

Function calling (often exposed as “tools” in APIs) is the contractual heart of tool use. You declare a set of functions (name, description, JSON schema for arguments). The model responds with a tool call request. Your application then executes the function and feeds the result back to the model as a tool response. The model continues reasoning with those results.

Key concepts you’ll see across providers:

  • Tool declaration: A JSON schema describing each function’s purpose and arguments. Clear descriptions and type constraints are crucial.
  • Tool call request: The model emits a structured object indicating which function to call and with what arguments.
  • Tool execution: Your code performs the action—querying systems, updating records, calling an API.
  • Tool result: You return structured output to the model, which continues reasoning or produces the final answer.
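The declare → request → execute → result cycle above can be sketched provider-agnostically. The `TOOLS` registry and message shapes below are illustrative stand-ins, not any specific vendor's API; the point is that malformed calls are fed back to the model as structured errors rather than crashing the loop:

```python
import json

# Illustrative registry: tool name -> (handler, required argument names).
TOOLS = {
    "lookup_invoice": (
        lambda args: {"invoice_number": args["invoice_number"], "status": "paid"},
        ["invoice_number"],
    ),
}

def execute_tool_call(call: dict) -> dict:
    """Execute one model-emitted tool call and build a tool-result message.

    `call` is assumed to look like {"name": ..., "arguments": "<json string>"}.
    """
    handler, required = TOOLS[call["name"]]
    args = json.loads(call["arguments"])
    missing = [k for k in required if k not in args]
    if missing:
        # Feed a structured error back so the model can retry with fixed args.
        return {"role": "tool", "name": call["name"],
                "content": json.dumps({"error": f"missing arguments: {missing}"})}
    result = handler(args)
    return {"role": "tool", "name": call["name"], "content": json.dumps(result)}
```

The key design choice is that the model never executes anything itself; it only emits a request, and your code decides whether and how to honor it.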

OpenAI, Anthropic, and Google each implement this similarly but with API‑specific details.

OpenAI (GPT‑4o family and successors)

OpenAI popularized function calling, now generalized as “tools.” You can define multiple tools, allow the model to select one or more, and require structured outputs for consistency.

Example tool definition (abbreviated):

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "create_support_ticket",
        "description": "Create a new support ticket in the helpdesk system.",
        "parameters": {
          "type": "object",
          "properties": {
            "customer_id": {"type": "string"},
            "subject": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]}
          },
          "required": ["customer_id", "subject"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

OpenAI supports:

  • Parallel tool calls when tasks logically fan out.
  • Tool selection modes (auto/specific/none) and structured outputs.
  • Streaming deltas that include tool call requests for responsive UIs.

For retrieval, you typically perform vector or hybrid search in your app, then supply documents in the prompt. OpenAI also supports structured output formats to reduce post‑processing friction.

Anthropic (Claude 3.5 family)

Anthropic’s Messages API exposes tool use (formerly “beta,” now broadly available). You provide tools with JSON schema definitions. Claude can:

  • Call zero, one, or multiple tools in a turn.
  • Return tool_use blocks you can parse deterministically.
  • Respect tool_choice settings (auto or a named tool).

A minimal tool snippet:

{
  "tools": [
    {
      "name": "lookup_invoice",
      "description": "Fetch invoice details by number.",
      "input_schema": {
        "type": "object",
        "properties": {"invoice_number": {"type": "string"}},
        "required": ["invoice_number"]
      }
    }
  ]
}

Anthropic focuses on helpfulness and safety. Claude 3.5 Sonnet supports large contexts (commonly up to 200K tokens), making it strong for retrieval-heavy tasks. As with OpenAI, your app executes tools and feeds results back as tool_result content for continued reasoning.
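Parsing Claude's reply is deterministic: scan the content blocks for tool_use, execute each, and answer with matching tool_result blocks that echo the call's id. The block shapes below follow Anthropic's documented format (`type`, `id`, `name`, `input`); the `run_tool` dispatcher is a stand-in for your own executor:

```python
import json

def build_tool_results(content_blocks, run_tool):
    """Turn Claude tool_use blocks into a user message of tool_result blocks."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # text blocks need no response
        output = run_tool(block["name"], block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # must echo the tool_use block's id
            "content": json.dumps(output),
        })
    return {"role": "user", "content": results}

# Hypothetical stand-in executor for illustration:
def run_tool(name, args):
    if name == "lookup_invoice":
        return {"invoice_number": args["invoice_number"], "total": 120.0}
    raise KeyError(name)
```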

Google (Gemini 1.5 family)

Google’s Gemini 1.5 models support function calling via tool declarations (functionDeclarations) and tool configuration. The model emits functionCall objects that you fulfill, then you pass functionResponse messages back. Gemini also offers enterprise features via Vertex AI, including connectors for data sources and optional grounding with Google Search in certain configurations.

A simple tool declaration (condensed):

{
  "tools": [{
    "functionDeclarations": [{
      "name": "search_orders",
      "description": "Search customer orders by email and date.",
      "parameters": {
        "type": "OBJECT",
        "properties": {
          "email": {"type": "STRING"},
          "from_date": {"type": "STRING"}
        },
        "required": ["email"]
      }
    }]
  }]
}

Gemini 1.5 Pro supports extremely large contexts—up to 1 million tokens in widely reported configurations—useful for long documents, transcripts, and codebases. Like the others, it can request multiple calls, and you control how results feed the next reasoning step.
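The functionCall/functionResponse round trip can be handled with a small mapper. The camelCase part shapes below follow Gemini's REST format as commonly documented; the `run_tool` executor is your own code, not part of the API:

```python
def answer_function_calls(parts, run_tool):
    """Map Gemini functionCall parts to functionResponse parts.

    `parts` is assumed to be a candidate content's parts list in REST
    camelCase form; non-function parts (e.g. text) are skipped.
    """
    responses = []
    for part in parts:
        call = part.get("functionCall")
        if call is None:
            continue
        result = run_tool(call["name"], call.get("args", {}))
        responses.append({
            "functionResponse": {"name": call["name"], "response": result}
        })
    return responses
```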

Retrieval tools and modern RAG for agents

Retrieval‑augmented generation (RAG) is the backbone of factual, up-to-date agents. Instead of forcing the model to “remember” everything, you fetch relevant knowledge and pass it to the model at generation time. Well‑designed retrievers improve accuracy, reduce hallucinations, and keep costs predictable.

Core components:

  • Embeddings and indexing: You convert documents into vector representations for semantic search. Popular stores include Postgres with pgvector, Pinecone, Weaviate, and Elasticsearch/OpenSearch (for hybrid lexical+vector).
  • Chunking strategy: Split documents into manageable segments so retrievers can return precise context. A common starting range is 150–400 tokens per chunk with 10–20% overlap; tune by document type and question style.
  • Hybrid search: Combine keyword (BM25) with semantic vectors. This often outperforms either alone, especially for technical queries with important terms.
  • Metadata filtering: Tag chunks with product, version, geography, customer tier, or PII flags. Filter first, then rank.
  • Summarization and compression: Before handing retrieved text to the model, trim or summarize to fit context budgets while preserving factual citations.
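The chunking guidance above (150–400 tokens with 10–20% overlap) can be approximated with a simple word-window splitter. Word counts are a rough proxy; a real pipeline would measure with the embedding model's tokenizer:

```python
def chunk_words(text: str, size: int = 250, overlap: int = 40):
    """Split text into overlapping word-window chunks.

    `size` and `overlap` are in words, a rough stand-in for tokens.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # final window already covers the tail
    return chunks
```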

A robust RAG pipeline for agents also handles:

  • Freshness: Incremental indexing and data source connectors so updates appear in minutes, not days.
  • Citations and grounding: Track source and line ranges so the agent can cite and you can audit.
  • Retrieval evaluation: Measure hit rates (did we fetch the right chunk?), answer quality, and latency.

For many production workloads, “multi‑step RAG” beats naive single‑query retrieval. The agent first clarifies or reformulates the query, retrieves in stages, and uses tools (like a SQL or analytics tool) to compute answers rather than paraphrase.
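One common way to implement the hybrid search described above is reciprocal rank fusion (RRF), which merges a lexical (BM25) ranking and a vector ranking without needing to normalize their incompatible scores; k=60 is the conventional smoothing constant:

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse best-first ranked lists of doc ids.

    Score(d) = sum over lists of 1 / (k + rank_of_d), ranks starting at 1.
    Documents appearing high in multiple lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```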

Actions and external integrations: design for reliability

If retrieval supplies facts, actions drive outcomes: booking a meeting, opening a ticket, writing to a database. As soon as agents can change the world, you need engineering discipline.

Design principles for action tools:

  • Idempotency: Repeating a tool call should not cause duplicated side effects. Use idempotency keys for external APIs and primary keys/unique constraints internally.
  • Validation and policy: Validate arguments against JSON schema and business rules before execution. Enforce role‑based access control (RBAC) and scopes for sensitive operations.
  • Timeouts and retries: Set per‑tool SLAs with exponential backoff. Surface failures to the model with structured, human‑readable error messages.
  • Transactional boundaries: For multi‑step operations, use sagas or compensating actions. Avoid partial writes that leave systems inconsistent.
  • Human‑in‑the‑loop (HITL): Require approvals for high‑risk actions. The model can draft; humans confirm.
  • Logging and audit: Persist every tool request, arguments, result, and final user-facing answer with trace IDs. This is essential for compliance and debugging.

For deterministic tasks (e.g., pricing calculations), keep logic in tools, not prompts. The model decides when to call the tool, but the tool executes the business logic exactly once.
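The idempotency and retry principles above combine naturally in one wrapper. The in-process dict here is an illustrative stand-in; a production system would back the key store with a durable database and pass the key through to external APIs:

```python
import hashlib
import json
import time

_RESULTS = {}  # idempotency key -> cached result (stand-in for a durable store)

def idempotency_key(tool_name: str, args: dict) -> str:
    """Derive a stable key from the tool name and canonicalized arguments."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_once_with_retry(tool_name, args, fn, attempts=3, base_delay=0.01):
    """Execute fn(args) at most once per key, retrying transient failures."""
    key = idempotency_key(tool_name, args)
    if key in _RESULTS:
        return _RESULTS[key]  # repeated call: no duplicated side effect
    for attempt in range(attempts):
        try:
            result = fn(args)
            _RESULTS[key] = result
            return result
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```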

Provider capabilities compared: OpenAI vs Anthropic vs Google

The three leading providers all support tool use, but they differ in emphasis and ergonomics. Here’s a high‑level, non‑exhaustive comparison based on widely documented capabilities as of late 2024.

| Capability | OpenAI (GPT‑4o family) | Anthropic (Claude 3.5 family) | Google (Gemini 1.5 family) |
| --- | --- | --- | --- |
| Function calling / tools | Yes (JSON schema; auto/specific; parallel) | Yes (tool_use blocks; auto/specific; multiple) | Yes (functionDeclarations; functionCall/Response; multiple) |
| Structured outputs | Yes (JSON modes, schema adherence) | Yes (JSON schema via tool IO; disciplined outputs) | Yes (JSON via function responses; schemas) |
| Streaming tool calls | Yes (stream deltas include tool calls) | Yes (streamed content blocks) | Yes (streamed responses with function calls) |
| Context window | Commonly up to ~128K tokens (model‑dependent) | Commonly up to ~200K tokens (model‑dependent) | Up to 1M tokens reported for Gemini 1.5 Pro |
| Retrieval approach | App‑side RAG; assistants/tools ecosystem | App‑side RAG; excels with longer contexts | App‑side RAG; Vertex AI connectors and grounding options |
| Safety features | System prompts, tool whitelists, policies | Constitutional AI emphasis; strong refusals | Safety filters, enterprise controls in Vertex AI |

All three can power production agents. In practice, your pick is often driven by your data size (context needs), response style preferences, governance requirements, and existing cloud/vendor commitments.

Safety, governance, and reliability for tool use

Tool use introduces real risk: leaking PII in a search query, posting wrong transactions, or automating at the wrong time. Governance must be engineered from day one.

Recommended guardrails:

  • Least privilege and allowlists: Agents should see only the tools they need. Enforce per‑role tool allowlists and per‑operation scopes.
  • Input and output filtering: Redact PII in prompts and tool arguments. Scan model outputs for policy violations before execution.
  • Deterministic schemas: Use strict JSON schemas for both tool inputs and outputs. Reject or correct malformed arguments.
  • Rate limits and cost controls: Throttle per user, per tool, and per tenant. Protect downstream APIs and control spend.
  • Execution sandboxing: For code tools or data transforms, isolate execution with resource ceilings and network egress controls.
  • Human approval checkpoints: Route high‑risk tools (refunds, deletes, compliance) through approval queues with clear context and a one‑click confirm.

These controls reduce failures and help you pass security reviews. Just as important, they make debugging faster because every decision and action is traceable.
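Strict schema checking before execution can start as small as a hand-rolled validator for the required/type/enum constraints used in the tool definitions above; a production system would typically use a full JSON Schema library instead:

```python
def validate_args(schema: dict, args: dict):
    """Check args against a minimal JSON-schema subset; return a list of errors."""
    errors = []
    type_map = {"string": str, "number": (int, float), "boolean": bool,
                "object": dict, "array": list}
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = type_map.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
    return errors
```

Returning a list of errors, rather than raising, lets you hand the full set back to the model as a structured correction request.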

Observability, evaluation, and cost control

You can’t improve what you can’t see. Tool‑using agents require full‑stack observability—from prompt to tool execution to final answer—so you can diagnose failure modes and iterate quickly.

Metrics and practices to adopt:

  • Tracing and spans: Create a trace per user interaction. Add spans for model calls, each tool call, and retrieval queries. Include timing, token counts, and arguments (redacted as necessary).
  • Record‑replay: Persist full conversations and tool I/O so you can replay scenarios with new prompts, models, or tools and compare outcomes.
  • Golden datasets: Curate representative user journeys (happy paths, edge cases, high stakes). Use them for regression checks before every deployment.
  • Tool health: Track per‑tool errors, P95 latency, and success rates. Alert on spikes.
  • Cost telemetry: Attribute token and tool costs to users/tenants/features. Watch for drift when models or prompts change.

For evaluation, go beyond subjective “seems good.” Define measurable criteria: factual accuracy with citations, task completion rate, latency targets, and approval rates for HITL steps. Tie them to business KPIs—deflection rate, resolution time, customer CSAT—so you can quantify ROI and justify further automation.
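Per-interaction tracing with spans can be prototyped with a small context manager before adopting a full toolkit such as OpenTelemetry; the class and field names here are illustrative, not a standard:

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    """Collects timed spans (model calls, tool calls, retrievals) for one interaction."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name: str, **attrs):
        start = time.perf_counter()
        record = {"name": name, "attrs": attrs}
        try:
            yield record
        finally:
            # Record duration even when the wrapped call raises.
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)
```

Attributes passed to `span` are where redacted arguments, token counts, and tool results would go, keyed by the shared `trace_id`.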

Implementation recipes (OpenAI, Anthropic, Google)

This section shows compact, provider‑specific blueprints for a typical pattern: the model decides when to query a knowledge base and when to create a ticket.

OpenAI: function calling with retrieval handoff

Workflow sketch:

  1. Retrieve candidate documents using your retriever; summarize or compress.
  2. Build the system prompt with instructions and tool descriptions.
  3. Send a user message; allow tool calls; stream for responsiveness.
  4. On tool call, execute and append a tool result message; continue until final.

Minimal Python‑style pseudocode (conceptual):

system = """
You are a support agent. Use `search_kb` to find policies. Use `create_ticket` only when the user explicitly requests or when policy requires escalation. Always cite sources.
"""

tools = [
  {
    "type": "function",
    "function": {
      "name": "search_kb",
      "description": "Semantic search over the knowledge base.",
      "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "create_ticket",
      "description": "Create a support ticket.",
      "parameters": {
        "type": "object",
        "properties": {
          "customer_id": {"type": "string"},
          "subject": {"type": "string"},
          "priority": {"type": "string", "enum": ["low","medium","high"]}
        },
        "required": ["customer_id","subject"]
      }
    }
  }
]

messages = [
  {"role": "system", "content": system},
  {"role": "user", "content": "My invoice is wrong. Can you fix it today?"}
]

# Send to OpenAI chat or responses API with tools.
# On receiving a tool call, run it, append the tool result, and continue.

Tips: keep tool descriptions concise, use structured outputs for consistent JSON, and include clear policies in the system prompt (e.g., when to escalate to a ticket).
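The four-step workflow above can be wired into one loop. The message and tool-call shapes follow OpenAI's chat format, but `client` here is a plain callable standing in for one chat-completions request so the control flow stays visible; with the real SDK you would call `client.chat.completions.create(...)` at that point instead:

```python
import json

def run_agent(client, messages, tools, executors, max_turns=5):
    """Chat loop: send messages, execute any tool calls, repeat until final.

    `client(messages, tools)` stands in for one model request and returns
    an assistant message dict; `executors` maps tool names to functions.
    """
    for _ in range(max_turns):
        reply = client(messages, tools)
        messages.append(reply)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            return reply["content"]  # final user-facing answer
        for call in tool_calls:
            fn = call["function"]
            result = executors[fn["name"]](json.loads(fn["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    raise RuntimeError("agent did not finish within max_turns")
```

Capping the loop with `max_turns` is a cheap but important guard: a confused model can otherwise call tools indefinitely.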

Anthropic: tool_use with large‑context retrieval

Anthropic’s Claude handles long contexts well—great for policies, contracts, or manuals. The semantics are similar: define tools, allow tool use, and loop until completion.

Claude‑style content blocks (conceptual):

{
  "system": "You are a helpful support agent. Prefer citing policy excerpts.",
  "messages": [
    {"role": "user", "content": "Customer 123 needs an expedited refund."}
  ],
  "tools": [
    {
      "name": "search_policy",
      "description": "Search refund policies by keyword.",
      "input_schema": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
        "required": ["q"]
      }
    },
    {
      "name": "issue_refund",
      "description": "Issue a refund with reason code.",
      "input_schema": {
        "type": "object",
        "properties": {
          "customer_id": {"type": "string"},
          "amount": {"type": "number"},
          "reason": {"type": "string"}
        },
        "required": ["customer_id","amount","reason"]
      }
    }
  ]
}

When Claude emits a tool_use block for search_policy, run your retriever, compress results with citations, and return a tool_result block. Gate issue_refund behind policy checks and human approval if needed.

Google: Gemini function calls with enterprise connectors

In Google’s stack, you declare functionDeclarations and handle functionCall/Response messages. If you’re on Vertex AI, you can also tap into connectors for BigQuery, Google Drive, and more, or enable grounding where appropriate.

Conceptual flow:

  • Provide Gemini with the business policy summary in the system prompt and a function to search your KB.
  • When Gemini requests a function, run your retriever, then respond with functionResponse.
  • Keep the final answer concise with citations. For irreversible actions, require explicit confirmation.

A minimal functionResponse (illustrative):

{
  "name": "search_kb",
  "response": {
    "results": [
      {"id": "doc-88", "title": "Refund Policy v5", "snippet": "Expedited refunds allowed for Tier A within 72h...", "url": "https://kb.example.com/refunds"}
    ]
  }
}

As with other providers, isolate business logic in your tools, not the prompt. Gemini’s long context makes it strong for rich, multi-document reasoning when paired with precise retrieval.

Multi-agent orchestration and tooling in practice

Single agents can handle many workflows, but multi‑agent systems shine when tasks require specialization, parallelism, or oversight. A common pattern is a supervisor that delegates to sub‑agents—like a researcher, a data analyst, and a customer‑facing explainer—then synthesizes the final answer.

For an overview of the orchestration landscape and how control flows between agents and tools, see our comprehensive guide to agent frameworks and orchestration. If you’re comparing popular libraries, this side‑by‑side comparison of LangChain, LangGraph, AutoGen, and CrewAI in 2026 covers strengths, tradeoffs, and production readiness. And when you’re ready to design robust multi‑agent processes with memory and shared tools, dive into designing multi-agent workflows with LangGraph and CrewAI.

In multi‑agent setups, tool governance becomes even more important. Centralize tool registration and policy enforcement, share retrievers to avoid duplicate indexing, and standardize trace IDs across agents so you can observe the whole system in one view.

Mini‑case study: Support agent with retrieval and actions

Scenario: A B2B SaaS company wants to deflect Tier‑1 tickets and accelerate Tier‑2 resolutions. The agent should answer from policy, surface exact citations, and create tickets only when required.

Design:

  • Retriever: Hybrid search over product docs and support runbooks. Chunks at ~250 tokens with 15% overlap. Metadata tags: product area, version, customer tier, PII.
  • Tools: search_kb (read‑only), summarize_policy (read‑only), create_ticket (write), check_sla (read), email_user (write, HITL).
  • Policy: The agent must cite at least two sources for procedural answers. For refunds or data changes, require human approval.
  • Orchestration: Plan‑and‑Execute with ReAct. The planner drafts steps: clarify issue, search policy, confirm applicability, then decide on action.

Flow example:

  1. User: “My invoice has duplicate charges. Can you fix today?”
  2. Agent clarifies: “I can help. Which invoice number?”
  3. User: “INV‑44318.”
  4. Agent calls search_kb with query “duplicate charges correction turnaround invoice.”
  5. Retriever returns policy with a 24‑hour correction SLA and steps for verification.
  6. Agent cites the policy, explains the steps, and asks permission to create a ticket.
  7. User agrees. Agent calls create_ticket with subject, priority based on SLA, and embedded citations. Ticket ID is returned. Agent emails confirmation.

Results: The agent solves documentation‑based issues instantly, escalates appropriately with full context, and leaves an auditable trail of every decision, tool call, and citation. Over time, analytics show reduced median time to resolution and a higher self‑serve rate for Tier‑1 questions.

Actionable checklist and next steps

  • Define tools as contracts: name, description, strict JSON schema, idempotency guarantees, and clear error semantics.
  • Start with a minimal set: one retriever and one or two actions. Expand based on observed needs.
  • Choose a provider by workload: long‑context retrieval (consider Anthropic or Google), fast mixed tooling (OpenAI is strong), or enterprise integrations (Vertex AI connectors).
  • Build a robust RAG pipeline: tune chunk sizes, enable hybrid search, compress with citations, and evaluate retrieval quality.
  • Enforce guardrails: allowlists, RBAC, PII redaction, approval queues for risky actions, and strict schema validation.
  • Make everything observable: traces, token and cost telemetry, tool health metrics, record‑replay for regression testing.
  • Treat prompts as code: version them, test against golden datasets, and roll out with canaries.
  • Plan for failure: retries with backoff, compensating actions, and clear user messaging when automation defers to humans.
  • Iterate on real usage: instrument deflection rate, CSAT, and time‑to‑resolution; refine tools and retrieval where they move the needle.

Summary: Build agents that act, retrieve, and deliver value

Tool use is the difference between a chat demo and a dependable AI co‑worker. With function calling, your models can take meaningful actions through stable contracts. With modern retrievers and RAG, they answer with current, verifiable knowledge. And with strong governance, observability, and evaluation, you ship systems your teams and customers can trust.

OpenAI, Anthropic, and Google all provide the primitives you need; the winning solution aligns model choice with well‑designed tools, tuned retrieval, and an orchestration pattern that fits your problem. Start small, measure relentlessly, and expand confidently.

If you’d like expert help designing or upgrading your agent stack—from tool contracts to RAG pipelines to multi‑agent orchestration—our team builds reliable, custom solutions. Let’s scope your use case and get you to production with clarity and speed.

AI agents
function calling
retrieval (RAG)
OpenAI Anthropic Google
agent orchestration

Related Posts

Designing Multi‑Agent Workflows with LangGraph and CrewAI: Patterns, Memory, and Tooling

By Staff Writer

LangChain vs LangGraph vs AutoGen vs CrewAI: Which Agent Framework Should You Use in 2026?

By Staff Writer

Agent Frameworks & Orchestration: A Complete Guide

By Staff Writer