InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Code and data for the paper InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents.

Yaxin Du1, Yuanshuo Zhang1, Xiyuan Yang1, Yifan Zhou2, Cheng Wang1, Gongyi Zou4,
Xianghe Pang1, Wenhao Wang3, Menglan Chen1, Shuo Tang1, Zhiyu Li5, Feiyu Xiong5, Siheng Chen1,5

1Shanghai Jiao Tong University, 2The Chinese University of Hong Kong
3Zhejiang University, 4University of Oxford, 5MemTensor (Shanghai) Technology Co., Ltd

Abstract

Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering 6 representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Read more about InfoMosaic-Bench and InfoMosaic-Flow in our paper!

Examples

Features

🎯Benchmark
InfoMosaic-Bench

A novel benchmark evaluating whether agents can leverage diverse domain-specific tools for multi-source information seeking.

⚙️Pipeline
InfoMosaic-Flow

An automated two-stage synthesis pipeline that grounds tasks in domain-specific tools and refines them through verification and filtering.

💡Experiment
Tool Selection Is Lacking

Relying solely on web search is insufficient for precise reasoning. Current agents fail to use domain-specific tools robustly.

Methodology: InfoMosaic-Flow Synthesis Pipeline

Figure: Overview of the InfoMosaic-Flow synthesis pipeline.

The synthesis pipeline is built on an organizer-workers architecture, in which a single organizer acts as the commander and coordinates multiple domain-specific workers.

  • Stage 1: Information Seeking composes interdependent constraints and grounds them in verified multi-tool outputs to form initial QA pairs. The synthesizer proposes diverse scenarios, gathers verifiable domain evidence via tool-equipped executors, and integrates these results into a coherent multi-source problem that requires complex, cross-condition reasoning.
  • Stage 2: Iterative Refinement revises drafts, prunes shortcuts, and enforces multi-source reasoning. The Refiner drives this verification stage through Condition Decomposing, Condition Fuzzing to eliminate search shortcuts, and Concluding to ensure the final problem requires multi-step reasoning, as verified by a worker limited to web search.
  • Automated checks (Tool-Call Filtering, Answer-Evidence Consistency, and Coherence Filtering) and manual revision by human annotators remove ill-formed tasks and improve factual alignment, coherence, and difficulty. A rough sketch of the overall loop follows this list.
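
In the sketch below, the class and method names (propose_scenario, gather_evidence, decompose, fuzz, conclude, can_answer) are illustrative placeholders for the organizer-worker loop, not the released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Draft:
    """A synthesized task: question, gold answer, and tool-grounded conditions."""
    question: str
    answer: str
    conditions: list = field(default_factory=list)

def synthesize_task(organizer, workers, refiner, web_searcher, max_rounds=3):
    """Two-stage InfoMosaic-Flow loop (hypothetical interfaces, not the released code)."""
    # Stage 1: Information Seeking -- the organizer proposes a scenario and each
    # domain worker grounds one condition in verified tool output.
    scenario = organizer.propose_scenario()
    conditions = [worker.gather_evidence(scenario) for worker in workers]
    draft = organizer.compose(scenario, conditions)   # interdependent constraints -> QA draft

    # Stage 2: Iterative Refinement -- decompose and fuzz conditions, then keep the
    # task only if a web-search-only worker cannot shortcut it.
    for _ in range(max_rounds):
        parts = refiner.decompose(draft)              # Condition Decomposing
        parts = [refiner.fuzz(p) for p in parts]      # Condition Fuzzing: remove search shortcuts
        draft = refiner.conclude(draft, parts)        # Concluding: enforce multi-step reasoning
        if not web_searcher.can_answer(draft):        # verification by a web-search-limited worker
            return draft                              # non-trivial: keep the task
    return None                                       # still shortcut-able: filter it out
```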
Dataset and Benchmark: InfoMosaic-Bench


Key Statistics for InfoMosaic-Bench

The dataset contains 621 problems across 5 domains (medicine/biology, finance, maps, video, web), plus an additional set of explicitly cross-domain tasks. In total, InfoMosaic-Bench incorporates 77 distinct tools spanning 7 servers, combined with condition-level supervision, ensuring that the benchmark provides a challenging and reliable testbed for evaluating multi-source information seeking.
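
To inspect the data directly, it can be pulled from the Hugging Face Hub. The dataset ID comes from the citation below; the split and column names shown here are assumptions, so check the dataset card for the actual schema.

```python
# pip install datasets
from datasets import load_dataset

# Dataset ID from the citation at the end of this README; split name is assumed.
ds = load_dataset("Dorothydu/InfoMosaic_Bench", split="train")

print(len(ds))   # total number of tasks
print(ds[0])     # one record: question, gold answer, and condition-level test cases
```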

Figure: Data sources, ground truth (GT), and a test-case demo.

Evaluation Metrics

We report both Accuracy and Pass Rate. Accuracy measures strict end-to-end task success, reflecting whether the agent can complete information seeking and reasoning holistically. Pass Rate, in contrast, provides a more fine-grained view of agent performance based on the associated test cases (subquestions with gold answers or subgoal checks).
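
As a rough illustration of the two metrics (not the official scorer), per-task results could be aggregated as below, assuming each task record carries a strict-match flag and a list of test-case outcomes; the exact Pass Rate aggregation used in the paper may differ.

```python
def score(tasks):
    """Compute Accuracy and Pass Rate over evaluated tasks.

    Each task dict is assumed to carry 'correct' (bool, strict end-to-end
    answer match) and 'testcases' (list of bools, one per condition-level
    check). Illustrative only; the released evaluation script is authoritative."""
    accuracy = sum(t["correct"] for t in tasks) / len(tasks)
    pass_rate = sum(
        sum(t["testcases"]) / len(t["testcases"]) for t in tasks
    ) / len(tasks)
    return accuracy, pass_rate

# Example: one fully solved task, one with a wrong final answer but 1/2 checks passed.
print(score([
    {"correct": True,  "testcases": [True, True]},
    {"correct": False, "testcases": [True, False]},
]))  # -> (0.5, 0.75)
```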

Experimental Analysis: Agent Performance in Tool Selection

Main Results

InfoMosaic-Bench demonstrates that web search alone is insufficient for multi-source reasoning. Our experiments report results for 14 state-of-the-art LLM agents limited to a web-search tool. We observe that current agent systems perform poorly on this task: even the best closed-source model (GPT-5) attains only 38.2% accuracy and a 67.5% pass rate. Moreover, InfoMosaic-Bench reveals stark differences across domains, especially in the video, map, and multi-domain settings.

Domain Tool Analysis & Scaling Analysis

  • On average, domain tools yield only marginal gains, indicating that the bottleneck is not tool availability but tool use: how agents plan, select, parameterize, and time their calls.
  • Both GPT-5 and GLM-4.5 see clear gains in map and video domains because these tasks depend on structured, exclusive signals (e.g., spatial queries, video metadata) that web search cannot reliably provide.
  • Accuracy drops on multi-domain tasks with many tools, highlighting cross-source orchestration issues: selecting and chaining tools raises planning complexity and error propagation.
Failure Mode Analysis

We collect the results of tool calls and categorize them into four types: usage error (wrong function calling), selection error (wrong tool selection), invalid result (a successful call that is irrelevant or unhelpful), and valid result (a successful and useful call).
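
A minimal sketch of how one logged tool call could be bucketed into these four types; the judge functions are placeholders for rule- or LLM-based checks and this is not the paper's analysis code.

```python
def classify_tool_call(call, task, wrong_arguments, wrong_tool, is_relevant):
    """Bucket one logged tool call into the four categories above.

    The three judge functions are passed in because in practice they are
    rule- or LLM-based; this sketch only fixes the decision order."""
    if wrong_arguments(call):              # usage error: malformed function call
        return "usage_error"
    if wrong_tool(call, task):             # selection error: wrong tool for this need
        return "selection_error"
    if is_relevant(call.result, task):     # successful and useful
        return "valid_result"
    return "invalid_result"                # successful but irrelevant/unhelpful
```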

  • As the trend across models shows, better tool usage yields more useful information and leads to stronger model performance.
  • Tool usage error rates correlate with tool complexity: Bio and Multi-domain, whose tools take more parameters, exhibit higher usage-error rates. Finance and Multi-domain host the largest toolsets and show markedly higher selection-error rates, implying that larger tool inventories increase selection risk.
  • Most tool results are unhelpful and contribute little to answering the question.

Agent Problem-Solving Pattern Analysis

Analyzing the trajectories of agents solving dataset problems shows that advanced agents with higher evaluation scores already exhibit clear, structured thought processes and steps that are not prompted by humans. In particular, GPT-5 follows a specific, sequential long-range “Searching-Reasoning-Evaluating” trajectory when solving problems in the map domain: Broad Search → Targeted Information Retrieval → Solution Evaluation → Response Calibration. This action trajectory can be recovered by statistically counting the tools the agent calls during different time periods.
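
Such a trajectory statistic can be approximated by binning tool calls over time, roughly as sketched below; the field names and tool names are illustrative.

```python
from collections import Counter

def tool_calls_per_phase(trajectory, n_bins=4):
    """Count which tools are called in each time slice of a trajectory.

    `trajectory` is assumed to be an ordered list of dicts with a 'tool' key;
    splitting it into n_bins equal slices exposes phase-like patterns such as
    broad search early and answer calibration late."""
    bins = [Counter() for _ in range(n_bins)]
    for i, step in enumerate(trajectory):
        bins[min(i * n_bins // len(trajectory), n_bins - 1)][step["tool"]] += 1
    return bins

# Toy trajectory of 8 tool calls.
traj = [{"tool": t} for t in [
    "web_search", "web_search", "maps_search", "maps_details",
    "maps_details", "web_search", "maps_details", "web_search",
]]
print(tool_calls_per_phase(traj))
```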

Citation

      
        @article{du2025infomosaic,
            title={InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents},
            author={Du, Yaxin and Zhang, Yuanshuo and Yang, Xiyuan and Zhou, Yifan and Wang, Cheng and Zou, Gongyi and Pang, Xianghe and Wang, Wenhao and Chen, Menglan and Tang, Shuo and others},
            journal={arXiv preprint arXiv:2510.02271},
            year={2025}
        }
      
      
        @dataset{InfoMosaic_Bench,
            title = {InfoMosaic_Bench},
            author = {Dorothydu},
            year = {2025},
            publisher = {Hugging Face},
            url = {https://huggingface.co/datasets/Dorothydu/InfoMosaic_Bench}
        }