Code and Data for the paper: InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering 6 representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Read more about InfoMosaic-Bench and InfoMosaic-Flow in our paper!
A novel benchmark evaluating whether agents can leverage diverse domain-specific tools for multi-source information seeking.
An automated two-stage synthesis pipeline that grounds tasks in domain-specific tools and then refines them through verification and filtering.
Relying solely on web search is insufficient for precise reasoning, and current agents fail to use domain-specific tools robustly.
The synthesis pipeline is built on an organizer-workers architecture, in which a single organizer acts as the commander, coordinating multiple domain-specific workers.
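A minimal sketch of this organizer-workers pattern is shown below. The class and method names (Organizer, Worker, ground_condition) are our own illustration under stated assumptions, not the InfoMosaic-Flow codebase, and the stub lambdas stand in for real MCP tool calls.

```python
# Illustrative sketch of an organizer-workers synthesis loop.
# Names are hypothetical and do not mirror the InfoMosaic-Flow implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Worker:
    """Domain-specific worker that grounds a sub-condition in its own tool."""
    domain: str
    tool: Callable[[str], str]  # stand-in for a domain MCP tool call

    def ground_condition(self, request: str) -> dict:
        # The returned condition is backed by a tool output, so a later
        # verification stage can check it deterministically.
        return {"domain": self.domain, "request": request, "evidence": self.tool(request)}


@dataclass
class Organizer:
    """Single commander: decomposes a seed task and merges worker outputs."""
    workers: Dict[str, Worker]

    def synthesize(self, seed: str, plan: List[Tuple[str, str]]) -> dict:
        conditions = [self.workers[d].ground_condition(req) for d, req in plan]
        # A second (refinement) stage, not shown here, would verify the
        # composed task and filter shortcut cases solvable by a single lookup.
        return {"task": seed, "conditions": conditions}


# Toy usage with stub tools in place of real MCP servers.
organizer = Organizer(workers={
    "web": Worker("web", lambda q: f"search snippet for {q!r}"),
    "maps": Worker("maps", lambda q: f"maps-api result for {q!r}"),
})
print(organizer.synthesize(
    "Find the clinic nearest to the company's headquarters",
    [("web", "company headquarters address"), ("maps", "nearest clinic to that address")],
))
```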
The dataset contains 621 problems across 5 domains (medicine/biology, finance, maps, video, web), plus an additional set of explicitly cross-domain tasks. In total, InfoMosaic-Bench incorporates 77 distinct tools spanning 7 servers and provides condition-level supervision, making it a challenging and reliable testbed for evaluating multi-source information seeking.
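To explore the released data, a minimal loading sketch with the Hugging Face `datasets` library is given below. The split and field names are not specified here and should be taken from the dataset card.

```python
# Minimal sketch for loading InfoMosaic-Bench from the Hugging Face Hub.
# Split and column names are assumptions; see the dataset card at
# https://huggingface.co/datasets/Dorothydu/InfoMosaic_Bench for the schema.
from datasets import load_dataset

ds = load_dataset("Dorothydu/InfoMosaic_Bench")
print(ds)                                   # available splits and columns
first_split = list(ds.keys())[0]
print(next(iter(ds[first_split])))          # inspect one task instance
```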
We report both Accuracy and Pass Rate. Accuracy measures strict end-to-end task success, reflecting whether the agent can complete information seeking and reasoning holistically. Pass Rate, in contrast, provides a more fine-grained view of agent performance based on associated test cases (subquestions with gold answers or subgoal checks).
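The sketch below shows how these two metrics can be computed from per-task results. The `results` structure is our own illustration, not the official evaluation code.

```python
# Illustrative computation of Accuracy and Pass Rate.
# `results` is hypothetical: one entry per task, with an end-to-end success
# flag and a list of per-test-case booleans (subquestions / subgoal checks).
def accuracy(results):
    """Strict end-to-end task success rate."""
    return sum(r["solved"] for r in results) / len(results)


def pass_rate(results):
    """Fraction of associated test cases passed across all tasks."""
    total = sum(len(r["test_cases"]) for r in results)
    passed = sum(sum(r["test_cases"]) for r in results)
    return passed / total


results = [
    {"solved": True,  "test_cases": [True, True, True]},
    {"solved": False, "test_cases": [True, False, True]},
]
print(accuracy(results))    # 0.5
print(pass_rate(results))   # 0.833...
```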



After analyzing agent trajectories on the benchmark, we find that advanced agents with higher evaluation scores already exhibit clear, structured thought processes and steps without being prompted to do so. For example, GPT-5 follows a sequential, long-range "Searching-Reasoning-Evaluating" trajectory when solving map-domain problems: Broad Search → Targeted Information Retrieval → Solution Evaluation → Response Calibration. This action trajectory can be recovered by statistically analyzing how many tool calls the agent makes in different time periods.
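One simple way to reproduce this kind of statistic is to bin tool calls over normalized time within each trajectory and count calls per tool in each bin. The trajectory format below is hypothetical.

```python
# Sketch: recover an action trajectory by counting tool calls per time bin.
# The (step_index, tool_name) trajectory format is an assumption for illustration.
from collections import Counter


def tool_usage_by_phase(trajectory, n_bins=4):
    """Count tool calls in each normalized time bin of a single trajectory."""
    total_steps = max(step for step, _ in trajectory) + 1
    bins = [Counter() for _ in range(n_bins)]
    for step, tool in trajectory:
        b = min(int(step / total_steps * n_bins), n_bins - 1)
        bins[b][tool] += 1
    return bins


trajectory = [
    (0, "web_search"), (1, "web_search"), (2, "maps_search"),
    (3, "maps_place_details"), (4, "maps_distance"), (5, "answer_check"),
]
for i, counts in enumerate(tool_usage_by_phase(trajectory)):
    print(f"phase {i}: {dict(counts)}")
```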
@article{du2025infomosaic,
  title   = {InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents},
  author  = {Du, Yaxin and Zhang, Yuanshuo and Yang, Xiyuan and Zhou, Yifan and Wang, Cheng and Zou, Gongyi and Pang, Xianghe and Wang, Wenhao and Chen, Menglan and Tang, Shuo and others},
  journal = {arXiv preprint arXiv:2510.02271},
  year    = {2025}
}

@dataset{InfoMosaic_Bench,
  title     = {InfoMosaic_Bench},
  author    = {Dorothydu},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Dorothydu/InfoMosaic_Bench}
}