DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

## Overview

![[deep-search-benchmark 1.png]]

>[!summary]
> Current deep research agents score below 50% on both comprehensiveness and insight.

>[!question]
> How good are deep research agents? Can they be trusted?

>[!idea]
> Build a benchmark of 100 tasks (50 English / 50 Chinese) to evaluate various models and agents.

## 🔮Insights

>[!insight]
> Current agents can't be trusted to perform an exhaustive search: they report barely 50% of the relevant information.

>[!insight]
> Despite grounding, current agents still hallucinate 20% of the time.

>[!limitation]
> Ratings are produced with an LLM as a judge (Gemini 2.5 Pro), so the judge's own errors compound in the reported scores.

## 🧭 Topic Compass

### Where does X come from?

- AI Agent
- Tool use
- Reasoning models

### What is similar to X?

### What competes with X?

### Where can X lead to?

## 📖 References

### **Paper**

url: https://arxiv.org/abs/2506.11763

![[2506.11763v1.pdf]]