## Overview
![[deep-search-benchmark 1.png]]
>[!summary]
> Current deep-research agents score below 50% on both comprehensiveness and insight.
>[!question]
> How good are deep-research agents? Can they be trusted?
>[!idea]
> Build a benchmark of 100 tasks (50 English, 50 Chinese) to evaluate various models and agents.
## 🔮 Insights
>[!insight]
> The current set of agents can't be trusted to run an exhaustive search: they report barely 50% of the relevant information.
>[!insight]
> Despite grounding, the current set of agents still hallucinates 20% of the time.
>[!limitation]
> Ratings are produced by an LLM-as-a-judge (Gemini 2.5 Pro), so the judge's own errors compound with those of the agents being evaluated.
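
To make the compounding-error concern concrete, here is a minimal sketch assuming a simple misclassification model: the judge flags a claim as hallucinated either correctly or by mistake. The function name `observed_rate` and the judge error rates are hypothetical and do not come from the paper; only the 20% figure is taken from the insight above, and treating it as the true rate is itself an assumption.

```python
def observed_rate(true_rate: float, false_pos: float, false_neg: float) -> float:
    """Rate an imperfect judge would report for a given true rate.

    false_pos: chance the judge flags a correct claim as hallucinated.
    false_neg: chance the judge misses a real hallucination.
    """
    return true_rate * (1.0 - false_neg) + (1.0 - true_rate) * false_pos


true_hallucination = 0.20  # figure from the note, assumed here to be the true rate
judge_false_pos = 0.10     # hypothetical judge error rate
judge_false_neg = 0.10     # hypothetical judge error rate

reported = observed_rate(true_hallucination, judge_false_pos, judge_false_neg)
print(f"true rate: {true_hallucination:.2f}, reported by judge: {reported:.2f}")
```

Under these assumed error rates the reported figure drifts away from the true one, which is why a judge-based score is better read as an estimate than as a measurement.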
## 🧭 Topic Compass
### Where does X come from?
- AI Agent
- Tool use
- Reasoning models
### What is similar to X?
### What competes with X?
### Where can X lead to?
## 📖 References
### **Paper**
url: https://arxiv.org/abs/2506.11763
![[2506.11763v1.pdf]]