Benchmarking ARI: 76% win rate over OpenAI Deep Research, according to OpenAI's model

Today, you.com is proud to announce how ARI is redefining quality and accuracy in deep research agents. When we used OpenAI’s own model, o3-mini, as the LLM judge, it chose ARI’s output over OpenAI Deep Research’s more than 3 out of 4 times. We’re also launching new enterprise-only features and, for a limited time, unlimited ARI reports for new enterprise customers.
In February, we beta-launched ARI (short for “Advanced Research & Insights”), the world’s first deep research engine capable of processing more than 500 sources and delivering polished, professional-grade PDF reports in minutes.
Now, we’re launching ARI Enterprise—with enterprise-only features like custom data integrations and branded deliverables—and an upgraded ARI, with best-in-class performance measured by two objective, transparent, and publicly available benchmarks that raise the bar for the entire field of deep research agents. The first, FRAMES, was published by researchers at Google, Meta, and Harvard. The second, DeepConsult, is a novel benchmark we’re publishing today to evaluate longer-form answers to complex business and consulting queries.
The challenge: you.com’s ARI vs OpenAI’s Deep Research
When presented with two outputs for the same query, one generated by you.com’s ARI and the other by OpenAI’s Deep Research, which would win? To find out, we ran head-to-head comparisons using an LLM judge: OpenAI’s o3-mini, a top-performing reasoning model.
Benchmarking: How we built the eval queries and pair sets
We generated our benchmark query set for DeepConsult based on real user queries relevant to consulting, investing, and deep research. The benchmark focuses on business needs, including company analysis, strategic planning, M&A overviews, market research, and risk assessment.
We then ran each query through both you.com’s ARI and OpenAI’s Deep Research and asked OpenAI’s o3-mini to rate the outputs. o3-mini evaluated each query’s pair of outputs six separate times, to reduce position bias and produce a robust composite outcome: a win, loss, or tie. We used 102 queries and performed 612 tests in total.
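To make the protocol concrete, here is a minimal sketch of a single pairwise judging trial, assuming the OpenAI Python SDK. The prompt wording and verdict parsing are illustrative simplifications, not the exact DeepConsult implementation (the actual code is open-sourced; see the footnotes).

```python
# Minimal sketch of one pairwise judging trial (illustrative only; see the
# open-sourced DeepConsult repository in the footnotes for the real code).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating two research reports for the same query.

Query: {query}

Option A:
{option_a}

Option B:
{option_b}

Which report better answers the query? Reply with exactly one of:
"A", "B", or "TIE"."""


def judge_once(query: str, option_a: str, option_b: str) -> str:
    """Ask the LLM judge to pick a winner for one A/B ordering."""
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, option_a=option_a, option_b=option_b
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper()
```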
The result? ARI had a 76% win rate vs. OpenAI Deep Research’s 14% win rate, beating Deep Research by more than 5 to 1 when excluding ties.
Evaluating ARI against the industry standard, FRAMES
How does one even evaluate the “intelligence” of ARI? We chose to evaluate ARI’s output using FRAMES (which stands for “Factuality, Retrieval, And reasoning MEasurement Set”), developed independently by researchers at Harvard, Google DeepMind, and Meta. FRAMES was designed to rigorously test LLMs on three core capabilities—fact retrieval, reasoning across multiple constraints, and accurate synthesis of information into coherent responses. We employed Gemini-2.0-flash-001 as the LLM judge to assess the output.
While our beta version of ARI scored 62%, our newest version scores 80% accuracy on the FRAMES eval—a roughly 30% relative improvement that beats the top performers on FRAMES at the time of publishing.² ARI now has the best-known performance of any AI model on FRAMES, ahead of models from OpenAI, Perplexity, and others. For more on methodology, see the footnotes below.
Comparing ARI vs OpenAI on citations and sources
On DeepConsult, ARI demonstrated significant advantages over OpenAI across several metrics. ARI provided over 3x the number of citations overall, averaging more than 117 additional citations per report. ARI also sourced information from more than 5x as many unique web pages and 3.5x as many unique domains. ARI reports additionally included 2.5 charts on average, summarizing available data for clearer insights and positioning ARI as a comprehensive tool for research and reporting.
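As a rough illustration of how citation and source-diversity metrics like these can be computed, here is a small sketch; the input format (a flat list of cited URLs per report) is an assumption for illustration, not ARI’s internal representation.

```python
# Sketch: citation and source-diversity metrics from a report's cited URLs.
from urllib.parse import urlparse

def source_metrics(citations: list[str]) -> dict[str, int]:
    """Count total citations, unique pages, and unique domains for one report."""
    return {
        "total_citations": len(citations),
        "unique_pages": len(set(citations)),
        "unique_domains": len({urlparse(url).netloc for url in citations}),
    }

def average_metric(reports: list[list[str]], key: str) -> float:
    """Average one metric across many reports."""
    return sum(source_metrics(report)[key] for report in reports) / len(reports)
```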
What’s new for ARI: Key enhancements
1. Dramatically deeper, broader research
ARI now offers 4x greater depth and breadth compared to the beta version of ARI. It delivers 2x more unique citations per report, ensuring better traceability, auditability, and discovery of relevant sources. This drives richer reports with up to 35% more insights and facts and 60% more content in individual sections.
2. Business insights on the data that matters to you most
ARI generates insights from data on the web and from files wherever they live—Google Drive, OneDrive, and SharePoint.
With the launch of ARI Enterprise, connect ARI directly to your proprietary data sources for truly organization-specific insights, while knowing your data is safe and secure, with zero data retention. Or, connect ARI to external data sets like Pitchbook, Crunchbase, EDGAR, Capital IQ, and more.
3. Smarter, editable, and more collaborative workflows
ARI introduces preliminary research grounding, scanning all available sources before planning. This enables more relevant and targeted workflows. Users can now:
- Review and edit proposed research plans
- Adjust objectives and refine focus
- Ensure alignment before execution
4. Superior instruction following and synthesis
With enhanced synthesis engines, ARI delivers:
- Better adherence to complex instructions
- Richer, more comprehensive answers
- Reports that are readable, logically organized, and actionable
Practical impact: Use cases for decision-makers
ARI for financial analysts
- Market impact analysis: Leverage the latest data to assess trends and opportunities.
- Due diligence: Conduct deep-dive research across hundreds of sources in minutes.
- Automated reports: Generate citation-rich earnings and trend analyses effortlessly.
ARI for consultants
- Industry landscapes: Create comprehensive overviews tailored to client data.
- Technology assessments: Evaluate multi-source data quickly and accurately.
- Editable research plans: Provide bespoke client deliverables with ease.
Why it matters: The evolving enterprise research landscape
Enterprise leaders now face increasing demands that require rapid, accurate answers. ARI enables professionals to ask—and answer—more complex questions with confidence.
ARI Enterprise empowers users to go beyond surface-level research, delivering intelligence and deep insights on their company’s data, which drives smarter decisions and unlocks new opportunities.
Based on your feedback, we’ve evolved ARI to be the most trusted, collaborative AI partner. As a result, ARI Enterprise now delivers a full spectrum of business work products to meet your needs.
Get started with ARI today
Curious about ARI and the future of deep research? Try ARI for free today, or book a demo to learn more about ARI Enterprise. Our team is here to help you unlock ARI’s full potential.
Technical Appendix & Footnotes
1. For the deep research benchmark, DeepConsult, we open-source our methods and underlying datasets here: https://github.com/Su-Sea/ydc-deep-research-evals. For fairness and to reduce position bias, OpenAI’s Deep Research output was presented as “Option A” in 3 of the 6 tests for each query, and ARI’s output was “Option A” in the remaining 3. When ARI was preferred in at least 4 of the 6 tests, we marked the query as a win. Our full results, showing wins, ties, and losses against OpenAI Deep Research, are available in the repository linked above.
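As a simplified sketch of that per-query aggregation (reusing the illustrative `judge_once` helper from the earlier sketch): six trials with positions swapped after the first three, and a 4-of-6 threshold for a win. The tie/loss boundary below is an assumption for illustration; the repository contains the exact rule.

```python
# Simplified sketch of the per-query scoring: 6 trials, with ARI's report
# shown as "Option B" in the first 3 trials and "Option A" in the last 3.
def evaluate_query(query: str, ari_report: str, oai_report: str) -> str:
    ari_wins = 0
    for trial in range(6):
        if trial < 3:
            # Deep Research as "Option A", ARI as "Option B"
            ari_wins += judge_once(query, oai_report, ari_report) == "B"
        else:
            # Positions swapped: ARI as "Option A"
            ari_wins += judge_once(query, ari_report, oai_report) == "A"
    if ari_wins >= 4:
        return "win"
    # Assumed tie/loss split for illustration; see the repository for
    # the exact rule used in our published results.
    return "tie" if ari_wins == 3 else "loss"
```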
2. For the FRAMES benchmark, we evaluated our AI system (ARI) using two methodologies, the Open Deep Search eval method and the original FRAMES method:
Setup #1: Open Deep Search evaluation method
- We follow the method used in the Open Deep Search paper, which involves a SimpleQA prompt using Gemini-2.0-flash-001 as the LLM judge.
- With setup #1, we achieve an accuracy of 79.7%. We report this metric for comparison against other models benchmarked by the Open Deep Search evaluation method.
Setup #2: FRAMES evaluation method
- We follow the method used in the original FRAMES paper, which involves the FRAMES prompt with Gemini-1.5-pro as the LLM judge. Note that we use Gemini-1.5-pro-002, given the deprecation of the -001 version of Gemini 1.5 Pro, and make a minor modification: using XML delimiters for questions, responses, and ground truths within the FRAMES prompt.
- With setup #2, we achieve an accuracy of 80.7%.
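For illustration, here is a minimal sketch of a setup #2 grading call, assuming the google-generativeai Python SDK. The grader prompt below is a paraphrase using the XML delimiters described above, not the verbatim FRAMES prompt.

```python
# Sketch of a FRAMES-style grading call (setup #2). The prompt is a
# paraphrase, not the verbatim FRAMES grader prompt.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-pro-002")

GRADER_PROMPT = """Given the ground truth, decide whether the response
correctly answers the question. Reply with exactly "TRUE" or "FALSE".

<question>{question}</question>
<response>{response}</response>
<ground_truth>{ground_truth}</ground_truth>"""


def grade(question: str, response: str, ground_truth: str) -> bool:
    """Return True if the judge marks the response as correct."""
    result = judge.generate_content(GRADER_PROMPT.format(
        question=question, response=response, ground_truth=ground_truth))
    return result.text.strip().upper().startswith("TRUE")

# Accuracy is then the fraction of FRAMES questions graded True.
```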