October 16, 2025

Shortcut SOTA on SpreadsheetBench

Building on consistent top performance in spreadsheet automation, our verified benchmark result establishes Shortcut as the leading AI agent in Excel automation.

Posted by Shortcut Research Team

SpreadsheetBench Leaderboard

As of October 2025, Shortcut leads SpreadsheetBench as state-of-the-art (SOTA), with the first verified performance result through standardized API evaluation:

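| Rank | Model | Score | Status | Organization |
|------|-------|-------|--------|--------------|
| 1 | Shortcut.ai | 59.25% | Verified | Shortcut |
| 2 | Copilot in Excel (Agent Mode) | 57.2% | Unverified | Microsoft |
| 3 | ChatGPT Agent w/ .xlsx | 45.5% | Unverified | OpenAI |
| 4 | Claude Files Opus 4.1 | 42.9% | Unverified | Anthropic |
| 5 | ChatGPT Agent | 35.3% | Unverified | OpenAI |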

View Live Leaderboard

See real-time rankings and detailed benchmark results on the official SpreadsheetBench website

Setting the Record Straight

Shortcut has led SpreadsheetBench performance since launch. Upon release, the benchmark only supported internal evaluation and self-reported scores. We're now the first to officially submit our results for evaluation, achieving a verified score of 59.25%.

Although we had not previously published our results, other spreadsheet agent tools included estimates of our performance in their public reporting for comparison purposes.

State-of-the-Art Performance

Reporting our SOTA performance on the benchmark was made possible by the introduction of a standardized submission process. Verified results ensure:

  • Objective evaluation - No internal biases or cherry-picking of tasks
  • Continually updatable - Ability to submit consistent results over time
  • Distinct scoring criteria - Rigorous judge system ensures fair comparisons

Our results aim to set a standard for transparent performance reporting in Excel AI automation. Introducing standardized API access for evaluation allows for objective and reproducible results.

Our Approach

Shortcut builds multi-agent systems that perform in production environments. We first saw success at the Excel World Championships and on OSWorld, another benchmark built around human cognitive-load tasks, many of which include Excel work. Our team of researchers has led innovation in spreadsheet modeling with AI since the start, and we're excited to further validate our internal results and customer focus with this public result.

Understanding SpreadsheetBench

If you're new to benchmarks, read our companion article: What is a Benchmark?

SpreadsheetBench, introduced by researchers in their NeurIPS 2024 paper, evaluates large language model agents on 912 authentic questions gathered from online Excel forums. Unlike synthetic benchmarks, these tasks reflect actual problems that analysts, accountants, and business users encounter daily.

Technical Challenges

SpreadsheetBench tasks exhibit characteristics that make them particularly challenging for AI systems:

  • Multi-table structures - Understanding relationships across sheets without clear keys
  • Messy data - Real spreadsheets don't follow database rules
  • Mixed content - Charts, formatting, objects, and data in the same file
  • Complex formulas - Nested functions, arrays, cross-sheet references
  • Unclear questions - Users ask vague questions that need domain knowledge to interpret

Successfully completing these tasks requires end-to-end capabilities including parsing natural language, analyzing spreadsheet structure, identifying relevant formulas, generating correct solutions, and verifying outputs against expected results.

Real-World Implications

How to Interpret Scores

SpreadsheetBench uses real questions, so scores are a useful indicator of real-world performance. Benchmark scores typically start low to leave room for measuring progress: if early systems scored 90%+, the benchmark would already be saturated and unable to track improvements. Our 59.25% represents strong current capability while leaving clear room for continued advancement. As systems improve over time, scores should climb toward 100%, tracking the field's progress on tasks such as:

  • Multi-table financial analysis with cross-sheet formulas
  • Nested calculations with error handling
  • Data cleaning across messy formats
  • Report generation with conditional formatting
  • Formula-based workflow automation

Learn more about the implementation and tasks on SpreadsheetBench here.

Our Commitment

Our SOTA performance reflects our ongoing commitment to excellence in spreadsheet modeling. Leading agent performance is just one aspect of the overall product experience that sets us apart as a leader in the space. We offer the compounding advantage of devoting our research and product efforts to improving Shortcut. As we continue to lead the race to AI adoption in Excel work, we look forward to reporting further improvements in performance on SpreadsheetBench.

About Shortcut

Shortcut was built by Fundamental Research Labs, a research lab focused on building digital solutions for complex problems. Our team, led by a former MIT professor, developed proprietary techniques for spreadsheet understanding that translate to our leading benchmark performance. Learn more here.

Try Shortcut free, install the Excel plugin, or contact us about Enterprise.

FAQs

How is SpreadsheetBench evaluated?

SpreadsheetBench tests task completion on 912 real Excel forum questions. Each task has an input spreadsheet, user question, and expected output. Systems are scored on exact matches in formula correctness, data accuracy, and structure.
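
To make the idea of exact-match scoring concrete, here is a minimal Python sketch. It is not the official SpreadsheetBench evaluator: the `Sheet` representation (a dictionary from cell address to value) and the helper names `cells_match` and `pass_rate` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple, Union

CellValue = Union[str, int, float, bool, None]
# Hypothetical representation: maps a cell address such as "B2" to its value.
Sheet = Dict[str, CellValue]


def cells_match(expected: Sheet, actual: Sheet) -> bool:
    """Return True only if every expected cell exists in `actual` with an equal value."""
    return all(
        addr in actual and actual[addr] == value
        for addr, value in expected.items()
    )


def pass_rate(results: List[Tuple[Sheet, Sheet]]) -> float:
    """Percentage of (expected, actual) pairs that match exactly."""
    if not results:
        return 0.0
    passed = sum(cells_match(exp, act) for exp, act in results)
    return 100.0 * passed / len(results)


# Illustrative data only; these are not real benchmark tasks.
tasks = [
    ({"B2": 10, "B3": 20}, {"B2": 10, "B3": 20}),  # exact match
    ({"C1": "Total"}, {"C1": "total"}),            # case differs, so it fails
    ({"D4": 3.5}, {}),                             # expected cell is missing
    ({"A1": 1}, {"A1": 1, "A2": 99}),              # extra cells are ignored here
]
print(pass_rate(tasks))  # 50.0
```

In this sketch a task either passes or fails as a whole, which is why near-misses such as a case difference count as failures. Whether the real evaluator ignores cells outside the expected answer range, as this sketch does, is an assumption.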

What does verified evaluation mean?

Verified means we were tested through a standardized API with methods anyone can reproduce. Self-reported results let companies use their own, potentially inconsistent methods. Our public API lets researchers independently verify the 59.25% score.

Can other systems obtain verified benchmarks?

Yes. Microsoft, OpenAI, and Anthropic can submit API access for standardized testing. We think transparent benchmarks help everyone by showing true capabilities and enabling fair comparisons.

What makes Excel automation challenging for general AI?

Excel automation requires understanding pivot tables, named ranges, and cross-sheet references, as well as generating correct formulas in Excel's own formula language. It also requires handling messy data and vague user questions. General LLMs lack this specialized knowledge, resulting in performance gaps of 13-16 percentage points.
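
The gap figure can be checked against the leaderboard scores reported earlier in this post. The short Python sketch below does the subtraction. The `scores` dictionary simply restates the leaderboard. The post does not say which entries the 13-16pp range refers to; reading it as the ChatGPT Agent w/ .xlsx and Claude Files Opus 4.1 rows is an assumption.

```python
# Scores in percent, copied from the leaderboard table in this post.
scores = {
    "Shortcut.ai": 59.25,
    "Copilot in Excel (Agent Mode)": 57.2,
    "ChatGPT Agent w/ .xlsx": 45.5,
    "Claude Files Opus 4.1": 42.9,
    "ChatGPT Agent": 35.3,
}

baseline = scores["Shortcut.ai"]

# Gap in percentage points between Shortcut and every other entry.
gaps = {
    name: round(baseline - score, 2)
    for name, score in scores.items()
    if name != "Shortcut.ai"
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} pp")
```

Under that assumption, the gaps for ChatGPT Agent w/ .xlsx and Claude Files Opus 4.1 come to 13.75 and 16.35 percentage points, consistent with the range quoted above.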

Will Shortcut work on my specific Excel files?

Shortcut handles any Excel file format and structure. It is trained on real-world spreadsheets with irregular layouts, multiple tables per sheet, complex cross-references, and mixed content. The benchmark tests these exact scenarios, so verified performance indicates strong capability across diverse file types.

How often does Shortcut improve?

We continuously improve Shortcut based on internal and external benchmarking. Our research lab is committed to improving Shortcut's performance daily, allowing rapid iteration on spreadsheet understanding and formula generation. Verified benchmarks let you track our progress objectively over time.