One Cell Wrong

Task 59129 asks an agent to fix a SUMPRODUCT formula that calculates a company's month-over-month headcount. Claude gets 11 of the 12 months right, but for one month an employee's departure date falls on a boundary, and its formula returns 21 instead of 20.

One cell wrong out of 12 means 91.7% accuracy. Result: fail. SpreadsheetBench Verified requires every cell in the answer region to match exactly.

ChatGPT doesn't get close. It writes a text description of the formula into the first cell and leaves the rest blank. 0 out of 12.

In either case, imagine reviewing the output by hand. You are looking at a 22-row spreadsheet with 12 answer cells. You would have to manually recount the items that were open on the last day of that specific month, cross-referencing open dates and close dates, to catch a single off-by-one error.

Tetra is trained to write a verification:

=COUNTIFS(open_date,"<="&EOMONTH(month,0),close_date,">="&EOMONTH(month,0))

In plain English: count items where the open date is on or before end-of-month and the close date is on or after end-of-month.

And the spreadsheet engine Tetra is built on yields immediate feedback: this one is wrong. It iterates; it fixes the boundary condition; and it passes. 12 out of 12. Unlike a human reviewer, the engine checks every cell, every time, with zero fatigue and zero probability of overlooking a single off-by-one in row 14.
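The boundary sensitivity is easy to see in miniature. Below is a sketch with hypothetical records; the only difference between the correct and buggy counts is `>=` versus `>` on the close date:

```python
from datetime import date

# Hypothetical records: (open_date, close_date); None means still open.
rows = [
    (date(2024, 1, 5), None),
    (date(2024, 2, 1), date(2024, 3, 31)),   # closes ON the boundary
    (date(2024, 2, 10), date(2024, 3, 15)),
]
eom = date(2024, 3, 31)  # end of the month being checked

# Correct: open on or before end-of-month AND closed on or after it
correct = sum(1 for o, c in rows if o <= eom and (c is None or c >= eom))

# Buggy: a strict '>' silently drops the row that closes exactly on the boundary
buggy = sum(1 for o, c in rows if o <= eom and (c is None or c > eom))
```

On this toy data the two counts differ by exactly one, which is the whole failure mode: a single comparison operator, invisible in a quick visual review, deterministic for an engine that recomputes the check.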


Overall Performance

We evaluated Tetra against the latest and most powerful agents from Anthropic and OpenAI on SpreadsheetBench Verified, the industry's gold-standard benchmark for agentic spreadsheet manipulation and a NeurIPS Datasets and Benchmarks spotlight paper.

The verified subset comprises 400 real-world tasks spanning formula computation, data filtering, lookups, text processing, and multi-sheet operations. Unlike the original 912-task set, the verified subset is expert-annotated and is intended as a stronger measure of spreadsheet manipulation performance.

Each task provides an input xlsx file, a natural-language instruction, and a golden file. Results are evaluated pass@1: every cell in the answer region must exactly match the golden file.


Tetra beats Opus 4.6 by 14 percentage points and GPT-5.4 by 16, and is now the top-ranked agent on the official SpreadsheetBench Verified leaderboard.

| Agent | Score | Org |
|---|---|---|
| Tetra | 94.25% | DealGlass |
| Shortcut.ai | 86.00% | Shortcut.ai |
| Opus 4.6 | 80.25% | Anthropic |
| GPT-5.4 | 78.25% | OpenAI |

*Shortcut.ai co-developed the verified subset with the SpreadsheetBench authors.*

Evaluation Methodology

Each output spreadsheet is compared cell-by-cell against the golden file using openpyxl.load_workbook(data_only=True). Values are normalized through type coercion and checked for exact equality. Formulas are resolved through the Python formulas library and LibreOffice headless; this resolution is strictly additive, with zero regressions. All statistical tests use McNemar's exact test (two-sided) for paired comparisons. Every run is single-pass (pass@1), with no retries and no cherry-picking. Full session logs, evaluation results, and per-task classifications are available at github.com/arthursolwayne/spreadsheet-agents.
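The normalize-and-compare step can be sketched as follows. Plain dicts stand in for worksheets here (in the harness the values come from openpyxl with `data_only=True`), and the helper names are illustrative, not the harness's actual API:

```python
# Plain dicts stand in for worksheets; in the real harness the values come
# from openpyxl.load_workbook(data_only=True). Helper names are illustrative.

def normalize(value):
    """Coerce a cell value so that 20, 20.0, and "20" compare equal."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip()
    try:
        return float(text)  # numeric strings compare as numbers
    except ValueError:
        return text

def region_matches(output, golden, region):
    """All-or-nothing: every cell in the answer region must match."""
    return all(normalize(output.get(c)) == normalize(golden.get(c)) for c in region)

golden = {"B2": 20, "B3": 22}
wrong = {"B2": 21, "B3": 22}       # a single off-by-one cell fails the task
right = {"B2": "20", "B3": 22.0}   # type coercion makes these match exactly
```

With this check, `region_matches(wrong, golden, ["B2", "B3"])` is False and `region_matches(right, golden, ["B2", "B3"])` is True, which is why 11-of-12 correct cells still scores as a failed task.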

Correcting behavioral defaults: teaching vs. computing

On the full 400 tasks, Opus 4.6 scores 80.25% and GPT-5.4 scores 78.25%, a gap of just eight tasks. McNemar's test on the paired results yields p = 0.45, far from significance.

But those eight tasks mask a more important finding. The two agents do not fail on the same tasks. 47 tasks are passed by Opus but failed by GPT-5.4, and 39 are passed by GPT-5.4 but failed by Opus. The agents have substantially different failure profiles. They just happen to fail on roughly the same number of tasks.
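The exact test on the discordant pairs is simple enough to sketch in pure Python (stdlib only; the 47/39 split comes from the paired results above):

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on discordant pairs:
    b = tasks one agent passes and the other fails, c = the reverse.
    Under the null, the b + c discordant tasks split Binomial(n, 0.5)."""
    n = b + c
    tail = sum(comb(n, k) for k in range(max(b, c), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Opus 4.6 vs GPT-5.4: 47 tasks one way, 39 the other
p = mcnemar_exact(47, 39)  # ~0.45, far from significance
```

Only the discordant tasks carry information here; the 314 tasks both agents pass (or both fail) cancel out of the test entirely, which is why 86 disagreements can still yield a non-significant aggregate gap.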

Given the same prompt, e.g. "read this spreadsheet, follow this instruction, save the result," the two agents adopt entirely different strategies.

Opus reads the xlsx with openpyxl, computes values in Python, and writes results directly to cells. Without any prompting toward this approach, it defaults to treating the task as a computation problem.

GPT-5.4 writes VBA macros, creates Excel formulas, and places explanatory text in cells. On a 45-task subset where we isolated this behavior, GPT-5.4 with no constraints scored 0/45, not because it lacked the fundamental capability to solve the problems, but because it expressed its answers in a format the evaluator cannot read.

The behavioral defaults are apparent: GPT-5.4 is oriented toward teaching, leaving behind a formula or macro that a human could inspect. Opus is oriented toward executing, computing the answer and writing the value.

Prompt ablation recovers 58 percentage points

We ran a 5-stage prompt ablation on those 45 tasks:

| Agent | Prompt Condition | Pass Rate |
|---|---|---|
| GPT-5.4 | Bare (Codex skill active) | 0.0% |
| | Bare (skill disabled) | 17.8% |
| | + "manipulate xlsx directly" | 44.4% |
| | + "use openpyxl" | 33.3% |
| | + strict (no VBA/formulas/text) | 57.8% |
| Opus 4.6 | Unconstrained | 80.0% |
| Tetra | Neurosymbolic | 95.6% |

*N = 45 tasks, all failed by GPT-5.4 in the unconstrained run due to formatting, not computation.*

These 45 tasks were identified from GPT-5.4's initial unconstrained run. Every one failed due to VBA macros, Excel formulas, or explanatory text in cells, not incorrect computation. All five stages ran on the identical 45 tasks. The constrained prompt recovers 57.8%: a 58-percentage-point swing from prompt engineering alone. On the full 400 tasks with the constrained prompt, GPT-5.4 reaches 78.25%, close to Opus's 80.25%. The gap between the two agents largely collapses once the behavioral default is corrected.

Neurosymbolic approach accounts for 14 points

Tetra is trained to validate answers with its neurosymbolic engine. When checks fail, the agent iterates.

On the full 400 tasks, Tetra scores 94.25% versus Opus 4.6 at 80.25%. McNemar's test yields p < 0.000001. This is the single result in our study that is unambiguously significant.

The engine does not hallucinate. When =SUM(A1:A10) evaluates to 42 and the agent's output says 43, that is an objective, deterministic signal. Prompt engineering alone cannot provide the same guarantee, because it operates on agent intentions while neurosymbolic verification operates on agent outputs.
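A minimal sketch of what such a check-and-iterate loop looks like, with plain Python arithmetic standing in for the engine's formula evaluation (all names are illustrative, not Tetra's actual API):

```python
def verify(cells, checks):
    """Return names of checks whose recomputed value disagrees with the
    value the agent wrote -- an objective, deterministic signal."""
    return [name for name, (target, recompute) in checks.items()
            if cells[target] != recompute(cells)]

def solve(propose, checks, max_iters=3):
    """Propose an answer, verify it, and iterate on failure."""
    for _ in range(max_iters):
        cells = propose()
        failed = verify(cells, checks)
        if not failed:
            break
        # in the real agent, the failed check names feed back into the model
    return cells, failed

# Toy run: the first draft writes 43 where SUM of the inputs evaluates to 42
drafts = iter([
    {"values": [4, 38], "total": 43},  # off by one -> check fails
    {"values": [4, 38], "total": 42},  # corrected  -> check passes
])
checks = {"sum_check": ("total", lambda c: sum(c["values"]))}
cells, failed = solve(lambda: next(drafts), checks)
```

The important property is that `verify` never consults the agent's reasoning, only its outputs: the recomputed value either equals the written value or it does not.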


Time vs. Accuracy

Accuracy gains trade off with the costs of scaling test-time compute.

Per-task execution time vs. pass rate on SpreadsheetBench Verified (N = 400).

Opus 4.6 averages 41 seconds per task with a median of 29 seconds; 90% of tasks complete under 75 seconds. GPT-5.4 averages 53 seconds with a median of 45 seconds and a longer tail reaching 340 seconds. Both distributions are right-skewed: a small number of complex tasks dominate compute time. Tetra averages 112 seconds per task with a median of 94 seconds. The additional time is spent in the neurosymbolic verification loop, where each task goes through multiple check-and-iterate cycles before producing a final answer.

Whether 2.7x the mean compute cost of Opus 4.6 is worthwhile depends on the cost of a wrong cell. In private credit diligence, a missed value in a 400-row model does not cost compute time; it costs a bad lending decision. For latency-sensitive applications where 80% accuracy is acceptable, Opus 4.6 is the right choice. For high-stakes workflows where every cell matters, the 112-second pipeline is a reasonable tradeoff.


Category-Level Performance

We classified all 400 tasks into 7 categories using LLM-based qualitative coding:

| Category | N | Tetra | Opus 4.6 | GPT-5.4 |
|---|---|---|---|---|
| Conditional Aggregation | 84 | 94% | 77% | 81% |
| Lookup & Cross-Reference | 77 | 94% | 83% | 79% |
| Data Filtering & Sorting | 74 | 97% | 74% | 68% |
| Formula & Specialized Calc | 54 | 94% | 80% | 81% |
| Classification & Conditional | 43 | 98% | 91% | 91% |
| Text & String Processing | 39 | 95% | 82% | 72% |
| Data Restructuring | 29 | 97% | 79% | 79% |

Tetra scores between 94% and 98% across all seven categories. The base agents show more informative variation:

Data Filtering & Sorting is the hardest category for both base agents (Opus 74%, GPT-5.4 68%) and the one where Tetra's uplift is largest (97%). Symbolic verification is most valuable when the agent needs to get sort orders, filter conditions, and row counts exactly right.

Classification & Conditional is the easiest for both agents (91% each). The tasks involve clear rules and binary outputs, leaving less room for formatting or precision errors.

Text & String Processing shows the widest gap between base agents: Opus 82% versus GPT-5.4 at 72%. Tasks requiring exact whitespace preservation, concatenation, and string formatting are where GPT-5.4's tendency to write explanatory text is most costly.


Limitations and Future Work

More agents

We tested two frontier agents. Gemini 3 Pro is the obvious next candidate. Google has invested heavily in structured data reasoning, and Gemini's function-calling behavior may diverge from both Opus's compute-first default and GPT-5.4's teach-first default. Open-source agents (Llama 4, Qwen 3, DeepSeek-R1) lack the RLHF-shaped behavioral defaults of commercial agents and may fail in entirely new ways.

Multi-run variance

Every number in this post is from a single Pass@1 run. We chose this to map one-to-one with our actual evaluation conditions, but we have no variance estimates as a result. During development, we identified at least 10 tasks where pass/fail varied across identical runs. A multi-run experiment would let us distinguish tasks the agent always solves, never solves, and sometimes solves, and better isolate where engineering improvements have the highest marginal value.
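The always/never/sometimes distinction is cheap to compute once multi-run records exist. A sketch with hypothetical per-task records:

```python
from collections import Counter

# Hypothetical pass/fail records for each task across k = 3 identical runs
runs = {
    "task_001": [True, True, True],     # always solves
    "task_002": [False, False, False],  # never solves
    "task_003": [True, False, True],    # sometimes solves: run-to-run variance
}

def classify(results):
    """Bucket a task by its behavior across repeated identical runs."""
    return "always" if all(results) else "never" if not any(results) else "sometimes"

buckets = Counter(classify(r) for r in runs.values())
```

The "sometimes" bucket is the actionable one: those are the tasks where a verification loop or a better prompt has the highest expected payoff, since the capability is demonstrably present in at least one run.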

Beyond SpreadsheetBench

SpreadsheetBench Verified is the most rigorous public benchmark available, but real-world spreadsheet work involves larger files, more ambiguous instructions, and multi-step reasoning chains that compound errors. We intend to evaluate on tasks closer to production conditions as suitable benchmarks become available.


Conclusion

Tetra scores 94.25% on SpreadsheetBench Verified. Opus 4.6 scores 80.25%. GPT-5.4 scores 78.25%.

The difference between Opus 4.6 and GPT-5.4 is 8 tasks out of 400, with McNemar's p = 0.45. They fail on different tasks but the same number of them. Most of GPT-5.4's failures trace to behavioral defaults, not capability. A constrained prompt recovers 58 percentage points on the affected subset. Without prompt specifications, aggregate scores on this benchmark are neither reproducible nor comparable.

The difference between Tetra and Opus 4.6 is 56 tasks out of 400, with p < 0.000001. This holds across all seven task categories. The neurosymbolic architecture catches errors that prompt engineering alone cannot reliably prevent.

Whether the additional compute is worthwhile depends on the cost of a wrong cell. In private credit diligence, a missed value in a 400-row model does not cost compute time; it costs a bad lending decision. Spreadsheets can check their own arithmetic, and that property is what makes the 14-point gap durable. We are working to extend the same principle to every document in the dataroom.