We asked nine venture investors which spreadsheet task eats the most of their time. Eight said cap tables, pref-stack waterfalls, or a model derived from one of those.

Table 1 · Investor pain points

| Role      | YoE | Pain point |
|-----------|-----|------------|
| Principal | 10  | "All modeling derived from cap tables" |
| Associate | 7   | Pref stack liquidation / waterfall analysis |
| Associate | 5   | Pref stack modeling, customer revenue rollups |
| Associate | 4   | Exit analysis, pref stack at liquidation |
| Associate | 3   | Cap tables and waterfalls, especially with SAFEs |
| Associate | 3   | Connecting accounts in waterfall formulas |
| Analyst   | 2   | Cap tables, ownership modeling (ESOP, valuation, venture debt, SAFEs) |
| Analyst   | 2   | "Pref stack / waterfall" |
| Analyst   | 1   | Trust + hallucination concerns |

Respondents anonymized to protect candor; roles and years of experience (YoE) are accurate.

The pattern was consistent. Cap tables are the central artifact of every priced round, every diligence cycle, every fund-return calculation. They're also where deal teams burn hours, even days.

The public benchmark for spreadsheet agents, SpreadsheetBench Verified, scores agents on 400 atomic operations averaging about 2 minutes each. We needed something closer to actual venture diligence work: one dense, end-to-end deal model, built from scratch, with formulas wired all the way through.

We constructed a synthetic Series F deal, defined 167 ground-truth output cells, and ran three frontier agents (Tetra, Opus 4.7, GPT-5.5) against it ten times each. We measured strict cell accuracy, wall-clock time (averaging 23 minutes across runs), and audit-readiness: whether the output is a live model that recomputes when you change an input, or a static report masquerading as one.

Acephalt is partnering with DealGlass to embed Tetra as the spreadsheet agent for VC and family office due diligence, via a bespoke API. We conducted this study and concluded that Tetra wins on three margins: highest strict accuracy, the only audit-ready output, and the strongest performance on the structural categories that matter most for cap-table modeling. Its neurosymbolic spreadsheet engine dominates the openpyxl strategy used by Opus 4.7 and GPT-5.5.


Two formulas, or 260,000

Take a single output cell from the deal scenario. At the $700M exit, what fraction of the company does our $10M check need to own to clear a 3× return?

The answer is 2.10%.

Tetra writes the cell as =$C$7/C14. Target proceeds divided by F-class waterfall residual. Two-formula trace. Pass.
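
In symbols, that trace is one division (a restatement of the cell above with the 3× target spelled out; nothing here is new to the model):

$$
f_{\text{required}} = \frac{\text{target proceeds}}{\text{F-class waterfall residual}} = \frac{3 \times \$10\text{M}}{\text{residual at the } \$700\text{M exit}}
$$

In Tetra's layout, the numerator lives in $C$7 and the denominator in C14; the scored cell divides one by the other and does nothing else.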

GPT-5.5 builds a 130-row × 2,121-column iteration ladder: 259,920 formulas. It collapses the ladder with INDEX($AG$3:$AG$130, MATCH(1, $HE$3:$HE$130, 0)) to pick the row matching the target condition. The arithmetic inside the table is internally consistent, but the row it lands on returns 9.96%. Nearly five times the truth.

A $10M check sized against an ownership target nearly five times too high moves a fund's allocation decision in the wrong direction. That is the cost of getting one cell wrong.

Verifying makes it worse. With Tetra's two-formula chain, you trace the dependency in two seconds. With GPT-5.5's 259,920-formula model, you can't. There is no "verify cell-by-cell" path through 260,000 cells. The analyst has two choices: trust the output, or rebuild the model.

The question this benchmark actually answers: which AI produces output an analyst can take to a principal without rebuilding it first?


What we measured

What the agent gets, and what we score:

Inputs
  • cap_table.xlsx (Carta export, four sheets: Stakeholders, Securities, Notes, RoundTerms)
  • prompt.md (deal terms)

Outputs (scored)
  • Assumptions
  • Pro-Forma
  • Waterfall (5 exits)
  • Sensitivity (3×3 grid)
  • Return-Solver

167 scored cells in total.
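
For concreteness, a minimal sketch of what consuming the input looks like; the sheet names come from the spec above, while the file path and the choice of pandas are our illustration, not part of the harness:

```python
import pandas as pd

# Load the four sheets of the Carta export, keyed by sheet name.
# Sheet names are from the input spec; the file path is hypothetical.
sheets = pd.read_excel(
    "cap_table.xlsx",
    sheet_name=["Stakeholders", "Securities", "Notes", "RoundTerms"],
)

stakeholders = sheets["Stakeholders"]  # who holds what
securities = sheets["Securities"]      # share classes and preferences
notes = sheets["Notes"]                # convertible instruments
round_terms = sheets["RoundTerms"]     # Series F terms
```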

The deal: a synthetic Series F with multi-class anti-dilution, hybrid convertible notes, a secondary tender, and a seven-class preference stack.

We tracked 167 specific output values across pro-forma ownership, anti-dilution ratios, convertible note conversion, waterfall dollars at each exit, sensitivity-grid outputs, and return-solver percentages.

Then we ran three AI systems against it, ten runs each: Tetra, Opus 4.7, and GPT-5.5. Opus 4.7 and GPT-5.5 ran on their highest reasoning settings.

We also tested a fourth tool, Shortcut from Fundamental Research Labs. It was excluded from this analysis because its backend failed on roughly 60% of attempts (V8 heap crashes, 30-minute poll timeouts, intermittent 502s); we landed n=10 valid runs only after 33 launch attempts. Production diligence work doesn't tolerate that.

Methodology, scoring harness, and reproducibility instructions are open-source at github.com/arthursolwayne/cap-table-bench.


Overall performance

Tetra leads Opus 4.7 by 13.5pp and GPT-5.5 by 23pp on mean strict accuracy across 10 runs.

Strict accuracy: the percentage of the 167 cells that match canonical truth within ≤0.1pp for percentages and within ≤0.1% relative for dollar amounts. The threshold is set so that passing means an analyst would sign off without independent verification.
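
A minimal sketch of that pass/fail rule as we read it (our illustration, not the published harness; we assume percentage cells are stored as fractions, so 0.1pp is an absolute delta of 0.001):

```python
def cell_passes(expected: float, actual: float | None, kind: str) -> bool:
    """Strict-accuracy check for one scored cell.

    kind "percentage": pass within <=0.1pp absolute (values as fractions).
    kind "dollars":    pass within <=0.1% relative.
    """
    if actual is None:                  # missing or unreadable cell never passes
        return False
    if kind == "percentage":
        return abs(actual - expected) <= 0.001
    if kind == "dollars":
        return abs(actual - expected) <= 0.001 * abs(expected)
    raise ValueError(f"unknown cell kind: {kind!r}")

# The return-solver cell from earlier: truth 2.10%, GPT-5.5's answer 9.96%.
assert cell_passes(0.0210, 0.0211, "percentage")      # off by 0.01pp: pass
assert not cell_passes(0.0210, 0.0996, "percentage")  # off by 7.86pp: fail
```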

No engine breaks 75% on its best run. The deal is hard by design, hard the way real Series F modeling is hard: coupled fixed points, multi-class anti-dilution, hybrid convertibles, secondary tender. The strict tolerance is also tight: the ±0.1pp band around a 19% ownership value spans $5M on a $2.5B post-money.

The Tetra–Opus 4.7 gap is 13.5pp. At n=10 per engine that's directionally robust but statistically borderline (we'd want n=20 per engine to firm it up). The Tetra–GPT-5.5 gap is 23pp and statistically significant under standard tests (Cohen's d = 1.45, p = 0.019 after multiple-comparisons correction).
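
The write-up reports Cohen's d and a corrected p-value without naming the procedure; one standard reading, sketched here as an assumption (two-sided Welch t-test, Bonferroni correction across the two headline comparisons, made-up run vectors):

```python
import numpy as np
from scipy import stats

def compare(runs_a: np.ndarray, runs_b: np.ndarray, n_comparisons: int = 2):
    """Welch t-test plus pooled-SD Cohen's d, with Bonferroni correction."""
    _, p = stats.ttest_ind(runs_a, runs_b, equal_var=False)  # two-sided Welch
    pooled_sd = np.sqrt((runs_a.var(ddof=1) + runs_b.var(ddof=1)) / 2)
    d = (runs_a.mean() - runs_b.mean()) / pooled_sd
    return d, min(1.0, p * n_comparisons)

# Hypothetical per-run strict-accuracy vectors, NOT the study's data.
tetra = np.array([0.55, 0.61, 0.48, 0.52, 0.58, 0.50, 0.63, 0.47, 0.56, 0.54])
gpt55 = np.array([0.08, 0.59, 0.21, 0.35, 0.12, 0.44, 0.30, 0.18, 0.52, 0.25])
d, p_corrected = compare(tetra, gpt55)
```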

What matters more than the absolute numbers is the ratio of accuracy to architecture.


Time vs accuracy

Each dot is one run. All three engines run in roughly the same 13–46 minute window. Accuracy is what differs. The dotted line on the left is SpreadsheetBench's 1.9-minute median for context.

All three engines work on the same problem with the same input. Compute time is comparable, but accuracy is not.

SpreadsheetBench Verified, the public benchmark for spreadsheet agents, averages about 2 minutes per task. Our deal takes 20+ minutes per run. The structure the agent must construct is dense: 167 coupled cells, multi-class anti-dilution, hybrid convertibles solved as a fixed point, waterfall election as a Nash equilibrium across seven classes.
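
The text names the fixed-point mechanism without the mechanics, so as illustration: one common form is a discounted convertible note under the pre-money conversion method, where the price per share depends on a share count that already includes the note's shares. A minimal sketch, with every term and number hypothetical:

```python
def solve_note_conversion(pre_money: float, pre_round_shares: float,
                          principal: float, discount: float,
                          tol: float = 1e-6, max_iter: int = 200):
    """Iterate the circular note conversion to a fixed point:
    price per share -> note shares -> price per share -> ..."""
    note_shares = 0.0
    for _ in range(max_iter):
        pps = pre_money / (pre_round_shares + note_shares)  # circular dependency
        nxt = principal / (pps * (1.0 - discount))          # shares at a discount
        if abs(nxt - note_shares) < tol:
            return nxt, pps
        note_shares = nxt
    raise RuntimeError("conversion fixed point did not converge")

# Hypothetical terms: $2B pre-money, 100M existing shares, $10M note, 20% discount.
note_shares, pps = solve_note_conversion(2e9, 100e6, 10e6, 0.20)
```

An agent writing this as live formulas has to encode the same loop (iterative calculation or a closed-form rearrangement); an agent that hardcodes the converged value produces a cell that goes stale the moment an input moves.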

GPT-5.5's variance is the giveaway. Within the same 13–25 minute window, accuracy swings from 7.8% to 59.3% across runs. Some runs produce sparse or unusable workbooks; others build elaborate ~373,000-formula models that don't reliably match canonical truth. Same compute budget, very different output.

Tetra is the only engine that finishes consistently in the 22–46 minute window with every run scoring above 25%. The variance is bounded.


Audit-readiness

Accuracy doesn't wholly capture what we care about. The harder question: when the analyst opens the output and changes an assumption, does the model recompute?

We perturbed each engine's output workbook by changing the Series F primary raise from $30M to $33M. We let the workbook recalculate through whatever machinery it supports. We checked which output cells updated to the correct new values.

Above the diagonal means the workbook keeps recomputing past the strict-pass threshold. Below it means the values look right but go stale when an input changes.
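
Mechanically, a perturbation check like this needs a recalculation engine outside Excel. One way to sketch it is with the open-source Python `formulas` package, which parses an xlsx into an executable formula graph; the file name, sheet names, and cell addresses below are hypothetical stand-ins for an agent's output, and the API details should be checked against the package docs:

```python
import formulas

# Parse the agent's workbook into an executable formula graph.
model = formulas.ExcelModel().loads("agent_output.xlsx").finish()

# Recalculate with the Series F primary raise perturbed from $30M to $33M.
# The package keys cells as "'[FILE]SHEET'!CELL", uppercased.
solution = model.calculate(
    inputs={"'[AGENT_OUTPUT.XLSX]ASSUMPTIONS'!B2": 33_000_000}
)

# Read back a scored output cell and compare it to the perturbed ground truth.
print(solution["'[AGENT_OUTPUT.XLSX]WATERFALL'!D10"])
```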

Opus 4.7's 0 of 10 means that in none of its outputs did the input value exist as a named or labeled cell we could write to. The raise lived somewhere in the workbook, but it wasn't reachable as an addressable input. The 38.6% figure is the fraction of cells that happened to be passthrough: values identical between the canonical and perturbed truth, which never needed to recompute.

GPT-5.5 was more interesting. The ~373,000 formulas in its outputs aren't wired to the boundary inputs; they're wired to internal scratch cells. On the rare runs where we could perturb the raise input, recompute hit ~25%. The elaborate model isn't a live financial model.

Tetra was the only engine where 6 of 10 runs exposed enough of the input boundary to even attempt perturbation. On those runs, recompute was double GPT-5.5's. Every cell traces back to an assumption an analyst can read and modify.

That's audit-readiness. Output behaves like a real Excel model because, structurally, it is one.


Category-level performance

Per-category strict accuracy. Tetra's lead concentrates on waterfall, sensitivity, and return-solver, the categories where structural model construction matters; on simpler primitives, Opus 4.7 is competitive.

The 167 cells split across nine categories. Tetra's lead concentrates on the categories where structural model construction matters: waterfall dollars, waterfall elections, sensitivity grid, return-solver. On simpler primitives like individual ownership percentages, single anti-dilution ratios, and single convertible note conversion, Opus 4.7 is competitive.

Tetra's edge shows up in constructing the model.


Limitations

Single task family. One synthetic Series F deal with 167 ground-truth cells. Cap-table modeling spans regimes we did not test (small priced rounds, secondary tenders, bridge financings, restructurings). For broader spreadsheet manipulation coverage, SpreadsheetBench Verified remains the right reference: 400 atomic tasks, expert-annotated, designed for exactly that. This benchmark trades breadth for depth on one regime.

Regime sensitivity. We ran the same engines on contrast scenarios: a templated workbook (named cells already laid out, engine fills values) and small simple cap tables. The agent advantage narrows or inverts on both. When the structure is provided or trivial, foundation-model reasoning closes the gap. The agent earns its overhead only in dense, build-from-scratch regimes.

Strict-accuracy distributions on two scenarios at n=10 each. Left: the dense from-scratch deal. Right: a templated task. The agent edge collapses on templates.

Sample size. n=10 per agent on the headline scenario. The Tetra–Opus 4.7 gap (13.5pp) is directionally robust but statistically borderline. The Tetra–GPT-5.5 gap (23pp) is significant under standard tests. n=20 would firm the borderline result.

Single-pass evaluation. Open-loop runs only. Real diligence involves human-in-the-loop correction cycles. We measure first-pass output, not the analyst-augmented final.

Frozen versions. Tetra-Beta-2, Opus 4.7, GPT-5.5. Model updates may shift these numbers.


Cross-validation against SpreadsheetBench

DealGlass's published SpreadsheetBench Verified result is 94.25%: 14 percentage points ahead of Anthropic's Opus 4.6 (80.25%) and 16 ahead of OpenAI's GPT-5.4 (78.25%). The full leaderboard breakdown is on the Tetra page.

Our result reproduces something close to that 14pp gap on a much harder, much more domain-specific scenario: Tetra leads Opus 4.7 by 13.5pp on this benchmark.

Same gap, two independent benchmarks, structurally different task families. The reproducibility is the signal. The absolute numbers differ (94% on SpreadsheetBench, roughly 50% on ours) because the tasks are different: SpreadsheetBench is 400 atomic operations averaging about 2 minutes each; ours is one dense end-to-end deal model that takes 20–25 minutes.

We built this benchmark to address what SpreadsheetBench doesn't: one dense, end-to-end deal model built from scratch, with formulas wired all the way through and scored strictly against canonical truth.

The takeaway from running both: the engine ordering holds. What changes on a harder, more domain-specific test is how much harder it becomes to win at all. No engine cracks 75% on our benchmark, even on its best run. That's where AI cap-table modeling stands in April 2026.


Acephalt × DealGlass

Acephalt and DealGlass have partnered to bring AI financial modeling to teams that cannot afford error.

Acephalt is integrating with a bespoke DealGlass API to embed Tetra as the spreadsheet engine inside its AI due-diligence product. For the analyst working through a Series F with a memo due that night, that means a model that recomputes when an assumption changes and a dependency chain that can be traced cell by cell.

We tested four AI tools against the cap table modeling problem. Three produce outputs that look like models but do not behave like them. Tetra is the one where the behavior matches the appearance.


Conclusion

Cap tables are the central artifact of every priced round, every diligence cycle, every fund-return calculation. The deal teams who underwrite them deserve tools that match the stakes.

Of the three frontier spreadsheet agents tested on a realistic Series F deal model, only Tetra produced output an analyst could take to a principal without rebuilding it. Highest strict accuracy. Only audit-ready output. Strongest performance on the structural categories that matter most.

If you're a fund or family office evaluating AI tools for diligence work, we're happy to walk through the methodology in detail.