I personally do not care if my AI programs do their reasoning in latent space or code. I want results.

I am currently very intrigued by LongCoT, a new benchmark designed to push the limits of what today’s LLMs can do. Part of the original intention behind it was to create something that would be both challenging for LLMs and relatively independent of the details of the harness.

My recent results using DSPy.RLM caused a bit of drama with some of the leaderboard owners, and led to the creation of a new tools-enhanced leaderboard. I understand the academic value of “pure latent space” results without tools, but it just isn’t interesting to me… I want my agents to have tools.

So I set out to give the LLMs their desire path: the Python tools they kept trying to use in my prior benchmarking experiments.
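
In spirit, the wiring is just “give the agent a Python tool and let it iterate.” Here is a minimal sketch of that idea; I’m using dspy.ReAct with a toy run_python tool as a stand-in, since the actual DSPy.RLM setup (and its sandboxing) is more involved, and the model string is only a placeholder:

```python
import dspy

# Placeholder model name; point this at whatever endpoint you are benchmarking.
dspy.configure(lm=dspy.LM("anthropic/claude-opus-4-7"))

def run_python(code: str) -> str:
    """Toy tool: execute the model's Python and return what it printed.
    Illustration only; in a real harness you want an actual sandbox, not exec()."""
    import contextlib
    import io
    buf = io.StringIO()
    scope: dict = {}
    with contextlib.redirect_stdout(buf):
        exec(code, scope)  # unsafe outside a sandbox
    return buf.getvalue() or repr(scope.get("answer"))

# ReAct stands in for the RLM loop here: the model plans, writes Python,
# reads the tool output, and iterates until it commits to an answer.
solver = dspy.ReAct("question -> answer", tools=[run_python], max_iters=8)

pred = solver(question="Is Qd8+ mate in the given position? ...")
print(pred.answer)
```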

This worked surprisingly well with DSPy.RLM and Opus 4.7, which achieved a new SOTA on LongCoT-Mini.

Opus 4.7 + DSPy.RLM → 377/500 (75.4%) on LongCoT-Mini — new top of the leaderboard, and a clear jump over the Sonnet 4.5 + DSPy.RLM 45.4% I posted in April.

| Mini | Opus 4.7 + RLM | Sonnet 4.5 + RLM | Sonnet 4.5 vanilla |
| --- | --- | --- | --- |
| chess | 98/98 | 85/100 | 0/100 |
| logic | 101/101 | 106/110 | 0/110 |
| chemistry | 66/98 | 31/100 | 13/100 |
| cs | 71/97 | 4/100 | 0/100 |
| math | 41/77 | 6/95 | 0/95 |
| total (official /500) | 377/500 (75.4%) | 227/500 (45.4%) | 13/500 (2.6%) |

The new runs were scored on the 471-question working set, after the LongCoT team audited out 29 Mini questions as unsolvable. The official /500 totals here count those audited-out rows as wrong, matching the denominator used for the Sonnet baselines.
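
If you want to reproduce the scoring convention, it is just this (a small sketch; the per-domain counts are the Opus column from the table above):

```python
# Per-domain (solved, attempted) on the 471-question working set for Opus 4.7 + RLM.
opus_rlm = {
    "chess": (98, 98),
    "logic": (101, 101),
    "chemistry": (66, 98),
    "cs": (71, 97),
    "math": (41, 77),
}

solved = sum(s for s, _ in opus_rlm.values())      # 377
attempted = sum(a for _, a in opus_rlm.values())   # 471

# Official totals keep the original 500-question denominator,
# i.e. the 29 audited-out questions count as wrong.
print(f"{solved}/500 ({solved / 500:.1%}) official, "
      f"{solved}/{attempted} ({solved / attempted:.1%}) on the working set")
```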

The paper splits the five domains into two classes: implicit (Logic, Chess, CS), where the dependency structure can be externalised to code, and explicit compositional (Math, Chemistry), where it can’t. The headline claim is that even with code execution enabled, RLM lifts the implicit class but leaves the compositional class near zero — direct quote: “explicit compositional domains (Math, Chemistry) remain at zero.” My April Sonnet run replicated that shape (math 6/95, hardest cs templates 0/75). Opus + RLM gets math up to 41/77 and cs to 71/97. Not zero.

Special thanks to Prime Intellect for sponsoring inference on this experiment — I promise I will publish my LongCoT environments soon. Even with that support, I ran out of credits quickly, so I switched to OpenAI Codex CLI on the latest GPT-5.5 at “xhigh” reasoning to put my $200/mo sub to work.

Codex CLI + gpt-5.5 “xhigh” → 398/500 (79.6%) on LongCoT-Mini — +21 rows over Opus.

| Mini | Codex 5.5 xhigh | Opus 4.7 + RLM |
| --- | --- | --- |
| chess | 98/98 | 98/98 |
| logic | 100/101 | 101/101 |
| chemistry | 78/98 | 66/98 |
| cs | 67/97 | 71/97 |
| math | 55/77 | 41/77 |
| total (official /500) | 398/500 (79.6%) | 377/500 (75.4%) |

Two scaffolds, two models, similar bottom line. Codex’s persistent in-sandbox Python loop is grinding harder on chemistry and math, while DSPy.RLM holds a small cs edge. The LongCoT-Mini scoreboard is starting to feel more like a measure of how the agent’s tool loop is wired than of which frontier endpoint is behind it.

Mini is the easy slice. The full benchmark — medium + hard, ~2000 questions, where the dependency DAGs grow long enough to actually bite — is the real test of the paper’s compositional-walls claim. Over the weekend I let Codex loose on it; this time it took multiple Codex subscriptions to finish.

Codex CLI gpt-5.5 xhigh on full LongCoT: 1446/1995 (72.5%) — about 3× the top of the live longcot.ai Open Harness leaderboard, where the LongCoT team’s own GPT 5.2 + rlm run holds #1 at 25.12%. (My April Qwen 3.5 27B + DSPy.RLM run is #2 at 22.18%.)

| Full LongCoT · Codex 5.5 xhigh | class | medium | hard | total | Open-Harness #1 (GPT 5.2 + rlm) |
| --- | --- | --- | --- | --- | --- |
| logic | implicit | 187/195 (95.9%) | 165/199 (82.9%) | 89.3% | 68.3% |
| cs | implicit | 150/150 (100.0%) | 210/250 (84.0%) | 90.0% | 26.7% |
| chess | implicit | 92/150 (61.3%) | 200/250 (80.0%) | 73.0% | 30.6% |
| math | compositional | 110/150 (73.3%) | 168/250 (67.2%) | 69.5% | 0.0% |
| chemistry | compositional | 114/200 (57.0%) | 50/200 (25.0%) | 41.0% | 0.0% |
| total | | 77.3% | 69.0% | 72.5% | 25.12% |
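
The per-domain totals fall straight out of summing the medium and hard counts; here is a quick sketch using the numbers from the table above (the implicit/compositional rollups at the end are my own arithmetic on the same counts, not figures from the leaderboard):

```python
# (solved, attempted) per domain for the medium and hard splits, from the table above.
full = {
    "logic":     {"class": "implicit",      "medium": (187, 195), "hard": (165, 199)},
    "cs":        {"class": "implicit",      "medium": (150, 150), "hard": (210, 250)},
    "chess":     {"class": "implicit",      "medium": (92, 150),  "hard": (200, 250)},
    "math":      {"class": "compositional", "medium": (110, 150), "hard": (168, 250)},
    "chemistry": {"class": "compositional", "medium": (114, 200), "hard": (50, 200)},
}

# Per-domain totals (reproduces the "total" column).
for domain, row in full.items():
    s = row["medium"][0] + row["hard"][0]
    a = row["medium"][1] + row["hard"][1]
    print(f"{domain:9s} ({row['class']}): {s}/{a} = {s / a:.1%}")

# Class-level rollups across medium + hard.
for cls in ("implicit", "compositional"):
    s = sum(r["medium"][0] + r["hard"][0] for r in full.values() if r["class"] == cls)
    a = sum(r["medium"][1] + r["hard"][1] for r in full.values() if r["class"] == cls)
    print(f"{cls}: {s}/{a} = {s / a:.1%}")
```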

Math and Chemistry, the compositional domains the paper said would “remain at zero” even with code execution, come in at 69.5% and 41.0%, and cs medium goes 150/150. With the full LongCoT run in hand, I think we can clearly state that the paper’s compositional walls don’t hold. They aren’t walls at all; with a stronger model and a more aggressive tool loop, they’re just programs. The wall isn’t compositional reasoning, it’s the harness used to measure it.

I think this is clear evidence of something many people have been saying for the past ~18 months: it’s not (just) the model, it’s the harness. It is also a strong demonstration that “coding agents” are useful for long-horizon tasks that don’t necessarily present themselves as coding problems.