After the LongMemEval experiments in February, I’ve been hungry to find a better benchmark that will actually showcase the power of recursive language models (RLMs) on useful tasks. LongCoT is exactly that: a benchmark built to stress-test long-horizon reasoning.
As soon as I saw the benchmark, I aimed DSPy.RLM at it. (Even before reading the paper.)
The setup
- Model: `claude-sonnet-4-5` for both conditions. Same `max_tokens=64000`, same judge models, same prompts.
- RLM: stock `dspy.RLM` 3.1.3, `max_iterations=50`, default Pyodide REPL, `sub_lm=lm`.
- Vanilla: raw Anthropic SDK, single user message, no tools. Leaderboard shape.
- Dataset: LongCoT-Mini, all 500 questions (easy slices across logic / cs / chemistry / chess / math).
The entire RLM surface area is one dspy Signature:
```python
class LongCoTSolve(dspy.Signature):
    """Solve a LongCoT problem.

    The `prompt` already contains the full problem statement and the answer
    format requirement (always ends with `solution = ...`). Reason through
    the problem with the available REPL, then return the final response —
    which MUST contain the literal `solution = ...` line as instructed.
    """

    prompt: str = dspy.InputField(desc="Full LongCoT problem prompt with answer-format instructions")
    response: str = dspy.OutputField(desc="Full final response containing the required `solution = ...` line")
```
The headline
| | Vanilla | RLM |
|---|---|---|
| Correct | 13 / 500 | 227 / 500 |
| Accuracy | 2.6% | 45.4% |
| Captured cost | $31 | $621 |
On the full 500-row overlap: 219 wrong→right flips, 5 right→wrong, 268 both-wrong, 8 both-right. The vanilla 2.6% lines up with the published Sonnet 4.5 Mini number, so the control is calibrated, not sandbagged.
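The flip bookkeeping above is just a pairing pass over the two runs. A sketch, assuming each run is a dict from question id to judged-correct (my shape, not the harness's):

```python
from collections import Counter

def flip_counts(vanilla: dict[str, bool], rlm: dict[str, bool]) -> Counter:
    """Count per-question transitions between two runs keyed by question id.

    Assumed shapes: each dict maps question id -> judged-correct bool.
    Only questions present in both runs (the overlap) are scored.
    """
    counts: Counter = Counter()
    for qid in vanilla.keys() & rlm.keys():
        v, r = vanilla[qid], rlm[qid]
        if not v and r:
            counts["wrong->right"] += 1
        elif v and not r:
            counts["right->wrong"] += 1
        elif v and r:
            counts["both-right"] += 1
        else:
            counts["both-wrong"] += 1
    return counts
```

The four buckets partition the overlap, so they should always sum back to 500 — a cheap sanity check on the harness.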
Per-task
| Task | RLM | Vanilla |
|---|---|---|
| Dungeon · Packaging · Hanoi · Wizards · TrapezoidCounting · Sudoku | 15/15 each (💯) | 0/15 each |
| BlocksWorld | 9/10 | 0/10 |
| Sokoban | 7/10 | 0/10 |
| Chess | 85/100 | 0/100 |
| Chemistry | 31/100 | 13/100 |
| cs / DistMem | 4/25 | 0/25 |
| cs / MaxFlow-MinCut + Hindley-Milner | 0/75 | 0/75 |
| math | 6/95 | 0/95 |
The pattern is coherent: RLM crushes anything whose dependency structure externalises cleanly to code. The orchestrator writes a short Python program, the REPL runs it, the answer comes out. Logic puzzles, Hanoi, Sudoku, chess with Pyodide’s chess module — all 💯 or near it.
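To make "externalises cleanly to code" concrete: a Hanoi instance collapses to a three-line recursion. This is the *kind* of program the orchestrator emits (my sketch, not an actual transcript):

```python
def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Classic Tower of Hanoi: return the move list for n disks, src -> dst."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)    # park n-1 disks on the spare peg
        + [(src, dst)]                 # move the largest disk
        + hanoi(n - 1, aux, src, dst)  # restack n-1 disks on top of it
    )

moves = hanoi(10)
print(len(moves))  # 2**10 - 1 = 1023 moves
```

Once a problem admits a program like this, correctness stops depending on the model's token-by-token reasoning at all; the REPL does the long-horizon work.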
The walls are the opposite picture. Hindley-Milner and MaxFlow-MinCut go 0/75 because the orchestrator can’t find a decomposition where subproblems can be usefully farmed out — exactly the “graph-structured dependencies” failure the paper calls out.
And math? The paper’s wall holds, at least for now and for Sonnet 4.5. 6/95 on Mini isn’t zero, but it’s terrible. Sonnet 4.5 × dspy.RLM replicates the paper’s math result on a different model and split.
What I think this means for the paper
The paper’s RLM discussion is genuinely thin — one paragraph, one figure, no dedicated table. With that as the bar, cross-model replication is useful:
- Logic, chess, CS wins: replicate and amplify. Same shape on a different frontier model.
- Math stays near zero: maybe? The model swap doesn’t rescue it. But going from 0/95 to 6/95, you can argue the lift is either modest (absolute) or infinite (relative), depending on which denominator you like.
- Chemistry lifts modestly (13 → 31), which is the only spot where I’d push back on the paper’s phrasing — but I’m both a chemist and RLM addict.
P.S. RLMs are expensive, and the 500-problem Mini is supposedly the easy subset of the full 2,500-problem set. So… who wants to fund the Opus 4.7 run?