After the LongMemEval experiments in February, I’ve been hungry to find a better benchmark that will actually showcase the power of recursive language models (RLMs) on useful tasks. LongCoT is exactly that: a benchmark built to stress-test long-horizon reasoning.

As soon as I saw the benchmark, I aimed DSPy.RLM at it. (Even before reading the paper.)

The setup

  • Model: claude-sonnet-4-5 for both conditions. Same max_tokens=64000, same judge models, same prompts.
  • RLM: stock dspy.RLM 3.1.3, max_iterations=50, default Pyodide REPL, sub_lm=lm.
  • Vanilla: raw Anthropic SDK, single user message, no tools. Leaderboard shape.
  • Dataset: LongCoT-Mini, all 500 questions (easy slices across logic / cs / chemistry / chess / math).

The entire RLM surface area is one dspy Signature:

```python
import dspy


class LongCoTSolve(dspy.Signature):
    """Solve a LongCoT problem.

    The `prompt` already contains the full problem statement and the answer
    format requirement (always ends with `solution = ...`). Reason through
    the problem with the available REPL, then return the final response —
    which MUST contain the literal `solution = ...` line as instructed.
    """

    prompt: str = dspy.InputField(desc="Full LongCoT problem prompt with answer-format instructions")
    response: str = dspy.OutputField(desc="Full final response containing the required `solution = ...` line")
```

The headline

|               | Vanilla  | RLM       |
|---------------|----------|-----------|
| Correct       | 13 / 500 | 227 / 500 |
| Accuracy      | 2.6%     | 45.4%     |
| Captured cost | $31      | $621      |

On the full 500-row overlap: 219 wrong→right flips, 5 right→wrong, 268 both-wrong, 8 both-right. The vanilla 2.6% lines up with the published Sonnet 4.5 Mini number, so the control is calibrated, not sandbagged.

Per-task

| Task                                                               | RLM             | Vanilla   |
|--------------------------------------------------------------------|-----------------|-----------|
| Dungeon · Packaging · Hanoi · Wizards · TrapezoidCounting · Sudoku | 15/15 each (💯) | 0/15 each |
| BlocksWorld                                                        | 9/10            | 0/10      |
| Sokoban                                                            | 7/10            | 0/10      |
| Chess                                                              | 85/100          | 0/100     |
| Chemistry                                                          | 31/100          | 13/100    |
| cs / DistMem                                                       | 4/25            | 0/25      |
| cs / MaxFlow-MinCut + Hindley-Milner                               | 0/75            | 0/75      |
| math                                                               | 6/95            | 0/95      |

The pattern is coherent: RLM crushes anything whose dependency structure externalises cleanly to code. The orchestrator writes a short Python program, the REPL runs it, the answer comes out. Logic puzzles, Hanoi, Sudoku, chess with Pyodide’s chess module — all 💯 or near it.
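To make that concrete, here is the shape of program the orchestrator ends up writing for a Hanoi-type question (my reconstruction of the style, not an actual transcript): the whole puzzle collapses into a few lines of exact computation.

```python
def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Classic recursive Tower of Hanoi: return the exact optimal move sequence."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)  # park n-1 disks on the spare peg
        + [(src, dst)]                     # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)  # restack the n-1 disks on top
    )


moves = hanoi_moves(4)
# len(moves) == 2**4 - 1 == 15; the REPL hands back the exact answer,
# with no token-by-token state simulation for the model to fumble.
```

Once the problem is in this form, correctness is a property of a 10-line program instead of a 10,000-token trace, which is exactly why these tasks saturate.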

The walls are the opposite picture. Hindley-Milner and MaxFlow-MinCut go 0/75 because the orchestrator can’t find a decomposition where subproblems can be usefully farmed out — exactly the “graph-structured dependencies” failure the paper calls out.

And math? The paper’s wall holds, at least for now, for Sonnet 4.5. 6/95 on Mini isn’t zero, but it’s terrible. Sonnet 4.5 × dspy.RLM replicates the paper’s math result on a different model and split.

What I think this means for the paper

The paper’s RLM discussion is genuinely thin — one paragraph, one figure, no dedicated table. With that as the bar, cross-model replication is useful:

  • Logic, chess, CS wins: replicate and amplify. Same shape on a different frontier model.
  • Math stays at zero: maybe? A model swap doesn’t rescue it. But going from a baseline of 0 to 6 out of 95, you can argue the lift is either negligible or infinite.
  • Chemistry lifts modestly (13 → 31), which is the only spot where I’d push back on the paper’s phrasing — but I’m both a chemist and an RLM addict.

P.S. - RLMs are expensive, and supposedly the 500-problem mini version is the easy subset of the full 2500-problem set. So…who wants to fund the Opus 4.7 run?