RLMs are SOTA on LongCoT

A few days ago I showed Sonnet 4.5 + dspy.RLM hitting 45.4% on LongCoT-Mini. Exciting results, but a bit pricey for my taste.

So I set out to see what might be possible with some very small models.

First, I wanted to run a 3×2 comparison matrix: plain LLM vs. RLM vs. dspy.RLM, for both Qwen 3 8B and the MIT OASYS RLM finetune.

I'll have to save the full analysis for another day (I wasn't able to get the finetuned model working), but the headline result is that on LongCoT-Mini, dspy.RLM takes Qwen 3 8B Instruct from literally 0/507 correct to 33/507 (6.5%). That would be #7 on the leaderboard, from an 8B model!

So my immediate next thought was: “what about Qwen 3.5 9B?” It hit 17.2% on LongCoT-Mini (3rd place), and was so cheap that I decided to run the full benchmark (all 2,500 questions)! (Now using Together AI via OpenRouter; I don’t think their endpoint is quantized, but I’m not 100% sure.)

Surprisingly, dspy.RLM with Qwen 3.5 9B is comfortably SOTA on the full LongCoT benchmark at 15.69%, outdoing GPT 5.2 by ~1.6x.
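As a quick back-of-the-envelope check on the numbers above (the GPT 5.2 score isn’t quoted here, so it’s only implied by the stated ~1.6x margin):

```python
# Sanity-check the percentages quoted in this post.

# Qwen 3 8B Instruct + dspy.RLM on LongCoT-Mini: 33 of 507 correct.
mini_acc = 33 / 507
print(f"{mini_acc:.1%}")  # 6.5%, matching the quoted figure

# Full LongCoT: 15.69% with a stated ~1.6x margin over GPT 5.2.
# That implies GPT 5.2 scores roughly 15.69 / 1.6, i.e. just under 10%.
implied_gpt = 15.69 / 1.6
print(f"~{implied_gpt:.1f}%")
```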

Now I was having too much fun, so I had to run Qwen 3.5 27B (this time served by Alibaba Cloud, again via OpenRouter)… and unsurprisingly, a new full-LongCoT king is crowned at 22.18%.

I’m really excited to finally have a meaningful benchmark that clearly demonstrates the power of RLMs. This deserves a much longer writeup, which I hope to post soon! (And I’m now running Qwen 3.6 35B-A3B at the suggestion of many folks on X.)