A few days ago I showed Sonnet 4.5 + dspy.RLM hitting 45.4% on LongCoT-Mini. Exciting results, but a bit pricey for my taste.
So I set out to see what might be possible with some very small models.
First, I wanted to run a 3×2 comparison matrix (plain LLM vs. RLM vs. DSPy.RLM) across both Qwen 3 8B and the MIT OASYS RLM finetune.
I'll have to save the full analysis for another day (I wasn't able to get the finetuned model working), but the meaningful result is that on LongCoT-Mini, DSPy.RLM takes Qwen 3 8B Instruct from literally 0/507 correct to 33/507 (6.5%). That would be #7 on the leaderboard, from an 8B model!
Ran Qwen3-8B (8.2B dense, open) on LongCoT-Mini.
— Raymond Weitekamp (@raw_works) April 17, 2026
Vanilla: 0/507.
dspy.RLM: 33/507 (6.5%).
Same model. Same weights. No fine-tuning. The scaffold is doing 100% of the lifting.
Context: leaderboard's smallest open MoE is GLM-4.7 at 358B total / 32B active params. Qwen3-8B is ~4x… https://t.co/uyEK00KTxJ
So my immediate next thought was: “what about Qwen 3.5 9B?” It hit 17.2% on LongCoT-Mini (3rd place), and was so cheap that I decided to run the full benchmark (all 2500 questions)! (Now using Together AI via OpenRouter; I don't think their endpoint is quantized, but I'm not 100% sure.)
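For anyone wanting to reproduce this kind of setup, here is a minimal sketch of how the OpenRouter routing might be wired up in DSPy. The model slug and the `dspy.RLM` call signature are my assumptions, not confirmed by this post; check the DSPy docs for the real API.

```python
import dspy

# Assumption: DSPy's LiteLLM-backed client accepts OpenRouter model
# strings; the exact slug for Qwen 3.5 9B here is illustrative only.
lm = dspy.LM(
    "openrouter/qwen/qwen3.5-9b-instruct",
    api_key="...",  # your OpenRouter API key
)
dspy.configure(lm=lm)

# Assumption: dspy.RLM follows the usual DSPy module convention of
# taking a signature string, like dspy.ChainOfThought does.
rlm = dspy.RLM("question -> answer")
pred = rlm(question="...")  # one LongCoT item
```

Provider routing (Together AI vs. Alibaba Cloud) would be pinned on the OpenRouter side rather than in this snippet.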
Surprisingly, DSPy.RLM with Qwen 3.5 9B is comfortably SOTA on the full LongCoT at 15.69%, outdoing GPT 5.2 by ~1.6x.
sorry it took me ~50 hrs! now i've got DSPy.RLM as SOTA on LongCOT (Full) by a very large margin, using...
— Raymond Weitekamp (@raw_works) April 18, 2026
...drumroll...
Qwen 3.5 9B!
👑 Qwen3.5-9B + dspy.RLM = 15.69% on LongCoT-full
🔥 ~1.6× GPT 5.2's 9.83% on the same slice! https://t.co/uyEK00KlIb
Now I was having too much fun, so I had to run Qwen 3.5 27B (this time through OpenRouter's Alibaba Cloud endpoint)…and unsurprisingly, a new LongCoT-full king is crowned at 22.18%.
happy sunday morning. a new LongCoT king is crowned.
— Raymond Weitekamp (@raw_works) April 19, 2026
👑Qwen3.5-27B-Instruct + dspy.RLM
yes that's right, a 27B model more than double GPT 5.2 by using recursive language models https://t.co/b659KY1JC9 pic.twitter.com/9FI9xiDTht
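The headline ratios quoted above are easy to sanity-check with a few lines of arithmetic (all numbers taken from the results in this post):

```python
# 33 correct out of 507 on LongCoT-Mini (Qwen3-8B + dspy.RLM):
mini_acc = 33 / 507
print(f"{mini_acc:.1%}")  # → 6.5%

# Qwen3.5-9B + dspy.RLM (15.69%) vs GPT 5.2 (9.83%) on LongCoT-full:
ratio_9b = 15.69 / 9.83
print(f"{ratio_9b:.2f}x")  # → 1.60x, matching the "~1.6x" claim

# Qwen3.5-27B + dspy.RLM (22.18%) vs GPT 5.2 (9.83%):
ratio_27b = 22.18 / 9.83
print(f"{ratio_27b:.2f}x")  # → 2.26x, i.e. "more than double"
```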
I’m really excited to finally have a meaningful benchmark that clearly demonstrates the power of RLMs. This deserves a much longer writeup, which I hope to post soon! (And I’m now running Qwen 3.6 35B-A3B at the suggestion of many folks on X.)