Reasoning models were the first clear proof that language model capability can scale with test-time compute. Recursive language models (RLMs) ask what the correct abstraction for spending that compute is.
The insight behind RLMs is obvious in hindsight: it is the direct marriage of two important axes of model capability — reasoning and tool use. This is more radical than it first sounds. RLMs collapse reasoning and tool use into a single inference abstraction: the model treats its own prompt as an environment it can inspect, slice, and recursively query. Context itself becomes the object of computation.
This post is my attempt to explain why RLMs matter. I define what an RLM actually is, place it in the short history of reasoning and tool use, walk through the ~6 months of empirical results that have quietly turned “RLM” from a benchmark trick into the next reasoning paradigm, flag the honest limitations, and point at a few places to start building.
What is an RLM?
A Recursive Language Model, as introduced by Zhang, Kraska, and Khattab, is an inference paradigm in which a language model treats its input prompt as an environment rather than a fixed string. The root LM is given a REPL in which the prompt is bound to a variable it can inspect, slice, and partition programmatically. When it decides a region is worth a closer look, it issues a recursive subcall — to itself or another LM — over that slice and incorporates the result. Recursion bottoms out at the base model’s ordinary forward pass.
One consequence is that input size is no longer a hard ceiling on the computation. The paper reports RLMs processing inputs up to two orders of magnitude beyond the underlying model’s context window and outperforming vanilla frontier LLMs and common long-context scaffolds across four long-context tasks. Beyond long-context answering, recent results demonstrate that RLMs are a powerful paradigm for a wide variety of challenging tasks.
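The core loop is easy to sketch. The real implementation hands the model a REPL and lets it write its own decomposition code; the version below is only a minimal illustration of the recursion shape, with a hardcoded chunk-and-aggregate strategy and a stubbed model call (`llm` here is a placeholder, not a real API):

```python
# Minimal sketch of the RLM recursion shape (not the paper's implementation).
# `llm` is a stub standing in for a base model's forward pass.

def llm(prompt: str) -> str:
    """Stub base model: an ordinary forward pass over a short prompt."""
    return f"answer({len(prompt)} chars)"

def rlm(query: str, context: str, window: int = 1000,
        depth: int = 0, max_depth: int = 3) -> str:
    """Treat `context` as an environment: if it fits in the window, answer
    directly; otherwise partition it, recurse over each slice, aggregate."""
    if len(context) <= window or depth >= max_depth:
        return llm(f"{query}\n\n{context}")  # recursion bottoms out in a forward pass
    chunks = [context[i:i + window] for i in range(0, len(context), window)]
    partials = [rlm(query, c, window, depth + 1, max_depth) for c in chunks]
    return llm(f"{query}\n\nPartial findings:\n" + "\n".join(partials))
```

In the actual paradigm the model itself chooses how to slice and when to recurse; the fixed partition above just makes the input-size point concrete: nothing in the loop caps the size of `context`.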
Reasoning & Tool Use — A Brief History
Reasoning and tool use are related, but they are not the same thing.
Reasoning is about how well a model can allocate inference-time compute to a problem: break it down, explore alternatives, verify intermediate steps, backtrack, and choose a better answer. Early reasoning gains came from methods like chain-of-thought, self-consistency, and later tree-search-style prompting. Those methods improve how the model thinks even when it never touches the outside world.
Tool use is about whether a model can decide to call an external function, search engine, calculator, browser, code runner, or UI action; pass the right arguments; interpret the result; and continue. That is partly a reasoning problem, but it is also an interface and reliability problem: schemas, argument formatting, retries, stop conditions, state tracking, and error recovery. Toolformer made this distinction especially clear by treating tool use as something a model could learn during generation.
Historically, the timeline looks roughly like this:
2022: reasoning first, mostly without tools.
Chain-of-thought prompting showed that asking models to generate intermediate reasoning steps could dramatically improve multi-step reasoning. Self-consistency pushed this further by sampling multiple reasoning paths and selecting the most consistent answer. The key lesson was that a large share of “reasoning” gains could come from spending more inference-time compute on the same prompt, not just from adding more knowledge.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — https://arxiv.org/abs/2201.11903
- Self-Consistency Improves Chain of Thought Reasoning in Language Models — https://arxiv.org/abs/2203.11171
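Self-consistency is simple enough to state in a few lines: sample several reasoning paths at nonzero temperature and keep the most common final answer. The sketch below stubs the sampler with a fixed sequence; in practice `sample_fn` would be a temperature > 0 model call that returns the parsed final answer of a chain-of-thought completion.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt: str, k: int = 5) -> str:
    """Self-consistency in miniature: sample k chain-of-thought
    completions and return the most frequent final answer."""
    answers = [sample_fn(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler standing in for k independent model samples.
samples = iter(["42", "41", "42", "42", "17"])
result = self_consistent_answer(lambda p: next(samples), "What is 6*7?", k=5)
# result == "42": the majority answer wins even though individual samples disagree
```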
Late 2022: the first real bridge between reasoning and acting.
ReAct was the key milestone. It framed the model as alternating between reasoning traces and external actions such as retrieval or environment interaction. This was the moment the field started to see tool use not as a one-off API call, but as a loop in which reasoning selects actions and tool outputs reshape the next reasoning step.
- ReAct: Synergizing Reasoning and Acting in Language Models — https://arxiv.org/abs/2210.03629
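The ReAct loop itself is compact: the model emits either a thought-plus-action or a final answer, the harness executes the action, and the observation is appended to the transcript before the next step. A hedged sketch, with the model and the single tool stubbed out (the dict-shaped step format is my illustration, not the paper's exact trace format):

```python
def react(model, tools: dict, question: str, max_steps: int = 5) -> str:
    """Minimal ReAct loop: the model alternates reasoning with actions;
    tool observations reshape the transcript before the next step."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)  # e.g. {"action": ..., "input": ...} or {"answer": ...}
        if "answer" in step:
            return step["answer"]
        observation = tools[step["action"]](step["input"])
        transcript += f"Action: {step['action']}[{step['input']}]\nObservation: {observation}\n"
    return "no answer within budget"

# Stubbed model and tool for illustration.
steps = iter([{"action": "search", "input": "capital of France"},
              {"answer": "Paris"}])
out = react(lambda t: next(steps),
            {"search": lambda q: "Paris is the capital of France."},
            "Capital of France?")
```

The important structural point is the loop: reasoning selects actions, and tool outputs feed back into the next reasoning step, exactly the framing L17 describes.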
2023: tool use becomes an API discipline, not just a prompting trick.
Toolformer argued that models could learn when to call tools, which tools to call, and how to incorporate the results. Around the same time, vendors began standardizing function-calling interfaces. OpenAI’s June 2023 function calling release was a major product milestone because it made structured tool invocation reliable enough for developers to build on. This improved tool-use reliability faster than it improved deep reasoning.
- Toolformer: Language Models Can Teach Themselves to Use Tools — https://arxiv.org/abs/2302.04761
- OpenAI, “Function calling and other API updates” — https://openai.com/index/function-calling-and-other-api-updates/
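What function calling standardized was essentially a JSON Schema contract: the developer declares a tool, and the model emits arguments that validate against the schema instead of free text. The definition below follows the OpenAI-style `tools` format; no API call is made here, this just shows the shape of the contract:

```python
import json

# An OpenAI-style tool definition: the schema the model targets when
# it emits a structured tool call.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# The model's side of the contract: arguments arrive as a JSON string
# that should parse and validate against the schema above.
tool_call_arguments = json.loads('{"city": "Paris", "unit": "celsius"}')
```

That contract is exactly why reliability improved: retries, validation, and error recovery can all be enforced at the schema layer rather than by parsing prose.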
2023 also deepened the separation between reasoning and tool use.
Tree of Thoughts made it even clearer that inference-time reasoning could improve through internal search alone. It let models explore multiple candidate thought branches, look ahead, and backtrack. That is search over reasoning traces. It can be paired with tools, but it does not require them.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models — https://arxiv.org/abs/2305.10601
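Stripped of the prompting details, Tree of Thoughts is search over partial reasoning traces: expand the frontier, score candidates with a value function, keep the best few. A toy sketch (the `propose` and `score` functions here are placeholders on a trivial problem, not real model calls):

```python
def tree_of_thoughts(propose, score, root: str,
                     breadth: int = 2, depth: int = 3) -> str:
    """Breadth-first search over partial reasoning traces: expand every
    frontier state, score the candidates, keep the top `breadth`."""
    frontier = [root]
    for _ in range(depth):
        candidates = [s + "\n" + t for s in frontier for t in propose(s)]
        frontier = sorted(candidates, key=score, reverse=True)[:breadth]
    return frontier[0]  # highest-valued trace after `depth` levels

# Toy problem: grow the largest digit string. `propose` appends one digit
# to the last thought; `score` prefers numerically larger final thoughts.
best = tree_of_thoughts(
    propose=lambda s: [s.split("\n")[-1] + d for d in "123"],
    score=lambda trace: int(trace.split("\n")[-1]),
    root="0",
)
```

With real model calls, `propose` samples candidate next thoughts and `score` is a self-evaluation prompt; the search skeleton is unchanged, and no tool ever needs to be involved.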
2024: reasoning models become their own product category.
OpenAI’s o1 launch was the clearest signal. The company described o1 as a model family designed to “spend more time thinking before they respond,” and the initial API announcement explicitly noted that features like function calling were not yet included. That was strong evidence that, product-wise, reasoning and tool use were still separable.
- OpenAI, “Introducing OpenAI o1-preview” — https://openai.com/index/introducing-openai-o1-preview/
- OpenAI, “Introducing OpenAI o1” — https://openai.com/o1/
2024 is also when agentic tool use got much more serious.
Anthropic’s Claude 3.5 Sonnet emphasized stronger tool use for coding and agentic tasks, and later in 2024 Anthropic introduced computer use: a model interacting with a real computer via screenshots, mouse, and keyboard. This is a good example of the two axes starting to merge into one agentic stack.
- Anthropic, “Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku” — https://www.anthropic.com/news/3-5-models-and-computer-use
- Anthropic, “Developing a computer use model” — https://www.anthropic.com/news/developing-computer-use
Late 2024 into 2025: vendors start presenting tool use as native, but still distinct from thinking.
Google’s Gemini 2.0 messaging explicitly framed the model family around the “agentic era” and native tool use, while keeping “thinking” as a distinct capability for harder multi-step planning. That split mirrors the real architecture: one layer governs deliberation, another governs interaction with external affordances.
- Google, “Google Gemini AI update, December 2024” — https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/
RLMs are the abstraction where that split finally collapses. The past ~6 months of results are what make the case concrete.
Recent RLM Results
The arc of RLM results moves through three successive failure modes of the single forward pass: long context, then memory, then long reasoning. Each failure mode has its own benchmark — Oolong, LongMemEval, and LongCoT, respectively — and RLM-style systems have posted leading numbers on all three. Just as importantly, the follow-up work is already splitting into two camps: work that strengthens the original RLM implementation, and work that argues the deeper win is broader externalized program search rather than recursion alone.
Part of what makes RLMs hard to appreciate is that, frankly, there aren’t many benchmarks that really showcase the difference. In particular, I don’t view Oolong or LongMemEval as correlating strongly with performance on real-world agentic tasks. LongCoT is much more exciting to me, but it is brand new, and only time will tell how it holds up.
2024: the memory target appears.
LongMemEval defines the benchmark for long-term interactive memory: 500 questions over sustained chat histories spanning extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. It matters here because it gives RLM-style systems a way to test whether recursive/tool-mediated processing can function as a memory system, not just a long-context hack.
October 2025: the original public RLM write-up lands.
In Recursive Language Models, Alex Zhang introduces the core idea: treat the prompt as an external environment, manipulate it through a REPL, and recursively subquery models over slices of context. The post reports an unusually strong early result profile: a GPT-5-mini RLM beats GPT-5 by more than 2× on an Oolong split while being cheaper per query on average, beats ReAct + test-time indexing/retrieval on a BrowseComp-Plus-derived long-context research task, and does not visibly degrade even at 10M+ input tokens.
November 2025: Oolong raises the bar for long-context reasoning.
Oolong is important because it measures something harder than needle-in-a-haystack retrieval: models have to analyze many local chunks and then aggregate them into a global answer. At release, GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all score under 50% on both splits at 128K, making Oolong the clearest early benchmark for the kind of “context as workspace” reasoning RLM is trying to solve.
December 2025: the arXiv paper formalizes RLM.
The Recursive Language Models paper turns the blog’s intuition into a general inference paradigm: prompts are externalized, the LM programmatically inspects and partitions them, and recursive subcalls become part of test-time compute. The headline results are strong: RLMs process inputs up to two orders of magnitude beyond model context windows, outperform vanilla frontier LLMs and common long-context scaffolds across four long-context tasks at comparable cost, and a fine-tuned RLM-Qwen3-8B improves 28.3% on average over its base model.
February 2026: RLM starts posting real “memory” numbers.
In Recursive Language Models as Memory Systems, I reported early LongMemEval results with DSPy.RLM: 87.2% for a baseline Gemini 3 Flash setup, 89.2% with tools + a delegation prompt, and 89.8% with an observational-memory-style structured scaffold. That was a public Top-5-ish result at the time, below Mastra’s 94.87% but already strong evidence that RLM can act as a competitive memory system without a classical retrieval stack. In ypi: a recursive coding agent, I showed an earlier tool-use REPL path scoring 77.6% on LongMemEval — a useful datapoint because it traces the gradient from “tool-using agent” to “true recursive scaffold” inside the same implementation lineage.
- Recursive Language Models as Memory Systems
- ypi: a recursive coding agent
- Observational Memory: 95% on LongMemEval
March 2026: follow-up papers clarify both the strengths and the limits.
Think, But Don’t Overthink reproduces RLM and finds that depth-1 recursion helps on Oolong, but deeper recursion can “overthink,” hurting accuracy and exploding runtime and token cost. Recursive Language Models Meet Uncertainty pushes a sharper critique: recursion itself is not the whole secret, and uncertainty-aware self-reflective program search can improve up to 22% over RLM under the same time budget. Then Coding Agents are Effective Long-Context Processors generalizes the broader thesis: off-the-shelf coding agents outperform published SOTA by 17.3% on average, and on Oolong-Synthetic / Oolong-Real their reported scores (71.75 / 33.73) exceed the paper’s RLM baselines (64.38 / 23.07). That does not really refute RLM; it suggests RLM was the first clearly articulated expression of a larger family of executable, tool-mediated long-context reasoning systems.
- Think, But Don’t Overthink: Reproducing Recursive Language Models
- Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
- Coding Agents are Effective Long-Context Processors
April 2026: the theory catches up to the results.
In The Mismanaged Geniuses Hypothesis, Zhang reframes the whole arc: RLM is not just a benchmark trick for long prompts, but a more expressive scaffold for plans written through code execution, recursive subcalls, and tools-as-functions. That is a useful conceptual update because it connects the empirical results back to the bigger claim: reasoning performance is starting to look less like a property of a single forward pass and more like a property of how well a model can manage executable external computation.
The empirical case moves just as quickly.
April 2026: the benchmark story shifts from long context to long reasoning.
LongCoT introduces 2,500 expert-designed problems for long-horizon chain-of-thought reasoning. At release, the best published models are still under 10% accuracy (GPT-5.2 at 9.8%, Gemini 3 Pro at 6.1%), which makes it an ideal test for whether recursive scaffolds are merely “good at reading long context” or whether they genuinely unlock reasoning depth.
April 2026: RLM immediately breaks LongCoT open.
In LongCoT — A benchmark worthy of a RLM’s attention, I showed Claude Sonnet 4.5 + DSPy.RLM reaching 45.4% on LongCoT-Mini versus 2.6% for the same model without recursion/tools. Then in RLMs are SOTA on LongCoT, I showed the scaffold doing almost all of the lifting for small open models: Qwen3-8B jumps from 0/507 to 33/507 (6.5%) on LongCoT-Mini; Qwen3.5-9B + DSPy.RLM reaches 15.69% on full LongCoT, about 1.6× GPT-5.2 on the same slice; and Qwen3.5-27B + DSPy.RLM reaches 22.18%, more than 2× GPT-5.2. If these numbers hold up, they are some of the clearest evidence yet that recursive scaffolds can manufacture reasoning performance that is not visible in the base model alone.
The arc is now hard to ignore. Oolong gives the long-context failure mode. LongMemEval gives the memory version. LongCoT gives the long-reasoning version. Across all three, the recurring pattern is the same: when the task requires navigating, decomposing, and aggregating information over a structure that is too large or too entangled for one passive forward pass, recursive tool-mediated processing starts to look less like an implementation trick and more like the next reasoning paradigm.
Challenges with RLMs
A new paradigm is not a clean paradigm. Reasoning models were scoffed at for being too expensive. Early tool calling reliability was horrible. Even some leading reasoning models today are pretty bad at function calling.
RLMs have their challenges. My earliest contributions to the standalone RLM package and the DSPy.RLM implementation were purely practical: budgets, timeouts, and managing recursion depth.
Recursion sounds cool but isn’t always a good thing. Remember those viruses that would make your browser open a million popups until your computer crashed?
Recursion can be scary.
The most obvious limitations right now are cost and time. RLMs are expensive. They can take a long time. Worse, in the naive implementation that time is unpredictable and unbounded, because the model is deciding for itself how to decompose the problem.
Cost and time will be solved. Use smaller or faster models for each sub-call, and balance the agent-native “self-similar” decomposition with deterministic control of the graph topology and timeline.
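Those controls are mundane to implement. A sketch of the kind of guardrail I mean — a shared budget object threaded through the recursion that caps depth, total subcalls, and wall-clock time (the names here are illustrative, not the RLM package’s API):

```python
import time

class Budget:
    """Shared guardrail for a recursive run: caps depth, total subcalls,
    and wall-clock time so decomposition cannot run unbounded."""
    def __init__(self, max_depth=3, max_calls=50, max_seconds=120.0):
        self.max_depth, self.max_calls, self.max_seconds = max_depth, max_calls, max_seconds
        self.calls, self.start = 0, time.monotonic()

    def allow(self, depth: int) -> bool:
        self.calls += 1
        return (depth < self.max_depth
                and self.calls <= self.max_calls
                and time.monotonic() - self.start < self.max_seconds)

def recurse(task: str, depth: int, budget: Budget, leaf_fn, split_fn) -> str:
    """Decompose only while the budget allows; otherwise fall back to a
    single direct call on whatever remains."""
    if not budget.allow(depth):
        return leaf_fn(task)
    subtasks = split_fn(task)
    if not subtasks:
        return leaf_fn(task)
    return " | ".join(recurse(t, depth + 1, budget, leaf_fn, split_fn) for t in subtasks)

# Toy run: split a string in half until the budget stops decomposition.
b = Budget(max_depth=2, max_calls=10)
out = recurse("aaaa", 0, b, leaf_fn=lambda t: t.upper(),
              split_fn=lambda t: [t[:len(t) // 2], t[len(t) // 2:]] if len(t) > 1 else [])
```

The same shape works when `leaf_fn` is a model call and `split_fn` is the model’s own decomposition: the budget makes the worst case deterministic even when the decomposition is not.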
The harder challenge, or at least the challenge that is more interesting to me personally, is how to get the language models to “act recursively”.
Obviously the concepts of recursion are in the pre-training data. Clearly reasoning and parallel tool calling are behaviors that post-training incentivizes. Sub-agents are arguably a close behavioral analog to RLMs. And yet anyone who has worked with RLMs will tell you that the models generally suck at behaving recursively. It is not in their nature to decompose their prompt into sub-queries for other instances of themselves to solve.
What’s next?
Well, one obvious next step is to explicitly post-train the models in an RLM harness. Alex Zhang et al. are actively working in this area: MIT OASYS on HuggingFace (see e.g. mit-oasys/rlm-qwen3-8b-v0.1).
But what is the reward function for “optimal recursion”? I suspect this is a multi-billion-dollar question.
The most surprising result to me from my last few days of experimenting was how well very small models can do in RLM harnesses. These models are small enough to run on consumer devices, which potentially means that they offer an opportunity to upset the current “balance of power” between the GPU-rich and GPU-poor.
Yes, more money means you can run more. The best GPUs will always be faster. An RLM of Opus is smarter than an RLM of Llama 3. But I cannot help feeling excited and empowered by the idea that an individual or consortium running many instances of small models on affordable/legacy/local compute infrastructure can now access model capabilities on par with or exceeding those of the most expensive LLMs from the frontier labs. If that is even directionally right, the frontier stops being a place only the largest labs can reach.
Getting Started with RLMs
Here are just a few of the many ways to get started with RLMs:
- alexzhang13/rlm — the reference implementation from Alex Zhang and the RLM paper authors; the cleanest place to read the core recursion loop.
- dspy.RLM — the DSPy integration, which exposes RLM as a composable module inside larger DSPy programs and is what I’ve been using for most of my own experiments.
- ax-llm/ax — a TypeScript DSPy-style framework with first-class RLM support via AxAgent-driven recursive decomposition, bounded sub-queries, and a persistent JS runtime.
- rawwerks/rlm-cli — my CLI wrapper around rlm with directory-as-context, JSON-first output, and self-documenting commands, for running RLMs against local repos and folders.
- rawwerks/ypi — my recursive coding agent built on Pi: one rlm_query tool, one rlm_map fanout helper, and per-child jj workspaces for isolated recursive execution.
P.S.
I almost forgot: fractals.