My morning’s notes from yesterday:

As I was waiting for Claude Code to help me with my goal of modifying DSPy to be able to “RLM everything”, I came across this result from Mastra.AI: a SOTA score on LongMemEval using an “observational memory” pre-processing approach.
As you can see, my thought was “maybe RLM will blow this out of the water?” I couldn’t find a public LongMemEval result using Recursive Language Models, so I decided to explore it myself.
> out of curiosity after reading this, i started benchmarking rlm and dspy.rlm on longmemeval
>
> tl;dr - i think i might have a new "sota memory system" by the end of the day.
>
> cc @DSPyOSS @a1zhang @lateinteraction https://t.co/RS8jVBnlib
>
> — Raymond Weitekamp (@raw_works) February 10, 2026
The initial results with Gemini 3 Flash Preview and the standalone RLM package weren’t great, though in the past I had noticed that Flash struggled to grok the RLM concept; Gemini 3 Pro fared much better.
Surprisingly, the additional structure enforced by dspy.RLM was a huge boost, enabling Gemini 3 Flash to match what Pro scored with the standalone RLM package.
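For orientation, here is roughly what that baseline looks like. This is a minimal sketch, not my actual harness: the model id is an assumed litellm string, and I’m assuming dspy.RLM accepts a signature and keyword inputs the way other DSPy modules do.

```python
import dspy

# Assumed litellm-style model id; substitute whatever resolves to Gemini 3 Flash for you.
dspy.configure(lm=dspy.LM("gemini/gemini-3-flash-preview"))

# Assumption: dspy.RLM takes a signature string like other DSPy modules and explores
# the long `context` recursively instead of stuffing it into a single prompt.
rlm = dspy.RLM("context, question -> answer")

def answer_question(context: str, question: str) -> str:
    """Run one LongMemEval question through the RLM scaffold."""
    pred = rlm(context=context, question=question)
    return pred.answer
```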
Most of the rest of the day’s experiments were less successful. I was able to eke out a few more points by attempting to re-create Mastra’s “Observational Memory” as a Pydantic type enforced by DSPy, but unfortunately a few hundred dollars’ worth of GEPA optimizations didn’t bear any additional fruit.
Surprisingly, with the full structure of dspy.RLM and the structured observations, Gemini 3 Pro is not actually any better than Flash on this benchmark.
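For the curious, the “structured observation” was just a Pydantic model used as a typed DSPy output field. The field names below are illustrative (my guess at the spirit of Mastra’s format, not their actual schema):

```python
from pydantic import BaseModel, Field
import dspy

class Observation(BaseModel):
    """One distilled fact pulled out of a chat session (illustrative fields)."""
    session_date: str = Field(description="Date of the session the fact came from")
    fact: str = Field(description="A single, self-contained statement about the user")

class ObserveSession(dspy.Signature):
    """Extract durable, user-specific facts from one chat session."""
    session: str = dspy.InputField()
    observations: list[Observation] = dspy.OutputField()

observe = dspy.Predict(ObserveSession)
```

Those typed observations are what rows 4 and 6 in the table below have in common.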
Here’s a summary of our experiments:
| # | Experiment | Model | Score | Cost / question | Notes |
|---|---|---|---|---|---|
| 1 | dspy.RLM baseline | Gemini 3 Flash | 87.2% | ~$0.01 | Huge boost over standalone RLM with Flash (58%) |
| 2 | + session tools (naive) | Gemini 3 Flash | 87.3% | $0.032 | Context rot: +31 flips, -32 regressions = net zero |
| 3 | + tools + delegation prompt | Gemini 3 Flash | 89.2% | $0.031 | “Don’t read sessions yourself, delegate” |
| 4 | + observational memory (Pydantic) | Gemini 3 Flash | 89.8% | $0.035 | Our best. Typed observations force structured reasoning |
| 5 | GEPA prompt optimization | Gemini 3 Flash | 87.8% | $0.042 | Regressed. ~$400 spent. Overfits to small val sets |
| 6 | Observational memory | Gemini 3 Pro | ~89.6% | ~$0.20 | Pro ≈ Flash with this scaffold |
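To make rows 2–4 a bit more concrete: the “session tools” were small Python callables exposed to the root model, and the delegation prompt was a plain instruction to spawn sub-queries rather than read sessions inline. A rough sketch with invented helper names (the dspy.RLM wiring is omitted here):

```python
from typing import List

def make_session_tools(sessions: List[str]):
    """Build two illustrative tools over the flattened session texts (invented names)."""

    def list_sessions() -> str:
        """One line per session (index + rough size) so the model can triage cheaply."""
        return "\n".join(f"[{i}] {len(text)} chars" for i, text in enumerate(sessions))

    def read_session(index: int) -> str:
        """Return the full text of a single session by index."""
        return sessions[index]

    return list_sessions, read_session


# The gist of the experiment-3 delegation prompt (paraphrased, not verbatim):
DELEGATION_HINT = (
    "Do not read whole sessions yourself. Use list_sessions() to find candidates, "
    "then delegate each candidate to a recursive sub-query and aggregate the answers."
)
```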
And here’s how that stacks up on the LongMemEval leaderboard:
| # | System | Model | Score | Source |
|---|---|---|---|---|
| 1 | Mastra Observational Memory | GPT-5-mini | 94.87% | mastra.ai/research |
| 2 | Mastra Observational Memory | Gemini 3 Pro | 93.27% | mastra.ai/research |
| 3 | Vectorize Hindsight | Gemini 3 Pro | 91.40% | Open-source |
| 4 | dspy.RLM + obs. memory (ours) | Gemini 3 Flash | 89.8% | github |
| 5 | dspy.RLM + tools + delegation (ours) | Gemini 3 Flash | 89.2% | github |
| 6 | Mastra Observational Memory | Gemini 3 Flash | 89.20% | mastra.ai/research |
| 7 | Standalone RLM | Gemini 3 Pro | 87.0% | github |
Not bad for a day’s work: we were able to demonstrate a “Top-5” LongMemEval result with very minimal modifications to dspy.RLM, just some helper functions to process the “multi-chat” sessions.
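Those helper functions amount to roughly the snippet below: flattening each LongMemEval instance’s multi-chat haystack into per-session text blocks. The field names follow the public LongMemEval JSON (`haystack_dates`, `haystack_sessions`, turns with `role`/`content`); adjust if your copy differs.

```python
import json
from typing import List

def load_longmemeval(path: str) -> list:
    """Load a LongMemEval split (a JSON list of question instances)."""
    with open(path) as f:
        return json.load(f)

def flatten_sessions(instance: dict) -> List[str]:
    """Render each multi-turn chat session as one tagged text block."""
    blocks = []
    for date, turns in zip(instance["haystack_dates"], instance["haystack_sessions"]):
        lines = [f"[session on {date}]"]
        lines += [f"{turn['role']}: {turn['content']}" for turn in turns]
        blocks.append("\n".join(lines))
    return blocks
```

These per-session blocks are what the session tools above index into, or what gets joined into a single `context` string for the baseline run.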
I think this demonstrates a few exciting things:
- RLMs can be very powerful memory systems without any pre-processing.
- The structured output enforced by the dspy.RLM implementation helps keep models (at least these Gemini models) “on the rails” compared to the more freeform standalone RLM package.
- Very fast and inexpensive models can achieve near-SOTA results inside the RLM scaffolding, and more speculatively…
- …perhaps RLM as a test-time scaling method is “orthogonal” to model size, in the same way that reasoning models with built-in CoT were able to eke out gains separately from model parameter count.
P.S. — Several improvements to DSPy.RLM were developed during this work and submitted upstream: stanfordnlp/dspy#9295