My morning’s notes from yesterday:

As I was waiting for Claude Code to help me with my goal of modifying DSPy to be able to “RLM everything”, I came across this result from Mastra.AI: a SOTA score on LongMemEval using an “observational memory” pre-processing approach.
As you can see, my thought was “maybe RLM will blow this out of the water?” I couldn’t find a public LongMemEval result using Recursive Language Models, so I decided to explore it myself.
> out of curiosity after reading this, i started benchmarking rlm and dspy.rlm on longmemeval
>
> tl;dr - i think i might have a new "sota memory system" by the end of the day.
>
> cc @DSPyOSS @a1zhang @lateinteraction https://t.co/RS8jVBnlib
>
> — Raymond Weitekamp (@raw_works) February 10, 2026
The initial results with Gemini 3 Flash Preview and the standalone RLM package weren’t great, though in the past I had noticed that Flash struggled to grok the RLM concept; Gemini 3 Pro fared much better.
Surprisingly, the additional structure enforced by dspy.RLM was a huge boost, enabling Gemini 3 Flash to match what Pro scored with the standalone RLM package.
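For orientation, here is roughly what that baseline looks like. This is a minimal sketch, not my actual harness: the model id is an assumed litellm string, and I’m assuming dspy.RLM accepts a signature and keyword inputs the way other DSPy modules do.

```python
import dspy

# Assumed litellm-style model id; substitute whatever resolves to Gemini 3 Flash for you.
dspy.configure(lm=dspy.LM("gemini/gemini-3-flash-preview"))

# Assumption: dspy.RLM takes a signature string like other DSPy modules and explores
# the long `context` recursively instead of stuffing it into a single prompt.
rlm = dspy.RLM("context, question -> answer")

def answer_question(context: str, question: str) -> str:
    """Run one LongMemEval question through the RLM scaffold."""
    pred = rlm(context=context, question=question)
    return pred.answer
```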
Most of the rest of the day’s experiments were less successful. I was able to eke out a few more points by attempting to re-create Mastra’s “Observational Memory” as a Pydantic type enforced by DSPy, but unfortunately a few hundred dollars’ worth of GEPA optimizations didn’t bear any additional fruit.
Surprisingly, with the full structure of dspy.RLM and the structured observations, Gemini 3 Pro is not actually any better than Flash on this benchmark.
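For the curious, the “structured observation” was just a Pydantic model used as a typed DSPy output field. The field names below are illustrative (my guess at the spirit of Mastra’s format, not their actual schema):

```python
from pydantic import BaseModel, Field
import dspy

class Observation(BaseModel):
    """One distilled fact pulled out of a chat session (illustrative fields)."""
    session_date: str = Field(description="Date of the session the fact came from")
    fact: str = Field(description="A single, self-contained statement about the user")

class ObserveSession(dspy.Signature):
    """Extract durable, user-specific facts from one chat session."""
    session: str = dspy.InputField()
    observations: list[Observation] = dspy.OutputField()

observe = dspy.Predict(ObserveSession)
```

Those typed observations are what rows 4 and 6 in the table below have in common.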
Here’s a summary of our experiments:
| # | Experiment | Model | Score | Cost / question | Notes |
|---|---|---|---|---|---|
| 1 | dspy.RLM baseline | Gemini 3 Flash | 87.2% | ~$0.01 | Huge boost over standalone RLM with Flash (58%) |
| 2 | + session tools (naive) | Gemini 3 Flash | 87.3% | $0.032 | Context rot: +31 flips, -32 regressions = net zero |
| 3 | + tools + delegation prompt | Gemini 3 Flash | 89.2% | $0.031 | “Don’t read sessions yourself, delegate” |
| 4 | + observational memory (Pydantic) | Gemini 3 Flash | 89.8% | $0.035 | Our best. Typed observations force structured reasoning |
| 5 | GEPA prompt optimization | Gemini 3 Flash | 87.8% | $0.042 | Regressed. ~$400 spent. Overfits to small val sets |
| 6 | Observational memory | Gemini 3 Pro | ~89.6% | ~$0.20 | Pro ≈ Flash with this scaffold |
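To make rows 2–4 a bit more concrete: the “session tools” were small Python callables exposed to the root model, and the delegation prompt was a plain instruction to spawn sub-queries rather than read sessions inline. A rough sketch with invented helper names (the dspy.RLM wiring is omitted here):

```python
from typing import List

def make_session_tools(sessions: List[str]):
    """Build two illustrative tools over the flattened session texts (invented names)."""

    def list_sessions() -> str:
        """One line per session (index + rough size) so the model can triage cheaply."""
        return "\n".join(f"[{i}] {len(text)} chars" for i, text in enumerate(sessions))

    def read_session(index: int) -> str:
        """Return the full text of a single session by index."""
        return sessions[index]

    return list_sessions, read_session


# The gist of the experiment-3 delegation prompt (paraphrased, not verbatim):
DELEGATION_HINT = (
    "Do not read whole sessions yourself. Use list_sessions() to find candidates, "
    "then delegate each candidate to a recursive sub-query and aggregate the answers."
)
```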
And here’s how that stacks up on the LongMemEval leaderboard:
| # | System | Model | Score | Source |
|---|---|---|---|---|
| 1 | Mastra Observational Memory | GPT-5-mini | 94.87% | mastra.ai/research |
| 2 | Mastra Observational Memory | Gemini 3 Pro | 93.27% | mastra.ai/research |
| 3 | Vectorize Hindsight | Gemini 3 Pro | 91.40% | Open-source |
| 4 | dspy.RLM + obs. memory (ours) | Gemini 3 Flash | 89.8% | github |
| 5 | dspy.RLM + tools + delegation (ours) | Gemini 3 Flash | 89.2% | github |
| 6 | Mastra Observational Memory | Gemini 3 Flash | 89.20% | mastra.ai/research |
| 7 | Standalone RLM | Gemini 3 Pro | 87.0% | github |
Not bad for a day’s work: we were able to demonstrate a “Top-5” LongMemEval result with very minimal modifications to dspy.RLM, just some helper functions to process the “multi-chat” sessions.
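Those helper functions amount to roughly the snippet below: flattening each LongMemEval instance’s multi-chat haystack into per-session text blocks. The field names follow the public LongMemEval JSON (`haystack_dates`, `haystack_sessions`, turns with `role`/`content`); adjust if your copy differs.

```python
import json
from typing import List

def load_longmemeval(path: str) -> list:
    """Load a LongMemEval split (a JSON list of question instances)."""
    with open(path) as f:
        return json.load(f)

def flatten_sessions(instance: dict) -> List[str]:
    """Render each multi-turn chat session as one tagged text block."""
    blocks = []
    for date, turns in zip(instance["haystack_dates"], instance["haystack_sessions"]):
        lines = [f"[session on {date}]"]
        lines += [f"{turn['role']}: {turn['content']}" for turn in turns]
        blocks.append("\n".join(lines))
    return blocks
```

These per-session blocks are what the session tools above index into, or what gets joined into a single `context` string for the baseline run.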
I think this demonstrates a few exciting things:
- RLMs can be very powerful memory systems without any pre-processing.
- The structured output enforced by the dspy.RLM implementation helps keep models (at least these Gemini models) “on the rails” compared to the more freeform standalone RLM package.
- Very fast and inexpensive models can achieve near-SOTA results inside the RLM scaffolding, and more speculatively…
- …perhaps RLM as a test-time scaling method is “orthogonal” to model size, in the same way that reasoning models with built-in CoT were able to eke out gains separately from model parameter count.
P.S. — Several improvements to DSPy.RLM were developed during this work and submitted upstream: stanfordnlp/dspy#9295