My morning’s notes from yesterday:

[screenshot: Mastra.AI’s LongMemEval result announcement]

As I was waiting for Claude Code to help me with my goal of modifying DSPy to “RLM everything”, I came across this result from Mastra.AI describing a SOTA result on LongMemEval using an “observational memory” pre-processing approach.

As you can see, my thought was “maybe RLM will blow this out of the water?” I couldn’t find a public LongMemEval result using Recursive Language Models (RLMs), so I decided to explore it myself.

The initial results with Gemini 3 Flash Preview and the standalone RLM package weren’t great, but in the past I had noticed that Flash struggled to grok the RLM concept. Gemini 3 Pro fared much better.

Surprisingly, the additional structure enforced by dspy.RLM was a huge boost, enabling Gemini 3 Flash to match what Pro scored with the standalone RLM package.
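For anyone who hasn’t tried it, the dspy.RLM invocation is only a few lines. Here’s a minimal sketch, assuming the module follows DSPy’s usual signature-string convention; the model id is illustrative, so check the dspy docs (and the PR in the P.S.) for the exact constructor:

```python
import dspy

# Model id is illustrative; use whatever identifier your provider exposes.
dspy.configure(lm=dspy.LM("gemini/gemini-3-flash-preview"))

chat_history = (
    "=== Session 0 (2023-04-01) ===\n"
    "user: I'm moving to Lisbon in May.\n"
    "assistant: Exciting! Want help planning the move?"
)

# Assumption: dspy.RLM takes a signature string like other DSPy modules.
rlm = dspy.RLM("context, question -> answer")
prediction = rlm(context=chat_history, question="Which city am I moving to?")
print(prediction.answer)
```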

Most of the rest of the day’s experiments were less successful. I eked out a few more points by re-creating Mastra’s “Observational Memory” as a Pydantic type enforced by DSPy (sketch below), but unfortunately a few hundred dollars’ worth of GEPA optimization didn’t bear any additional fruit.
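The observational-memory type looked roughly like the following. This is a minimal sketch of the idea; the field names are mine, not Mastra’s actual schema:

```python
import dspy
from pydantic import BaseModel, Field

class Observation(BaseModel):
    """One atomic, dated fact distilled from a chat session.
    Field names are illustrative, not Mastra's actual schema."""
    session_date: str = Field(description="When the source session happened")
    topic: str = Field(description="Short label, e.g. 'travel plans'")
    fact: str = Field(description="A single self-contained fact about the user")

class Observe(dspy.Signature):
    """Compress one raw chat session into typed observations."""
    session: str = dspy.InputField()
    observations: list[Observation] = dspy.OutputField()

observe = dspy.Predict(Observe)
```

DSPy validates the model’s output against the Pydantic schema, which is what “typed observations force structured reasoning” refers to in the table below.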

Also surprising: with the full structure of dspy.RLM and the structured observations, Gemini 3 Pro is not actually any better than Flash on this benchmark.

Here’s a summary of our experiments:

| # | Experiment | Model | Score | Cost/q | Notes |
|---|------------|-------|-------|--------|-------|
| 1 | dspy.RLM baseline | Gemini 3 Flash | 87.2% | ~$0.01 | Huge boost over standalone RLM with Flash (58%) |
| 2 | + session tools (naive) | Gemini 3 Flash | 87.3% | $0.032 | Context rot: +31 flips, -32 regressions = net zero |
| 3 | + tools + delegation prompt | Gemini 3 Flash | 89.2% | $0.031 | “Don’t read sessions yourself, delegate” (sketch below) |
| 4 | + observational memory (Pydantic) | Gemini 3 Flash | 89.8% | $0.035 | Our best. Typed observations force structured reasoning |
| 5 | GEPA prompt optimization | Gemini 3 Flash | 87.8% | $0.042 | Regressed. ~$400 spent. Overfits to small val sets |
| 6 | Observational memory | Gemini 3 Pro | ~89.6% | ~$0.20 | Pro ≈ Flash with this scaffold |
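The “session tools” in experiments 2 and 3 were small helpers the RLM could call instead of having the whole haystack dumped into its prompt. Here’s a sketch under the assumption that the RLM can call plain Python functions from its REPL environment; the names and prompt wording are mine, not the verbatim code:

```python
# Toy haystack standing in for one LongMemEval question's sessions.
SESSIONS = [
    [{"role": "user", "content": "I'm moving to Lisbon in May."}],
    [{"role": "user", "content": "Can you find me a dentist for next week?"}],
]
DATES = ["2023-04-01", "2023-04-09"]

def list_sessions() -> str:
    """A cheap index the root model can scan without reading any session."""
    return "\n".join(
        f"{i}: {d}, {len(s)} turns" for i, (s, d) in enumerate(zip(SESSIONS, DATES))
    )

def read_session(i: int) -> str:
    """Full text of one session; meant for recursive sub-calls, not the root."""
    return "\n".join(f"{t['role']}: {t['content']}" for t in SESSIONS[i])

# Experiment 3's fix, paraphrased: a one-line nudge in the instructions.
DELEGATION_PROMPT = (
    "Don't read sessions into your own context; spawn sub-queries over "
    "read_session(i) and aggregate their answers."
)
```

The naive version (experiment 2) let the root model call read_session directly, which rotted its context; the delegation nudge is what turned the tools into a real gain.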

And here’s how that stacks up on the LongMemEval leaderboard:

| # | System | Model | Score | Source |
|---|--------|-------|-------|--------|
| 1 | Mastra Observational Memory | GPT-5-mini | 94.87% | mastra.ai/research |
| 2 | Mastra Observational Memory | Gemini 3 Pro | 93.27% | mastra.ai/research |
| 3 | Vectorize Hindsight | Gemini 3 Pro | 91.40% | Open-source |
| 4 | dspy.RLM + obs. memory (ours) | Gemini 3 Flash | 89.8% | github |
| 5 | dspy.RLM + tools + delegation (ours) | Gemini 3 Flash | 89.2% | github |
| 6 | Mastra Observational Memory | Gemini 3 Flash | 89.20% | mastra.ai/research |
| 7 | Standalone RLM | Gemini 3 Pro | 87.0% | github |

Not bad for a day’s work: we demonstrated a top-5 LongMemEval result with very minimal modifications to dspy.RLM, just some helper functions to process the “multi-chat” sessions (sketch below).
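Those helper functions amounted to little more than flattening each question’s haystack into one context string. A sketch using the public LongMemEval field names (haystack_sessions, haystack_dates, question); double-check against your copy of the dataset:

```python
import json

def load_question(path: str, question_id: str) -> tuple[str, str]:
    """Return (flattened context, question) for one LongMemEval item."""
    with open(path) as f:
        items = json.load(f)
    item = next(x for x in items if x["question_id"] == question_id)
    chunks = []
    for i, (sess, date) in enumerate(
        zip(item["haystack_sessions"], item["haystack_dates"])
    ):
        turns = "\n".join(f"{t['role']}: {t['content']}" for t in sess)
        chunks.append(f"=== Session {i} ({date}) ===\n{turns}")
    return "\n\n".join(chunks), item["question"]
```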

I think this demonstrates a few exciting things:

  1. RLMs can be very powerful memory systems without any pre-processing.
  2. The structured output enforced by the dspy.RLM implementation helps keep models (at least these Gemini models) “on the rails”, compared with the more freeform standalone RLM package.
  3. Very fast and inexpensive models can achieve near-SOTA results inside the RLM scaffolding, and more speculatively…
  4. …perhaps RLM as a test-time scaling method is “orthogonal” to model size, in the same way that reasoning models with built-in CoT were able to eke out gains separately from model parameter count.

P.S. — Several improvements to DSPy.RLM were developed during this work and submitted upstream: stanfordnlp/dspy#9295