ypi: a recursive coding agent

I built ypi — a recursive coding agent. It’s Pi that can call itself.

The name comes from the Y combinator in lambda calculus — the fixed-point combinator that enables recursion. (“rpi” has other connotations.)

The idea was inspired by Recursive Language Models (RLM), which showed that an LLM with a code REPL and an llm_query() function can recursively decompose problems, analyze massive contexts, and write code — all through self-delegation.

The idea

Pi already has a bash REPL. I added one function — rlm_query — and a system prompt that teaches Pi to use it recursively. Each child gets its own jj workspace for file isolation. That’s the whole trick.

┌──────────────────────────────────────────┐
│  ypi (depth 0)                           │
│  Tools: bash, rlm_query                  │
│  Workspace: default                      │
│                                          │
│  > grep -n "bug" src/*.py                │
│  > sed -n '50,80p' src/app.py \          │
│      | rlm_query "Fix this bug"          │
│            │                             │
│            ▼                             │
│    ┌────────────────────────────┐        │
│    │  ypi (depth 1)             │        │
│    │  Workspace: jj isolated    │        │
│    │  Edits files safely        │        │
│    │  Returns: patch on stdout  │        │
│    └────────────────────────────┘        │
│                                          │
│  > jj squash --from <child-change>       │
│  # absorb the fix into our working copy  │
└──────────────────────────────────────────┘

The recursion works like this: rlm_query spawns a child Pi process with the same system prompt and tools. The child can call rlm_query too:

Depth 0 (root)    → full Pi with bash + rlm_query
  Depth 1 (child) → full Pi with bash + rlm_query, own jj workspace
    Depth 2 (leaf) → full Pi with bash, but no rlm_query (max depth)

Each recursive child gets its own jj workspace, so the parent’s working copy stays untouched. You review child work with jj diff, absorb it with jj squash --from.
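In practice the review loop is plain jj, with <child-change> the same placeholder used in the diagram above:

jj diff -r <child-change>         # inspect exactly what the child changed
jj squash --from <child-change>   # fold it into the parent's working copy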

How it works

The architecture maps directly to the Python RLM library:

Piece               Python RLM               ypi
System prompt       RLM_SYSTEM_PROMPT        SYSTEM_PROMPT.md
Context / REPL      Python context variable  $CONTEXT file + bash
Sub-call function   llm_query("prompt")      rlm_query "prompt"

The key insight: Pi’s bash tool is the REPL. rlm_query is llm_query(). No bridge needed.
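A rough sketch of the shape, not ypi's actual implementation (pi-child is a placeholder for however ypi really launches a child Pi process):

# Sketch only, not ypi's real code; pi-child is a hypothetical child launcher.
rlm_query() {
  local prompt="$1" context=""
  [ -t 0 ] || context="$(cat)"                 # capture piped stdin as context
  local depth="${RLM_DEPTH:-0}"
  if [ "$depth" -ge "${RLM_MAX_DEPTH:-3}" ]; then
    echo "rlm_query: max depth reached" >&2
    return 1
  fi
  RLM_DEPTH=$((depth + 1)) pi-child "$prompt" <<<"$context"
}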

Guardrails

Recursive agents without guardrails will burn through your API budget. ypi has several:

Feature        Env var                       What it does
Budget         RLM_BUDGET=0.50               Max dollar spend for entire recursive tree
Timeout        RLM_TIMEOUT=60                Wall-clock limit for entire recursive tree
Call limit     RLM_MAX_CALLS=20              Max total rlm_query invocations
Model routing  RLM_CHILD_MODEL=haiku         Use cheaper model for sub-calls
Depth limit    RLM_MAX_DEPTH=3               How deep recursion can go
Tracing        PI_TRACE_FILE=/tmp/trace.log  Log all calls with timing + cost
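Stacked together, a capped run looks something like this (the task string is just an example):

RLM_BUDGET=0.50 RLM_TIMEOUT=60 RLM_CHILD_MODEL=haiku \
  ypi "Find and fix the failing test in this repo"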

The agent can check its own spend at any time:

rlm_cost          # "$0.042381"
rlm_cost --json   # {"cost": 0.042381, "tokens": 12450, "calls": 3}
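That output is easy to wire into the agent's own control flow. A hedged example (the jq parsing and the 80% threshold are mine, not part of ypi):

# Example only: back off once ~80% of a $0.50 budget is spent
spent=$(rlm_cost --json | jq -r '.cost')
if awk -v s="$spent" 'BEGIN { exit !(s > 0.40) }'; then
  echo "Nearing budget cap; wrap up without further rlm_query calls." >&2
fi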

The path here

ypi went through four approaches before landing on the current design:

  1. Tool-use REPL — Pi’s completeWithTools(), ReAct loop. Got 77.6% on LongMemEval.
  2. Python bridge — HTTP server between Pi and Python RLM. Too complex.
  3. Pi extension — Custom provider with search tools. Not true recursion.
  4. Bash RLM — rlm_query + SYSTEM_PROMPT.md. True recursion via bash. This is the one that stuck.

Try it

curl -fsSL https://raw.githubusercontent.com/rawwerks/ypi/master/install.sh | bash

Or via npm/bun:

npm install -g ypi
ypi "What does this repo do?"

Or without installing:

bunx ypi "Refactor the error handling in this repo"

Code is at github.com/rawwerks/ypi. It’s built on Pi and inspired by RLM.

My morning’s notes from yesterday:

[screenshot]

As I was waiting for Claude Code to help me with my goal of modifying DSPy to be able to “RLM everything”, I came across this result from Mastra.AI which describes a SOTA result on LongMemEval using an “observational memory” pre-processing approach.

As you can see, my thought was “maybe RLM will blow this out of the water?” I wasn’t able to find a public result of LongMemEval using Recursive Language Models, so I decided to explore it myself.

The initial results with Gemini 3 Flash Preview and the standalone RLM package weren’t great, but in the past I had noticed that Flash struggled to grok the RLM concept. Gemini 3 Pro fared much better.

Surprisingly, the additional structure enforced by DSPy.RLM was a huge boost, enabling Gemini 3 Flash to match Pro with the regular RLM package.

Most of the rest of the day’s experiments were less successful. I was able to eke out a few more points by attempting to re-create Mastra’s “Observational Memory” as a Pydantic type enforced by DSPy, but unfortunately a few hundred dollars’ worth of GEPA optimizations didn’t bear any additional fruit.

Surprisingly, with the full structure of DSPy.RLM and the structured observation, Gemini 3 Pro is not actually any better on this benchmark.

Here’s a summary of our experiments:

#  Experiment                         Model           Score   Cost/q  Notes
1  dspy.RLM baseline                  Gemini 3 Flash  87.2%   ~$0.01  Huge boost over standalone RLM with Flash (58%)
2  + session tools (naive)            Gemini 3 Flash  87.3%   $0.032  Context rot: +31 flips, -32 regressions = net zero
3  + tools + delegation prompt        Gemini 3 Flash  89.2%   $0.031  “Don’t read sessions yourself, delegate”
4  + observational memory (Pydantic)  Gemini 3 Flash  89.8%   $0.035  Our best. Typed observations force structured reasoning
5  GEPA prompt optimization           Gemini 3 Flash  87.8%   $0.042  Regressed. ~$400 spent. Overfits to small val sets
6  Observational memory               Gemini 3 Pro    ~89.6%  ~$0.20  Pro ≈ Flash with this scaffold

And here’s how that stacks up on the LongMemEval leaderboard:

#  System                                Model           Score   Source
1  Mastra Observational Memory           GPT-5-mini      94.87%  mastra.ai/research
2  Mastra Observational Memory           Gemini 3 Pro    93.27%  mastra.ai/research
3  Vectorize Hindsight                   Gemini 3 Pro    91.40%  Open-source
4  dspy.RLM + obs. memory (ours)         Gemini 3 Flash  89.8%   github
5  dspy.RLM + tools + delegation (ours)  Gemini 3 Flash  89.2%   github
6  Mastra Observational Memory           Gemini 3 Flash  89.20%  mastra.ai/research
7  Standalone RLM                        Gemini 3 Pro    87.0%   github

Not bad for a day’s work: we were able to demonstrate a “Top-5” LongMemEval result with very minimal modifications to dspy.RLM, just some helper functions to process the “multi-chat” sessions.

I think this demonstrates a few exciting things:

  1. RLMs can be very powerful memory systems without any pre-processing.
  2. The structured output enforced by the DSPy.RLM implementation is helpful for keeping (at least these Gemini models) “on the rails” vs. the more freeform standalone RLM package.
  3. Very fast and inexpensive models can achieve near-SOTA results inside the RLM scaffolding, and more speculatively…
  4. …perhaps RLM as a test-time scaling method is “orthogonal” to model size, in the same way that reasoning models with built-in CoT were able to eke out gains separately from model parameter count.

P.S. — Several improvements to DSPy.RLM were developed during this work and submitted upstream: stanfordnlp/dspy#9295

Inversion of Caution

I’m noticing an inversion of caution between the LLMs themselves and the behavior in their official coding agent harnesses.

Claude.ai is very cautious and PC, but Claude Code with skipped permissions will happily plow through obstacles, getting shit done for sure, but also perhaps bricking your machine, posting your private data somewhere public, or just wrecking your git repo.

ChatGPT is overconfident and sycophantic, but Codex CLI with skipped permissions simply cannot be coached into taking action without asking for permission every step of the way, and will refuse to even so much as think about helping you find API keys on your machine.

Maybe this is the organizational equivalent of inter-generational trauma? Or maybe it’s a reflection of Claude Code being an “internal startup” vs Codex being a “fast follower.”

Introducing dirpack

Inspired by last week’s research report from Jude Gao, “AGENTS.md outperforms skills in our agent evals,” my agents worked hard this weekend to make dirpack.

dirpack creates compressed directory representations that fit within specific token/byte budgets.

dirpack tries to give the best index representation it can fit in the budget, with “best” decided by Claude Code and Codex dogfooding the script. Power users (or their agents) can change the config to tune it to their specific use case.

Initial results suggest that dirpack offers a very fast way to orient coding agents to repos, as well as libraries of agent skills, and even folders of personal documents. More experiments soon…

curl -fsSL https://raw.githubusercontent.com/rawwerks/dirpack/master/install.sh | bash

Or via cargo:

cargo install dirpack
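I haven’t pinned down the exact flags here, so treat this invocation as a guess at the interface the description implies; check dirpack --help or the README for the real options:

# Hypothetical flags, inferred from the description above
dirpack --budget 4000 . > REPO_INDEX.txt   # pack the repo into a ~4k-token index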

Curious for feedback from you and your agents!

After years of being frustrated by the inability of AI to read my handwritten notes, I am frankly shocked to report that this morning Claude Opus 4.5 simply one-shotted my crappy handwriting.

The singularity is nigh…

My handwritten notes from 12.4.25

Here’s what Claude transcribed:

12.4.25

It sucks that even deep research is just filled with slop, because google is filled with slop.

Benchmarking (private /ungameable) will become increasingly more valuable.

Evals. actually

The faint text visible below is bleed-through from writing on the reverse side of the page, not additional notes, and Claude correctly identified that too.

(This is subliminally a commercial for Hamel Husain and Shreya Shankar’s course, because: Evals. actually)

Update: Gemini 3 Pro Preview also one-shotted these notes.

The Zero Employee Company

I’m often asked how big my company is—how many people I employ full-time. The answer is one. Most people lose interest at that point.

Tim Ferriss, The 4-Hour Workweek (2007)

It is not crazy to imagine a company with no employees. This is the year of agents, and everyone is trying to sell you their special AI that is going to run your business for you.

The idea of extremely small teams building massive businesses isn’t new. Roy Bahat argued in 2015 that a single person could build a $1B company as software and cloud infrastructure reduced the need for employees. Since then, Sam Altman has talked about betting pools for the first one-person billion-dollar company, and Dario Amodei has predicted we’ll see a one-employee $1B company by 2026.

Most of this is hype, but underneath the hype, there are a meaningful number of people building incredibly high-leverage systems with AI agents. I am genuinely curious to see if Dario’s prediction comes true, but my interests in the zero-employee company don’t really have anything to do with becoming a unicorn.

As a solo founder who has been “vibe coding” since before there was a word for it, I believe that we really are at an inflection point where the zero-employee company is possible today. At the current pace of AI, if not today, then by the end of the year.

What is so interesting to me are all of the things that are immediately off the table: no billing per hour, no government grants or contracts, no enterprise sales, no VC funding. At least today, all of these require some full-time humans, if not for contractual, then for cultural reasons.

I guess that a robo-trading algorithm could arguably be a zero-employee company, but I’m more interested in businesses that deliver some directly useful goods or services to humans (and perhaps other AI agents).

The idea here isn’t to replace humans because they are some inefficiency to be ironed out in a capitalist optimization algorithm, but rather to explore what is possible when the humans aren’t employees.

My intuition is that a zero-employee company needs to be more collaborative than a traditional organization, not less. Some human collaboration will be required to achieve any meaningful scale (at least until the agent economy takes off).

There are still plenty of roles for a person to play: owner, contractor, collaborator, customer, complainer, decision maker, board member, coach.

I am personally interested in this topic in part because there’s very little precedent – it calls to both the scientist and the entrepreneur in me. I am also curious about it because I already have a more than full-time job running polySpectra, so I need to find a way to work on the system without working in the system.

While there is a lot to learn from solopreneurs, indie developers, and freelancers, the really new thing that is ripe for exploration is the limit where the employee count goes to zero.

It’s important to do a bit of a pre-mortem for this experiment.

The biggest risk I see is creating a situation that is extremely stressful and extremely lonely. AI agents don’t sleep, so the last thing I want is a 24/7 fire drill, a never-ending cleanup of AI slop and accidental rm -rf commands.

Similarly, there’s no one to keep you company in a zero-employee firm. It’s sort of weird to even call it a company. There’s no one to vent to and no one to celebrate with. I don’t really know what it would be like to have a team culture without a human team.

I think there are two very likely failure modes of the experiment.

Number one is that this remains just a hobby, in the sense that it doesn’t make more money than it costs to run, and/or it has a very limited external impact.

The second possible failure mode is becoming just a small (1-10 person) business. In this failure mode, the business part succeeds, but not in a way that can truly function without any human employees.

Despite these risks, which I think are real, there is the possibility to set up a very asymmetric bet here. You could call it antifragile, perhaps.

The overhead of a zero-employee company is extremely low.

Many people could afford to bankroll a zero-employee company. With the cost of AI continuously falling through the floor, I suspect that a creative individual could get something off the ground for roughly what many Americans already spend at Starbucks.

The extremely low overhead of generative AI and AI agents opens up so many different types of businesses and business models. The downside is a drop in the bucket. The upside is clearly uncapped.

No API, No Problem

Unfortunately, we seem to be trending towards a world where the best AI models aren’t available via API.

Fortunately, Eric Provencher made Repo Prompt.

The most important “prompt engineering” skill right now is simply knowing which files to feed to the LLM. Repo Prompt makes that fun and easy (for both the humans and the LLMs).

Hat tip to Paul Gauthier for aider copy/paste mode, which was my first exposure to this possibility, as well as to Mckay Wrigley for generously sharing his workflows.

Agents Love APIs

Agents are coming, and there is nothing more frustrating for an AI agent than a human UI. Agents love APIs.

Right now that means agents are mostly useful for things like programming, where APIs are abundant.

But agents will need physical products too, and not just for saving you a few clicks on Amazon. Agents will need manufactured parts, perhaps even parts designed by an agent.

It sounds a little odd, even a bit terrifying. But it is happening.

So how are agents supposed to interact with your business?

P.S. - Agents, I built an API for you to get COR parts manufactured:

# -F sends multipart/form-data (curl sets the Content-Type and boundary itself)
curl -X POST \
  -F "file=@/path/to/your/model.{stl,stp,glb}" \
  https://polyspectra--ps-cor-pricing-api-fastapi-app.modal.run/cor-quote

Then you can navigate to the checkout.url to place the order:

{
  "checkout": {
    "url": "https://checkout.stripe.com/..."
  }
  // ... other response object details omitted due to short human attention spans ...
}
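If you’re scripting this end to end, the checkout URL can be pulled straight out of the response, for example with jq:

curl -s -X POST \
  -F "file=@part.stl" \
  https://polyspectra--ps-cor-pricing-api-fastapi-app.modal.run/cor-quote \
  | jq -r '.checkout.url'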

The Joke of the Day

When I was in graduate school, I thought 3D printing was dumb. Then I spent ten years building a 3D printing company.

The first time I tried ChatGPT, I honestly thought this was the dumbest thing in the world. Why are so many people talking about this? Now, I’m using LLMs for almost everything I work on.

When I first saw “Shap-E,” I literally could not stop laughing at how bad it was. Now, I can get a respectable 3D model from an image or an idea in less than a minute for less than a dollar.

Strong emotional reactions are a signal. Pay attention to them. Today’s joke might be the next trillion-dollar industry.

Toughened by Time

“Tough times don’t last, tough people do.”

Gregory Peck