As usual, I can’t help but dive right into anything that comes out of Omar Khattab’s lab. “Machine Studying” is a very interesting blog post that attempts to define “expertise” and “intelligence” based on an agent’s ability to quickly learn new material from a corpus.
“In a way, Machine Learning asks how a system can improve from data when we have a precise objective to optimize. Machine Studying asks what an agent should do when it’s given a declarative corpus and no downstream task.”
I was curious to see what would happen with a modern harness (Codex CLI) and a top model (GPT 5.5 xhigh reasoning).
I got a very high score right off the bat (76%, roughly 3x the highest reported in the original blog). That suggests one of two things: either this combination of model and harness already has high expertise on these tasks (it knows them well), or it has fairly high intelligence for this category of task (the ability to gain expertise).
Both examples in the current version of StudyBench test an agent’s ability to use a specific version of a programming tool (DSPy and OpenClaw). The tested versions date from March and April of 2026, so I wondered whether they might already be in GPT 5.5’s training data, although OpenAI’s official page gives a knowledge cutoff of Dec 1, 2025.
This led me to the idea that there are at least two different environments to study in: “the library” and “the lab”. In the library you have books (i.e. documentation), and in the lab you have instruments (i.e. packages). My initial environments for DSPy and OpenClaw gave Codex access to both the documentation and the package, so I asked my buddy Codex (GPT 5.5 xhigh) to build out the remaining three environments. (“Lab only” turned out to be way trickier than anticipated, as you have to block the package’s self-documentation.)
So now, for each OpenClaw and DSPy StudyBench question, we end up with these four environments:
- Closed book: the agent only sees the question. It gets no docs, no source tree, no package metadata, and no runtime to experiment with.
- Lab only: the agent gets no docs, no source tree, and no static study corpus. It can only learn by running small probes against the pinned tool/runtime through the allowed lab surface.
- Library only: the agent can read and search the official static material: pinned source tree, docs, package metadata, and exposed corpus files. It cannot run the package, import it, execute tests, or use a binary/runtime.
- Library + lab: the agent gets both surfaces. It can read the official static material and also run experiments against the pinned environment.
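The four environments form a 2x2 grid over two switches: whether the agent can read static material, and whether it can run things. Here is a minimal sketch of that grid in Python. The `Treatment` class and `allowed_surfaces` helper are hypothetical names for illustration only, not taken from the actual environment code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    """One cell of the 2x2 study design (hypothetical illustration)."""
    name: str
    library: bool  # may read docs, pinned source tree, package metadata
    lab: bool      # may run probes against the pinned tool/runtime


TREATMENTS = [
    Treatment("Closed book", library=False, lab=False),
    Treatment("Lab only", library=False, lab=True),
    Treatment("Library only", library=True, lab=False),
    Treatment("Library + lab", library=True, lab=True),
]


def allowed_surfaces(treatment: Treatment) -> list[str]:
    """List what the agent can touch; every treatment sees the question."""
    surfaces = ["question"]
    if treatment.library:
        surfaces.append("static corpus")
    if treatment.lab:
        surfaces.append("runtime probes")
    return surfaces


for t in TREATMENTS:
    print(f"{t.name}: {', '.join(allowed_surfaces(t))}")
```

Framing it this way makes it clear that “closed book” is the baseline and that the library and lab effects can each be read off by flipping one switch at a time.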
| Treatment | Library | Lab | DSPy | OpenClaw | Mean |
|---|---|---|---|---|---|
| Closed book | no | no | 51.70 | 6.35 | 33.56 |
| Lab only | no | yes | 78.07 | 9.70 | 50.72 |
| Library only | yes | no | 82.97 | 60.30 | 73.90 |
| Library + lab | yes | yes | 85.40 | 62.05 | 76.06 |
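A note on reading the Mean column: it is not the plain average of the DSPy and OpenClaw columns. The reported values do match a 60/40 weighting toward DSPy to within rounding. One plausible explanation is that the mean is taken per question and the two benchmarks differ in size, but that is an assumption inferred from the numbers, not something stated in the table. A quick check, with the 0.6 weight as the assumed value:

```python
# Scores copied from the table: (DSPy, OpenClaw, reported Mean).
SCORES = {
    "Closed book": (51.70, 6.35, 33.56),
    "Lab only": (78.07, 9.70, 50.72),
    "Library only": (82.97, 60.30, 73.90),
    "Library + lab": (85.40, 62.05, 76.06),
}

# Assumption: inferred by fitting the table, not a documented value.
DSPY_WEIGHT = 0.6


def weighted_mean(dspy: float, openclaw: float, w: float = DSPY_WEIGHT) -> float:
    """Weighted average with weight w on DSPy and (1 - w) on OpenClaw."""
    return w * dspy + (1 - w) * openclaw


for name, (dspy, openclaw, reported) in SCORES.items():
    simple = (dspy + openclaw) / 2
    weighted = weighted_mean(dspy, openclaw)
    print(f"{name}: simple={simple:.2f} weighted={weighted:.2f} reported={reported:.2f}")
```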
So these results suggest a few things:
- (Obviously) having access to both the documentation and the tool itself (library and lab) allows the agent to acquire the most expertise on DSPy and OpenClaw.
- GPT 5.5 appears to have much better built-in expertise in DSPy than in OpenClaw (closed book scores of 51.70 vs. 6.35). This makes sense, given that DSPy is many years old, and that OpenClaw is both very new and has had three different names in its brief history.
- “Lab only” (tools but no documentation) led to a large improvement for DSPy (51.70 to 78.07) but only a small one for OpenClaw (6.35 to 9.70). “Library only” (docs and source code but no runnable package) scored slightly higher than “lab only” for DSPy, but dramatically higher for OpenClaw. I haven’t done an in-depth analysis of this, but again my suspicion is that it is related to the newness of OpenClaw, the quality of the documentation, and the structure of the StudyBench questions.
- Today’s SOTA coding agents may already be highly effective learners when given access to documentation, source code, and runnable packages.
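To make the comparisons above concrete, here is a small sketch that computes the point gains from flipping one switch at a time, using only the scores in the table. The `gain` helper is a hypothetical name, and these are raw differences from single runs, so they carry no error bars.

```python
# Scores from the table, keyed by (library, lab) access.
SCORES = {
    (False, False): {"DSPy": 51.70, "OpenClaw": 6.35},   # Closed book
    (False, True): {"DSPy": 78.07, "OpenClaw": 9.70},    # Lab only
    (True, False): {"DSPy": 82.97, "OpenClaw": 60.30},   # Library only
    (True, True): {"DSPy": 85.40, "OpenClaw": 62.05},    # Library + lab
}


def gain(benchmark: str, before: tuple[bool, bool], after: tuple[bool, bool]) -> float:
    """Point difference in score when moving from one treatment to another."""
    return SCORES[after][benchmark] - SCORES[before][benchmark]


for bench in ("DSPy", "OpenClaw"):
    lab_alone = gain(bench, (False, False), (False, True))
    library_alone = gain(bench, (False, False), (True, False))
    lab_on_top = gain(bench, (True, False), (True, True))
    print(
        f"{bench}: lab alone {lab_alone:+.2f}, "
        f"library alone {library_alone:+.2f}, "
        f"lab on top of library {lab_on_top:+.2f}"
    )
```

Read this way, adding the lab on top of the library is worth only a couple of points on either benchmark, while the library alone accounts for most of the gain, especially on OpenClaw.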
I did not attempt to control the budget (time, tokens, turns) or to vary the reasoning level (fixed at xhigh), so I cannot compute either the “expertise” or the “intelligence” in the framing of the Machine Studying post. These are obvious next steps, in addition to trying out some different models and harnesses.
My environments and results are available at: github.com/rawwerks/studybench-lab-and-library
Until next time…study hard!