Beyond Prompting: Harness Engineering for Learning Systems

What problem was I trying to solve?

A couple of months ago, I wrote an article that discussed my attempts to learn with AI tools, the pain points I encountered, and the system I built to address these pain points.

To recap, in late 2025 I started to seriously consider how AI tools like ChatGPT could accelerate my learning but, in my experience of using them out of the box, I ran into some rather large failure modes:

Lack of error tracking
Statelessness
No information verification

As a result, I ended up building a custom command in Claude Code called /ultralearn to address these issues. I added extra software around it to verify information, encode deliberate difficulties into sessions, and track misconceptions. Since then, I've added more tooling to extend its capabilities further, some of which I'll discuss later. This practice of wrapping LLMs with extra tooling has a name: harness engineering.

What is harness engineering?

Harness engineering refers to the practice of designing the environment an LLM operates in to extend its capabilities. It sounds complicated, but it can be as simple as creating a CLAUDE.md file that you use to give Claude context about your project!

Many developers stop at custom prompts, but harness engineering allows you to design your entire interaction with the LLM. The prompt is just a single component of this.

The term was coined by Mitchell Hashimoto on his blog in early 2026, where he applied the idea to improving how well AI agents generate code.

However, this core idea of designing the environment around the LLM to enhance its capabilities can apply to domains other than coding! In my case, I've been using it to design my ultralearning system.

How did I apply harness engineering to my learning system?

I built an ultralearning system on top of Claude Code. It started simple: a custom command that instructed Claude to research a topic I wanted to learn and then question me about it instead of dumping the information on me (this takes advantage of the generation effect).

However, over time, it started to grow as I wanted to address the pain points that I experienced when trying to learn with other AI tools, which I'll be going over below. This is what the current architecture looks like:

Ultralearning System Architecture v2

Entry Point
/ultralearn "topic": User launches a learning session from the CLI.

Main Session: The Coach
Socratic Tutor: The session itself acts as the coach. Drives metalearning, Socratic questioning, deliberate practice, and review. Now with plateau detection and self-reflection.

Core subagents (delegated via Task tool)
Verification Gate: Fact-checks claims against docs
Assessment Agent: Generates typed, calibrated questions
Reference Clerk: Produces deep-dive explanations

Specialized subagents (v2)
Demo Generator: Interactive HTML demos for persistent weak spots
Capstone Architect: Portfolio projects from mastery profile

Enforcement Hooks (v2)
Verification Counter: Warns at 5+ unverified messages
Checkpoint Guard: Blocks unverified cards in saves
Cross-Ref Enforcer: Blocks stale cross-references
Subagent Tracker: Monitors agent delegation patterns

Deterministic Python CLI Tools (v2)
SRS Engine: SM-2 scheduling (init, sync, due, grade, forecast)
Assessment Engine: Mastery-aware typed question generation
Weak Spot Writer: WS/CE/CP entries with file boundary enforcement
Plateau Detector: Identifies stalling progress from SRS data

Persistence
Artifact Clerk: Manages all file writes across 6 artifacts. Validates format and triggers git commits.

Version Control
Learning Git: Safe-only git (init, add, commit). No destructive commands. Called by the clerk.

Artifacts
Knowledge Map: 5-level mastery ladder per concept
Weak Spots: Categorized learner errors (WS-N)
Coach Errors: CE/CP accountability log
Cards + SRS: Typed flashcards with SM-2 sidecar
Journal: Session notes, savepoints, reflections
Cross-Refs: Sharded concept registry across projects

Scheduling Engine
SRS Engine (SM-2): Mastery-aware scheduling. Maps concept level to question types: free recall → conceptual → application → analysis → transfer.

Continuity
Next Session (Resume): Loads savepoint, identifies due cards, cross-refs stale concepts, greets with full context.

You start a session with /ultralearn "topic". The skill activates and the main session itself becomes the Socratic coach — there is no separate coach agent.
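To make the Continuity and Scheduling Engine entries in the architecture more concrete, here is a rough sketch of two things a resume step might do: pick the cards that are due, and choose a question type from a concept's mastery level. This is not the actual implementation. The field names `next_review` and `retired` come from the SRS file shown later in this post, while the function names and the 1-to-5 level numbering are assumptions made for illustration.

```python
from datetime import date

# Question types in the order listed in the architecture. Mapping them to
# mastery levels 1-5 is an assumption made for this sketch.
QUESTION_TYPES = ["free recall", "conceptual", "application", "analysis", "transfer"]


def question_type_for(level: int) -> str:
    """Map a 1-5 mastery level to a question type, clamping out-of-range values."""
    index = min(max(level, 1), len(QUESTION_TYPES)) - 1
    return QUESTION_TYPES[index]


def due_cards(sidecar: dict, today: date) -> list[str]:
    """Return IDs of non-retired cards whose next_review is today or earlier."""
    due = []
    for card_id, card in sidecar["cards"].items():
        if card.get("retired"):
            continue
        if date.fromisoformat(card["next_review"]) <= today:
            due.append(card_id)
    return sorted(due)


if __name__ == "__main__":
    sidecar = {
        "cards": {
            "card-1": {"next_review": "2026-06-03", "retired": False},
            "card-2": {"next_review": "2026-05-30", "retired": False},
            "card-3": {"next_review": "2026-05-01", "retired": True},
        }
    }
    print(due_cards(sidecar, date(2026, 6, 1)))  # only card-2 is due and active
    print(question_type_for(3))
```

Keeping both functions pure (data in, data out) means the session only has to report their results, which fits the deterministic-tools idea in the architecture.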

What does harness engineering unlock for me?

Plateau Detection

Learning plateaus are points in your learning journey where progress stagnates. A plateau is usually described as a feeling that your understanding isn't improving in any significant way, and it can happen for a number of reasons. For me, /ultralearn surfaced this feeling much faster: the daily graded assessments made it apparent that some concepts in a project just wouldn't click, or that some assessment items were easy to get right because I had memorized the answer, not because I understood it.

To overcome these plateaus, I needed to do two things:

Detect when I reach a plateau
Execute some intervention to get me out of the plateau

Intervention was rather straightforward. I'm initially taught the material via Socratic dialogue with the LLM, and the same approach is used to reintroduce it if an assessment shows I still struggle with the topic. However, when repeating this over multiple sessions keeps yielding the same results, it's clear the approach isn't working. In that case, the way the information is presented changes (e.g. a visual demonstration instead of a text explanation) to try to make it click.

Detection was the difficult part. Often, a learning plateau is only noticed when you "feel" that you're in one, which is hard to do when you're not actively looking out for it.

However, with harnesses, this problem became solvable. The learning artifacts in a project are rife with information that keeps track of my progress:

The Spaced Repetition System (SRS) artifact contains flashcards and a history of my grades for each
The Weak Spot artifact lists active weak spots and how long they've been active
The Journal artifact tracks the history of learning sessions and the teaching mode used in each

With this information, I expose a plateau detection tool to the LLM. It's just a Python script that reads these artifacts, checks for conditions that signal a learning plateau, and returns the type of intervention the LLM should perform to overcome it.

JSON

{
  "version": "1.0",
  "algorithm": "sm2",
  "topic": "cards.md",
  "created": "2026-05-04",
  "last_sync": "2026-05-29",
  "cards": {
    "card-1": {
      "card_number": 1,
      "question_preview": "What are the two sub-problems of authentication?",
      "question_hash": "e5f231f3",
      "easiness_factor": 2.6,
      "interval": 16,
      "repetitions": 3,
      "lapses": 0,
      "status": "review",
      "last_review": "2026-05-18",
      "next_review": "2026-06-03",
      "created": "2026-05-04",
      "retired": false,
      "review_history": [
        {
          "date": "2026-05-05",
          "quality": 4,
          "ef": 2.5,
          "interval": 1
        },
        {
          "date": "2026-05-06",
          "quality": 5,
          "ef": 2.6,
          "interval": 6
        },
        {
          "date": "2026-05-18",
          "quality": 4,
          "ef": 2.6,
          "interval": 16
        }
      ]
    },
    // ...other cards
  }
}
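As an aside, the ef and interval values in card-1's review history are consistent with the standard SM-2 update rule. The sketch below is a minimal version of that rule, not the system's actual SRS engine: the `CardState` class and `sm2_review` function are names made up for illustration, and computing the interval from the already-updated easiness factor and rounding it to whole days are assumptions (implementations differ on both).

```python
from dataclasses import dataclass


@dataclass
class CardState:
    easiness_factor: float = 2.5
    interval: int = 0      # days until the next review
    repetitions: int = 0   # consecutive successful reviews
    lapses: int = 0        # failed reviews so far


def sm2_review(state: CardState, quality: int) -> CardState:
    """Apply one SM-2 review with a 0-5 quality grade and return the new state."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")

    if quality < 3:
        # Failed recall: restart the repetition sequence and keep the
        # easiness factor unchanged, as in the original SM-2 description.
        return CardState(state.easiness_factor, 1, 0, state.lapses + 1)

    # Standard SM-2 easiness update, floored at 1.3.
    delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
    ef = max(1.3, round(state.easiness_factor + delta, 2))

    if state.repetitions == 0:
        interval = 1
    elif state.repetitions == 1:
        interval = 6
    else:
        interval = round(state.interval * ef)

    return CardState(ef, interval, state.repetitions + 1, state.lapses)


if __name__ == "__main__":
    state = CardState()
    for grade in (4, 5, 4):  # the grades recorded for card-1
        state = sm2_review(state, grade)
        print(state.easiness_factor, state.interval)
```

Running it with the three recorded grades yields (2.5, 1), (2.6, 6), and (2.6, 16), matching the history in the file.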

With this workflow, plateau detection is achieved automatically with the data to back it up. Also, it's completely deterministic! The LLM only calls the tool and responds to the output provided.
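To give a feel for what such a tool might look like, here is a minimal sketch that works only from the SRS file's review history. The real detector also reads the weak spot and journal artifacts, and its actual conditions aren't described here, so the rule used below (a card whose last three reviews all scored below 3), the threshold values, and the "change-modality" intervention label are all assumptions.

```python
import json
from pathlib import Path

# Hypothetical thresholds; tune them against your own review data.
STALL_WINDOW = 3      # how many of the most recent reviews to inspect
PASSING_QUALITY = 3   # SM-2 grades below this count as failed recall


def stalled_cards(sidecar: dict) -> list[str]:
    """Return IDs of active cards whose last STALL_WINDOW reviews all failed."""
    stalled = []
    for card_id, card in sidecar.get("cards", {}).items():
        if card.get("retired"):
            continue
        recent = card.get("review_history", [])[-STALL_WINDOW:]
        if len(recent) == STALL_WINDOW and all(
            review["quality"] < PASSING_QUALITY for review in recent
        ):
            stalled.append(card_id)
    return sorted(stalled)


def detect_plateau(sidecar: dict) -> dict:
    """Produce a small, deterministic report the LLM can act on."""
    cards = stalled_cards(sidecar)
    return {
        "plateau": bool(cards),
        "stalled_cards": cards,
        # Made-up label meaning "present the material in a different way".
        "intervention": "change-modality" if cards else "none",
    }


def main(path: str) -> None:
    """CLI-style entry point: read the SRS file and print the report as JSON."""
    sidecar = json.loads(Path(path).read_text(encoding="utf-8"))
    print(json.dumps(detect_plateau(sidecar), indent=2))


if __name__ == "__main__":
    example = {
        "cards": {
            "card-1": {"retired": False, "review_history": [
                {"quality": 4}, {"quality": 5}, {"quality": 4}]},
            "card-6": {"retired": False, "review_history": [
                {"quality": 2}, {"quality": 2}, {"quality": 2}]},
        }
    }
    print(detect_plateau(example))
```

Because the same input always produces the same report, the LLM's job shrinks to calling the tool and carrying out whatever intervention it names.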

Stronger Verification Enforcement

A perennial problem I face with this system is the inconsistent enforcement of information verification. There's a verification gate subagent that is supposed to perform a web search to verify the claims that the LLM will teach in a specific session, but there are instances where verification doesn't fire when it should.

For example, if I have a follow-up question on a concept that leads to a tangential discussion, this can lead to the LLM putting forth claims that may not have been part of those initially verified.

To combat this, I leveraged hooks, which are user-defined scripts that fire at specific points in a session's lifecycle. In the current implementation, every five turns after a call to the verification gate subagent, a warning is emitted to nudge the LLM towards running another round of verification. However, this is only a soft guard: the session can continue without running another verification.
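The counting logic behind a soft guard like this can be very small. The sketch below shows only that logic, under stated assumptions: the event labels "verification" and "turn", the state dictionary, and the function name are hypothetical, and how the script gets wired into Claude Code's hook events (and where the state is persisted between calls) is deliberately left out.

```python
from typing import Optional

WARN_EVERY = 5  # warn on every fifth turn since the last verification


def record_event(state: dict, event: str) -> tuple[dict, Optional[str]]:
    """Update the counter for one event and return (new_state, warning_or_None).

    `event` is "verification" when the verification gate ran, or "turn" for
    any other assistant turn. Both labels are assumptions for this sketch.
    """
    if event == "verification":
        return {"turns_since_verification": 0}, None

    turns = state.get("turns_since_verification", 0) + 1
    new_state = {"turns_since_verification": turns}
    if turns % WARN_EVERY == 0:
        warning = (
            f"{turns} turns since the last verification: "
            "consider running the verification gate again."
        )
        return new_state, warning
    return new_state, None


if __name__ == "__main__":
    state: dict = {}
    for event in ["verification"] + ["turn"] * 5:
        state, warning = record_event(state, event)
        if warning:
            print(warning)  # soft guard: only a message, nothing is blocked
```

Turning this into a hard guard would mean blocking the turn instead of printing a message, which trades flexibility for stricter enforcement.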

Revamped Weak Spot Tracking

In the initial version of this system, I had an artifact that tracked my misconceptions about a topic so that the LLM would know what to drill down on in a session. However, as I progressed through sessions, I came to realize that "misconceptions" was a misnomer. What I should really be tracking are weak spots. Misconceptions are one type of weak spot, but others exist as well, such as fragile recall of a concept or the inability to apply a concept in practice.

As a result, I changed the misconceptions artifact (which is just a misconceptions.md file) to a weak spots artifact (weak-spots.md) instead and had each weak spot's category recorded along with other metadata. This allows the LLM to formulate the appropriate type of intervention to drill down on this weak spot.

weak-spots.md

## WS-1 — Auth stack layer numbering — fragile recall

**Category:** fragile-recall
**Session:** 2
**Last tested:** S6
**What happened:** Learner blanked on layers 2-4 in card 5 SRS review (S2), mislabeled session hijacking as layer 4 (should be 3) in card 6 (S2). **S3 escalation:** Card 6 lapsed again in SRS batch 2 (3rd occurrence, mislabeled layer 2 instead of 3). Drill correction with new mnemonic "Hijack the Session, Hit layer 3" — re-test correct. **S4 improvement signal:** Card 6 scored 4 (clean pass after 3-day gap), first clean pass after 2 lapses. Mnemonic "Hijack the Session, Hit layer 3" holding across retrieval contexts.
**Correct model:** Auth stack layers: (1) Transport security (HTTPS), (2) Credential verification (hashing/comparison), (3) Session/token management, (4) Access control. Each layer depends on the one below. Hashing protects layer 2. Session hijacking attacks layer 3. **Working mnemonic (S3+):** "Hijack the Session, Hit layer 3" — context-specific (attack-outcome mapping) vs abstract (TCSA). Previous TCSA mnemonic (S2) did not prevent re-lapse, but new mnemonic is holding.
**Why it matters:** Layer numbering confusion creates gaps in threat modeling. Misplacing attacks to wrong layers means missing the correct mitigation. In code review and architecture discussions, layer confusion will surface as inaccuracy.
**Cards:** card-5, card-6
**Concepts:** auth-stack, layers, session-hijacking, credential-verification
**Status:** resolved

### History
- **S2:** First observed: blanked on layers 2-4 (card 5, score 2), mislabeled layer 3 as 4 (card 6, score 2). Mnemonic TCSA introduced.
- **S3:** Card 6 lapsed AGAIN (3rd occurrence, score 2, mislabeled layer 2 instead of 3). Drill with new mnemonic "Hijack the Session, Hit layer 3" — re-test correct. Re-lapse after drilling suggests need for deeper encoding (spaced retrieval, retrieval-practice variability, or cue-dependence).
- **S4:** Card 6 scored 4 (first clean pass after 2 lapses, 3-day gap). Mnemonic "Hijack the Session, Hit layer 3" holding across retrieval contexts. Improvement signal — consider downgrading from "active" if next session also clean.
- **S5:** Card 6 scored 4 (2nd consecutive clean pass, 2-day gap). Mnemonic stable. **Status recommendation:** Downgrade from "improving" to "resolved" pending S6 confirmation.
- **S6:** Card 6 scored 5 (3rd consecutive clean pass, 7-day gap). Q-7 warm-up assessment correct layers (4/5). Mnemonic "Hijack the Session, Hit layer 3" fully stable. **Resolved.**

---


The file is updated whenever the LLM detects a poor answer or explanation from me during an exercise or assessment. When that happens, a Python tool (weak_spot_writer.py) is invoked to write to the file, which keeps the formatting consistent.
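For illustration, a writer along these lines could look like the sketch below. It is not the real weak_spot_writer.py: it renders only a subset of the fields from the example entry, the function names are made up, and of the category names only "fragile-recall" appears in the example (the other two are guesses based on the weak spot types mentioned earlier).

```python
import re
from pathlib import Path

# "fragile-recall" appears in the example entry; the other two names are guesses.
CATEGORIES = {"misconception", "fragile-recall", "application-gap"}
HEADING = re.compile(r"^## WS-(\d+)\b", re.MULTILINE)
DASH = "\u2014"  # the dash character used in the example entry's heading


def next_id(existing: str) -> int:
    """Return the next WS number based on the headings already in the file."""
    numbers = [int(n) for n in HEADING.findall(existing)]
    return max(numbers, default=0) + 1


def format_entry(ws_id: int, title: str, category: str, session: int,
                 what_happened: str, correct_model: str) -> str:
    """Render one entry using a subset of the fields from the example entry."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return (
        f"## WS-{ws_id} {DASH} {title}\n\n"
        f"**Category:** {category}\n"
        f"**Session:** {session}\n"
        f"**What happened:** {what_happened}\n"
        f"**Correct model:** {correct_model}\n"
        f"**Status:** active\n\n"
        f"### History\n"
        f"- **S{session}:** First observed.\n\n"
        f"---\n\n"
    )


def append_weak_spot(path: Path, **fields) -> int:
    """Append a new entry to the file without touching existing content."""
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    ws_id = next_id(existing)
    entry = format_entry(ws_id, **fields)  # validate before opening the file
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return ws_id
```

The LLM only supplies the field values. The ID, the layout, and the append-only behavior are decided by the script, so a malformed category is rejected before anything reaches the file.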

Self-improving Tutor

As much as I've been obsessing over making my system run smoothly, every session seems to surface a new problem to solve. For example, the LLM tests me on a concept that hasn't been covered yet. I make it a habit to investigate these issues in the same session they surface in, while the full context is still available, but it's easy to forget to do so.

To address this, I instructed the LLM to automatically log the errors it made to a coach-errors.md artifact when a learning session ends. This improves observability, but for the tutor to actually get better, the LLM needs to act on these errors. For that, a complementary coach-insights.md artifact holds the LLM's reflections on the observed errors and how to avoid them in future sessions.

coach-errors.md

## CP-1 — Tested untaught material in retrieval pulse

**Kind:** CP (process failure)
**Session:** S3
**Description:** Asked about bcrypt 72-byte input limit during end-of-session retrieval pulse. This concept was in pre-session verification claims but was never explicitly taught or carded. Violated protocol: ad-hoc retrieval questions should only test covered material.
**Concepts:** 72-byte-limit, retrieval-pulse
**Related cards:** none
**Impact:** Low — learner correctly identified the gap ("I don't recall discussing this"). No wrong information was taught.
**Corrective action:** Check cards.md before composing retrieval pulse questions.

### History
- **S3:** First occurrence

---

## CP-2 — Tested untaught material (72-byte bcrypt limit)

**Session:** 3
**What happened:** Retrieval pulse Q1 asked about bcrypt's 72-byte password truncation limit. Learner answered correctly ('trimmed') but noted: 'This was never taught.' Material was never mentioned in M1 or M2 content.
**Root cause:** Coach assumed this OWASP critical fact (plan.md mentions it) was covered in M2 conceptual material. Not explicitly taught; learner reverse-engineered it from OWASP docs or implicit knowledge.
**Correction:** Add to M2 agenda or explicitly surface in next session: 'bcrypt has a critical limitation: passwords over 72 bytes are silently truncated. This is why Argon2id is safer (no limit).' Include in card set or callout.
**Why it matters:** Silent truncation is a real vulnerability (password 'correct_password_x' truncates same as 'correct_password_y'). Critical for security code review. Learner should not reverse-engineer OWASP facts; they should be taught.
**Follow-up:** Include 72-byte limit in next M2 review or M3 revisit. Create card if needed (low priority — learner clearly understands bcrypt sufficiently).
**Source:** Session 3 retrieval pulse Q1, learner's note: 'never taught'
**Status:** monitoring


It sounds simple, just two Markdown files that the LLM reads from and maintains, but it's powerful because it lets the tutor improve itself from one session to the next.

Idea

I got the idea from seeing how an OpenClaw plugin implemented a skill for self-improving agents. I'd recommend checking it out as it's really cool stuff!

How to start building your own harness?

The first step is to identify points of improvement for your interactions. You fire up a session, ask the LLM to perform a task (e.g. teach me X, refactor feature Y), and then review its output. I find it useful to leave text files containing notes of my interactions, especially if the issue is something I encountered for the first time and I'm not sure if it's going to happen again.

Once you have this information, ask yourself the following questions:

What did I not like about the response and why?
What caused the LLM to generate such a response?
What can I change about the environment so that the next time I give the LLM the same question/task, it will respond better?

For questions two and three, you can ask the LLM for input and suggestions. This is similar to how self-improving agents work but in a more manual fashion.

Finally, tweak the environment to implement the fixes. This could include writing tools for the LLM to invoke, adding hooks to automatically enforce actions at specific checkpoints, or simply adding another text file for the LLM to read containing reminders on how to avoid these issues next time.

Warning

If you find yourself updating certain files frequently (e.g. in /ultralearn, coach-errors.md is updated every session), you may be tempted to offload the update to the LLM.

This can work, but don't let prompts be the only guardrail for updates. You can tell the LLM to follow a specific format, but it may not always do so faithfully, especially when its context is filled with other information and the instruction gets lost. The LLM may end up writing in an inconsistent format or even overwriting previous information!

What worked for me is creating a script that does the writing and exposing it as a tool to the LLM. The LLM calls the tool with the input parameters it generates, and the tool writes to the file in the enforced format. This keeps your writes deterministic and consistently formatted.
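As a sketch of that pattern, the script below exposes a small command-line interface for appending to coach-errors.md. The flag names, the trimmed-down entry layout, and the per-kind numbering are assumptions for illustration, not the actual tool. The kinds CE and CP come from the architecture above.

```python
import argparse
from pathlib import Path

DASH = "\u2014"  # the dash character used in the example entry headings


def build_parser() -> argparse.ArgumentParser:
    """Define the parameters the LLM must supply; argparse rejects anything else."""
    parser = argparse.ArgumentParser(description="Append a coach error entry.")
    parser.add_argument("--file", type=Path, default=Path("coach-errors.md"))
    parser.add_argument("--kind", choices=["CE", "CP"], required=True)
    parser.add_argument("--session", type=int, required=True)
    parser.add_argument("--title", required=True)
    parser.add_argument("--description", required=True)
    return parser


def append_entry(args: argparse.Namespace) -> str:
    """Append one entry in a fixed layout and return its ID."""
    existing = args.file.read_text(encoding="utf-8") if args.file.exists() else ""
    # Number entries per kind by counting the headings already present.
    count = existing.count(f"## {args.kind}-")
    entry_id = f"{args.kind}-{count + 1}"
    entry = (
        f"## {entry_id} {DASH} {args.title}\n\n"
        f"**Kind:** {args.kind}\n"
        f"**Session:** S{args.session}\n"
        f"**Description:** {args.description}\n\n"
        f"---\n\n"
    )
    with args.file.open("a", encoding="utf-8") as handle:  # append-only
        handle.write(entry)
    return entry_id


def main(argv: list[str]) -> str:
    """Parse the given arguments, write the entry, and return the new ID."""
    return append_entry(build_parser().parse_args(argv))
```

A real script would call `main(sys.argv[1:])` and print the returned ID. Since the file is only ever opened in append mode and the layout lives in code, the LLM can't reformat or overwrite earlier entries through this tool, whatever else is in its context.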

Hope you learned something! Happy coding🚀

bs_code
