
/ultralearn: Building an AI-Powered Self-Learning System with Claude Code

Education · Ultralearning · Large Language Models

Trying to learn with ChatGPT

As someone who's an avid fan of learning, ChatGPT was a game-changer as I could ask it to explain any concept to me and test me on it. I would have a 30-minute conversation with it to learn about some technical topic and I felt that, by the end of it, I was an expert!

However, every chat eventually hits its limit, and the new chat I start doesn't know where I left off. I have to re-explain my level and re-cover old ground, leaving little opportunity to build on prior sessions before hitting the next limit.

The issues didn't stop there. As my sessions progressed, I ran into more failure modes. The biggest were:

  1. Statelessness - Every new chat session started from scratch.
  2. No error tracking - The AI has no record of my knowledge gaps or current misconceptions.
  3. No verification - The AI could easily teach a concept wrong, and I wouldn't know until later, if at all!

So, I decided to build a learning system on top of Claude Code that addresses all of these issues.

The system's architecture

Here is the system at a high level:

As you can see, the system is composed of multiple agents working together when a /ultralearn command is invoked. These agents are:

  1. Coach - Handles the flow of the ultralearning project, including the project setup and running the tutoring sessions.
  2. Verification Gate - An agent that makes sure the claims being covered by the coach are grounded in truth.
  3. Artifact Clerk - An agent that saves the structured learning data through a bundle of artifacts (more on these later!).
  4. Assessment Agent - A lightweight agent that coordinates the assessments conducted in the sessions.

The typical flow of a session is:

  1. User invokes session via /ultralearn <topic>
  2. Coach checks for existing sessions via learning artifacts. If none exist, a new session is started. Otherwise, the coach picks up where the session left off.
  3. Verification Gate fact-checks the topics the coach will bring up in the current session before teaching begins.
  4. Coach runs the session.
  5. Once the session is done, Artifact Clerk saves the structured learning data.
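To make the flow concrete, here's a minimal Python sketch of the dispatch logic. The function name, directory layout, and step labels are illustrative assumptions, not the actual implementation:

```python
from pathlib import Path

def run_ultralearn(topic: str, root: Path = Path("learning")) -> dict:
    """Illustrative sketch of the /ultralearn flow; names and layout
    are assumptions, not the real system's internals."""
    project = root / topic.lower().replace(" ", "-")
    resuming = (project / "plan.md").exists()   # step 2: check for artifacts
    steps = [
        "load artifacts" if resuming else "interview learner & build plan",
        "verification gate: fact-check planned claims",   # step 3
        "coach: run tutoring session",                    # step 4
        "artifact clerk: save session state",             # step 5
    ]
    return {"resuming": resuming, "steps": steps}
```

The important branch is the first one: everything downstream behaves the same whether the project is new or resumed, which is what makes multi-session continuity cheap.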

Walking through an ultralearning session

Now that you have a high-level understanding of how the system works, here's how it looks in action on a topic I'm currently learning: Retrieval-Augmented Generation (RAG) pipelines.

Starting an ultralearning project

We kick off a Claude Code session and fire off a /jv:ultralearn command, feeding it the topic we want to learn. In this case, RAG pipelines! The first thing that happens on a new project is that we're asked a set of questions that aim to identify:

  1. What our current understanding of the topic is
  2. What success looks like
  3. Where we plan to use this information
  4. What we expect to be the hardest part of learning the topic

You can see my answers to these questions below.

Starting an ultralearning project

Initializing an ultralearning project

Once this context is provided, the Coach builds a learning plan for the project. It calls the Verification Gate agent to ensure that the learning plan is grounded in actual information.

Ultralearning coach generating a grounded curriculum

Coach planning curriculum verified by Verification Gate

Once the learning plan is generated, we're off to the races!

learning-plan.md
# RAG Pipelines — Ultralearning Plan

> **Goal:** Build production-grade RAG systems from scratch
> **Learner profile:** Strong RAG evaluation background (RAG Triad, RAGAS, DeepEval, retrieval metrics all Solid+). Has read about pipeline components but never built end-to-end. Self-identified weak spot: chunking and vector storage.
> **Created:** 2026-03-25

## Prior Knowledge (from cross-references)

The following concepts are **already covered** in sibling projects and will NOT be re-taught:
- RAG Triad (context relevance, groundedness, answer relevance) — Solid (rag-triad)
- RAGAS framework — Solid (rag-triad)
- DeepEval framework — Developing (rag-triad)
- Retrieval metrics (Precision@k, Recall@k, MRR, MAP, NDCG) — Solid (rag-retrieval-metrics)
- LLM-as-judge methodology — Solid (rag-triad)
- CI/CD evaluation gates — Solid (rag-retrieval-metrics)
- Statistical testing for eval (paired t-test, McNemar, power analysis) — Solid (rag-retrieval-metrics)
- Evaluation dataset construction — Solid (rag-retrieval-metrics)

These will be **referenced** when connecting pipeline stages to evaluation, not re-taught.

## Metalearning Map

### Core Concepts (must know)
- Document loading and preprocessing
- Chunking strategies (fixed-size, recursive, semantic, structure-aware)
- Embedding models (selection, trade-offs, dimensionality)
- Vector stores (indexing, search, metadata filtering)
- Retrieval strategies (dense, sparse, hybrid)
- Prompt construction (context assembly, system prompts)
- Generation with retrieved context

### Intermediate Concepts (build competence)
- Chunk overlap and boundary handling
- Embedding model benchmarks (MTEB) and selection criteria
- Vector index types (HNSW, IVF, flat) and their trade-offs
- Hybrid retrieval with Reciprocal Rank Fusion (RRF)
- Reranking with cross-encoders
- Metadata filtering strategies
- Parent-child chunk relationships (small-to-big retrieval)

### Advanced Concepts (production-grade)
- Query transformation (decomposition, HyDE, step-back prompting)
- Agentic RAG (tool-using retrieval, routing, self-correction)
- Caching layers (semantic cache, exact-match cache)
- Guardrails (input validation, output filtering, hallucination detection)
- Observability and tracing (per-stage latency, retrieval quality monitoring)
- Pipeline evaluation integration (connecting to your existing eval knowledge)

## Skill Tree (dependencies)

```
Level 1: Foundations
 Document Loading ──┐
                    ├──> Chunking Strategies ──> Embedding ──> Vector Store Indexing
 Text Preprocessing ┘

Level 2: Core Pipeline
 Vector Store Indexing ──> Dense Retrieval ──> Context Assembly ──> Generation

 Sparse Retrieval (BM25) ──> Hybrid Retrieval (RRF)

Level 3: Optimization
 Hybrid Retrieval ──> Reranking (cross-encoders)
 Chunking ──> Parent-Child Chunks (small-to-big)
 Retrieval ──> Metadata Filtering

Level 4: Advanced Retrieval
 Core Pipeline ──> Query Transformation (decomposition, HyDE)
 Core Pipeline ──> Agentic RAG (routing, self-correction)

Level 5: Production Hardening
 Full Pipeline ──> Caching ──────┐
 Full Pipeline ──> Guardrails ───├──> Production-Grade System
 Full Pipeline ──> Observability ┘
 Prior Knowledge (eval) ──> Pipeline Evaluation Integration
```

## Learning Path with Milestones

### Milestone 1: First Working Pipeline (Sessions 1-3)
- **S1:** Document loading, chunking strategies (deep dive — your weak spot)
- **S2:** Embeddings and vector store fundamentals (deep dive — your weak spot)
- **S3:** End-to-end pipeline: load → chunk → embed → store → retrieve → generate

### Milestone 2: Retrieval Optimization (Sessions 4-6)
- **S4:** Hybrid retrieval (BM25 + dense + RRF)
- **S5:** Reranking and parent-child chunks
- **S6:** Metadata filtering, query transformation (HyDE, decomposition)

### Milestone 3: Production Hardening (Sessions 7-9)
- **S7:** Agentic RAG patterns (routing, self-correction loops)
- **S8:** Caching, guardrails, error handling
- **S9:** Observability, evaluation integration (leverage your existing eval expertise)

### Milestone 4: Capstone (Session 10)
- **S10:** Build a complete production-grade RAG system integrating all concepts

## Technique Priorities

| Technique | Priority | Application |
|---|---|---|
| **Deliberate practice** | Highest | Build real pipelines at every session — no passive reading |
| **Retrieval practice** | High | Recall pipeline decisions, trade-offs, and config choices from memory |
| **Interleaving** | High | Mix chunking/embedding/retrieval decisions (they're interdependent) |
| **Elaborative interrogation** | Medium | "Why this chunk size? Why this index type? What breaks if we change X?" |
| **Spaced repetition** | Medium | Flashcards for model names, config defaults, trade-off heuristics |

## Session Structure

- **Deep sessions (60-90 min):** New concepts + hands-on building (Sessions 1-10)
- **Spaced review (15-30 min):** SRS-driven card review + retrieval warm-up (between deep sessions)
- **Recommended pace:** 2-3 deep sessions per week with spaced reviews between
113
Note

Notice that the Coach automatically detects topics I already know, which partially informs it of my current understanding. This comes from a central registry that logs the concepts I've learned across all my ultralearning projects, preventing a new project from re-teaching things I already know.

Moving through the first session

The first session covers document loading and chunking strategies. However, notice how the concepts are communicated. Instead of a knowledge dump that explains each concept, I'm immediately asked a question. This is a form of elaborative interrogation: I have to generate an explanation for why a concept is true instead of just accepting it at face value.

For example, in the screenshot below you can see that question 2 is trying to get at what chunking is. But instead of just explaining it, the Coach asks me why we need chunking in the first place. By prompting me to come up with an answer, I have to do more than take the concept at face value. Even if I'm wrong, the attempt primes my brain to absorb the correct answer when it's revealed.

Starting your first ultralearning session

Starting first session

You can see that the session is just this back-and-forth between the LLM questioning me and me attempting to generate answers, followed by a reveal of the actual answer.

Continuing your first ultralearning session

First session progression

What happens when I get something wrong? Take a look at the screenshot below. I'm asked "What kind of document has no meaningful structural boundaries to exploit?", to which I answer "code" (skill issue, I know). However, the Coach doesn't just say I'm wrong. It provides feedback that corrects my mental model, so that I won't repeat the mistake next time.

Coach catching misconceptions

Coach catching misconception on meaningful structural boundaries

The same thing happens when I straight up say I can't recall the answer to something. Take a look at the question "State the three conditions that must ALL be true to justify choosing semantic chunking over recursive character splitting."

When I tell the Coach I can't recall, it provides the same actionable, timely feedback.

Coach catching errors

Coach catches retrieval failure and guides towards answer

Wrapping up the first session

When the session ends, I communicate that I want to wrap up. Once this is done, a few things happen:

  1. Flashcard generation - The Artifact Clerk is invoked to generate flashcards based on the topics covered. These will be reviewed next session, in a workflow very similar to Anki's. The Verification Gate agent is called again to ensure that the flashcards' answers are grounded in fact.
  2. Learning artifact generation - The Artifact Clerk also generates a number of learning artifacts to save the current session's state. These include:
    • plan.md - The full learning plan. This is essentially the curriculum.
    • journal/session-01.md - A file that documents what happened in the current session.
    • knowledge-map.md - Takes the concepts from the learning plan and identifies my level of understanding of each.
    • cards.md - The flashcards.
    • misconceptions.md - The misconceptions that the Coach identified so that they can be drilled in later sessions.
    • cards.srs.json - Metadata for the flashcards used to coordinate spaced retrieval.

Wrapping up an ultralearning session

Wrapping up the first ultralearning session
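Resuming works because the artifacts are just files on disk. Here's a hedged sketch of what "gaining context" from them might look like; the artifact names follow the list above, but the loader itself is hypothetical:

```python
import json
from pathlib import Path

# Artifact names from the wrap-up list; the loader itself is hypothetical.
ARTIFACTS = ["plan.md", "knowledge-map.md", "cards.md",
             "misconceptions.md", "cards.srs.json", "journal"]

def load_session_context(project: Path) -> dict:
    """Gather whatever artifacts exist so a fresh session can resume."""
    context = {}
    for name in ARTIFACTS:
        path = project / name
        if not path.exists():
            continue
        if path.suffix == ".json":
            context[name] = json.loads(path.read_text())   # SRS metadata
        elif path.is_dir():
            # journal/ holds one markdown file per past session
            context[name] = sorted(p.name for p in path.glob("session-*.md"))
        else:
            context[name] = path.read_text()
    return context
```

Anything missing is simply skipped, so a brand-new project and a ten-session-old one go through the same code path.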

Through these artifacts, a new session can pick up exactly where we left off! You can close the Claude Code session here, come back tomorrow, and keep going. No more explaining to the LLM what I already know. Let's see what the next session looks like.

The next session

The session is kicked off via /jv:ultralearn continue session, which will find the RAG pipeline learning project and use the learning artifacts to gain context on my progress. After this, the session starts off with retrieval practice via flashcards.

The Assessment Agent is called to handle the review session. The flow is similar to our first session. It's a back-and-forth between the agent asking me questions and me answering them.

Continuing an ultralearning session

Starting the next ultralearning session

Note

I didn't include the Assessment Agent in the architecture diagram since it's just a lightweight coordinator. This can technically be handled by the Coach. Separating it was personal preference.

Once the cards are tackled, the agent grades each card based on the quality of my answers. This grade determines when the card comes up for review again. You can see the grades I got in the screenshot below. The agent also provides a summary of my review performance, highlighting weak spots to be drilled in a future session.

Graded review for an ultralearning project

Getting grades for your review session
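The grade-to-schedule step resembles classic spaced-repetition scheduling. As an illustration, here's a simplified SM-2-style scheduler in Python; I'm assuming the cards.srs.json logic looks roughly like this, but the actual formula may differ:

```python
def schedule(quality: int, reps: int, interval: int, ease: float = 2.5):
    """Simplified SM-2-style scheduler (an assumption about how
    cards.srs.json drives review timing). quality is the 0-5 grade;
    returns (days until next review, consecutive successes, new ease)."""
    if quality < 3:
        return 1, 0, ease                  # failed recall: see it again tomorrow
    # good recall: ease drifts with answer quality, floored at 1.3
    ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    reps += 1
    if reps == 1:
        interval = 1
    elif reps == 2:
        interval = 6
    else:
        interval = round(interval * ease)  # intervals grow geometrically
    return interval, reps, ease
```

The effect is exactly what the agent's summary describes: well-answered cards drift far into the future, while shaky ones come back almost immediately.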

Once the review is done, you can either move on to the next topic or end the session. For this demo, we'll call it a day. This new session is saved, going through the same process as the first session.

Ending an ultralearning review session

Wrapping up review session


The design decisions that mattered

Focus on deliberate difficulties

This controls the flow of each session. The workflow leans on techniques that introduce deliberate difficulties: they make learning feel hard in the short term but have been shown to improve long-term retention. These include retrieval practice, spaced repetition, and interleaving. Without them, the system just dumps explanations or leans on passive techniques that don't involve the learner, making the concepts ephemeral.

Verification gate

I would argue that this is the most crucial piece of the system. The Coach generates explanations, but it's the Verification Gate that checks them against sources before they reach you.

The whole purpose of this workflow is to cement information in your head. Your brain will cement whatever you learn, including wrong things that can be propagated by an unchecked Coach. An unverified tutor is worse than no tutor.

Misconceptions as first-class data

Each misconception or error isn't just met with a "you got it wrong". Instead, each one gets logged as it happens, including information like:

  • The root cause
  • The correction
  • The linked flashcards to close the gap

These errors are crucial because they highlight where your mental model breaks. Tackling them turns these mistakes into the most valuable part of the session. Here's a sample artifact.

misconceptions.md
# Misconceptions — RAG Pipelines

> Last updated: 2026-03-25

## M1 — Compute Cost Differentiates Fixed-Size from Recursive Splitting
**Session:** 1
**What happened:** During discrimination question "when would you choose fixed-size over recursive?", learner said: "if it's computationally expensive to chunk with a smarter strategy."
**Root cause:** Conflated "recursive splitting is computationally more expensive" (false) with the actual cost differentiator (semantic chunking requires embeddings). Recursive character splitting is very cheap (string operations on delimiters, no ML).
**Correction:** The real differentiator is whether document structure correlates with meaning. Choose fixed-size when structure is irrelevant (DNA sequences, logs, raw data dumps). Choose recursive when structure is meaningful (articles, markdown, code). Compute cost is never the reason.
**Why it matters:** Wrong reasoning leads to wrong chunking strategy choice in production. This was the key discrimination failure in this session.
**Follow-up:** — (corrected in-session)
**Source:** — (from Socratic dialogue in session)
**Cards:** card-4 (discrimination question)
**Status:** remediated
---

## M2 — Semantic Chunking Heuristic Too Vague
**Session:** 1
**What happened:** End-of-session retrieval — learner said "when there are clear ideas but at different locations." This definition applies to almost any document.
**Root cause:** Learner has the mechanism correct (embed sentences, split on similarity drops) but the decision heuristic is under-specified. Missing the cost component and the "no structural markers exist" precondition.
**Correction:** Semantic chunking is chosen when: 1) The document has no structural markers (no headers, consistent formatting), AND 2) Meaningful topic shifts exist that only embeddings can detect, AND 3) The quality gain from semantic boundaries justifies the embedding cost (sentences × latency + cost). This is much narrower than "when ideas are at different locations."
**Why it matters:** Overgeneralized heuristic could lead to using semantic chunking on documents where recursive (much cheaper) would work fine. Cost awareness is critical for production pipelines.
**Follow-up:** S2 retrieval verification showed cost framing is still incorrect ("small dataset" vs "quality justifies cost"). Needs worked example connecting quality ROI to document-specific factors.
**Source:** — (from end-of-session recall in session 1; S2 Q2 confirmation)
**Cards:** card-6 (discrimination question about semantic chunking) — quality 2 grade in S2
**Status:** recurring
---
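An entry like the ones above is easy to treat as structured data rather than prose. A hypothetical Python representation, with field names mirroring the artifact (the class itself is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Misconception:
    """Hypothetical structured form of a misconceptions.md entry."""
    session: int
    what_happened: str
    root_cause: str
    correction: str
    cards: list = field(default_factory=list)  # flashcards that close the gap
    status: str = "open"                       # open -> remediated | recurring

    def remediate(self) -> None:
        self.status = "remediated"             # corrected in-session

    def recurred(self) -> None:
        self.status = "recurring"              # resurfaced: drill it next time
```

The status transitions are what matter: M1 above ended at "remediated", while M2 flipped to "recurring" after the S2 check, which is the signal to keep drilling it.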

Savepoints for multi-session continuity

Every session ends with a structured savepoint, recording details like:

  • Where you stopped
  • What's in progress
  • What's to be tackled in the next session

Learning is not instant. It happens over the span of weeks, far longer than any single session can run! Without saving your progress, "resuming" is the same thing as "starting over".
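A savepoint can be as simple as a small JSON blob. Here's a hedged sketch; the field names are assumptions, not the system's actual schema:

```python
import json

def make_savepoint(stopped_at: str, in_progress: list, next_up: list) -> str:
    """Sketch of an end-of-session savepoint; field names are assumed."""
    return json.dumps({
        "stopped_at": stopped_at,     # where the session ended
        "in_progress": in_progress,   # concepts started but not yet solid
        "next_session": next_up,      # what the Coach should open with
    }, indent=2)
```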

Cross-project knowledge reuse

A central registry tracks concepts across all your learning projects. If you already mastered "writing unit tests" in a project on writing testable code, the system won't re-teach it when you begin a new project in an adjacent field (e.g. "CI/CD best practices", which may cover unit tests for automation).

The biggest waste in self-directed learning is going over ground you've already covered. This registry ensures that doesn't happen.
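The registry check itself is simple set logic. A minimal sketch, assuming the registry maps concept names to proficiency levels like the ones in the learning plan:

```python
def filter_new_concepts(plan: list, registry: dict,
                        threshold: str = "Solid") -> list:
    """Keep only concepts the central registry hasn't already marked at
    or above `threshold`. The level names follow the learning plan's
    labels; the function itself is an illustrative assumption."""
    levels = ["Unseen", "Developing", "Solid"]
    floor = levels.index(threshold)
    return [concept for concept in plan
            if levels.index(registry.get(concept, "Unseen")) < floor]
```

This is roughly what produced the "Prior Knowledge" section of the plan above: anything already Solid gets referenced, not re-taught.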

What could be improved

I drive this system daily, so I'm constantly making incremental improvements as I run into points of friction. However, a couple of persistent problems remain.

The Verification Gate isn't perfect

This agent is what keeps the Coach grounded. Ideally, it would be invoked every time the Coach makes a claim, but that's expensive: web searches are costly and add latency.

My compromise was to have the Coach call the Verification Gate at the start and end of each session. At the start, the claims the Coach plans to bring up are verified in one go. At the end, the flashcards generated from the session are fact-checked to ensure their answers are grounded. This covers a lot of ground already, but there are edge cases. For example, if I ask for clarification on something that isn't in the learning plan, the Coach answers without the Verification Gate.
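Batching is the key idea: one Verification Gate pass over all planned claims instead of a web search per claim. A minimal sketch, with the search backend injected as a callable (an assumption; the real gate presumably uses Claude Code's web search tooling):

```python
def verify_batch(claims: list, search) -> dict:
    """One Verification Gate pass over a batch of claims. `search` is an
    injected callable returning supporting evidence or None (assumed)."""
    results = {}
    for claim in claims:
        evidence = search(claim)
        results[claim] = "grounded" if evidence else "needs-review"
    return results
```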

It can't teach everything

As much as I'd love to use this system to learn anything and everything, that's impractical. I find myself fortunate enough to be learning topics that have the following characteristics:

  1. There is a wide corpus of information available online - It's easy to look up facts about these topics. For example, SEO was one of my ultralearning projects, and it's a topic that's very easy to learn about online. If it were a niche field where online information is few and far between, the system would struggle because it has little to work with.
  2. There is a standard approach to training - It's easy to identify what it takes to become proficient at the skill. Again, with SEO, there are many resources that discuss how to gain proficiency, which guides the system in figuring out how to get the learner there. If a field has no standard approach to training, the system can't adapt it into a project. This article goes into this issue extremely well.
  3. Input is text-based - It's feasible to communicate your capability through text. With SEO, or coding in general, it's easy to interact with the agent through text. Explaining how something works? Write an explanation. Learning how to implement a certain integration? Write out the code and have the agent validate it. However, in domains where text can't be the primary mode of input, the system begins to fail. Learning how to draw has been on my bucket list for a while, but I don't see it working with the system as it is. Although Claude Code can accept images, can it truly understand what an image is depicting? How would it know what feedback to give to help me improve?


These are the problems that I'm actively tackling and hope to find solutions for. This system is not a silver bullet, but when it works well, it works really well.

The takeaway

For as long as I can remember, I've had a huge passion for learning, and I believe the ability to self-learn is a skill that should be taught more. With the advent of AI tools, self-learning is more attainable than it has ever been. You essentially get access to an infinitely patient tutor that lets you learn at your own pace. The main issue with off-the-shelf solutions is that these tutors are stateless and ungrounded. Starting a new session is essentially getting a new tutor: it doesn't know what you've covered or where you're struggling, and it can end up teaching the wrong things as a result.

However, the tools to make AI-assisted learning stateful, structured, and grounded already exist! This gap between chatting with AI and actually learning with it is mostly an engineering problem, and this agent system with Claude Code makes that problem solvable.

Happy coding 🚀
