Why Self-Improving AI Agents Need an Approval Gate

If you want your agents to improve over time, the idea of enabling a self-improving system is genuinely appealing. And for good reason. When set up correctly, it allows the agent to learn from corrections, detect useful patterns, consolidate operational judgment, and flag improvements you might never have thought to formalize yourself. The problem starts when that learning stops being useful memory and begins touching the agent's behavioral core without human oversight.

That is the critical boundary.

Because there is a meaningful difference between an agent that learns and an agent that rewrites its own operational constitution on its own terms.

What the Self-Improving Layer Actually Is

[Image: a robot apprentice takes frantic notes and stops at a locked SOUL.md door that requires human authorization]

In this context, self-improving is not an abstract idea or a vague promise. It exists as a standalone skill at /root/self-improving/SKILL.md and presents itself as a layer of self-reflection, self-criticism, self-learning, and self-organized memory. Its goal is for the agent to evaluate its own work, detect failures, learn from explicit corrections, and consolidate permanent improvements over time.

The proposal is powerful because it tackles one of the classic limitations of agents: repeating mistakes, forgetting preferences, and failing to accumulate genuine operational judgment between tasks.

Where It Lives and How It Is Organized

The skill lives in ~/self-improving/ (the same directory as the /root/self-improving/ path above) and organizes its memory there across several layers. According to its own architecture, memory is structured by tiers and namespaces so that everything is not lumped together. The main pieces include:

memory.md, which acts as HOT memory, small and always loaded.
corrections.md, where explicit corrections and reusable lessons are recorded.
learning.md, which defines the logic for learning, classification, and pattern promotion.
heartbeat-rules.md, which governs periodic maintenance and conservative system review.
heartbeat-state.md, for tracking the state of reviews.
Folders like projects/, domains/, and archive/, which allow patterns to be separated by project, domain, or age.

The skill's own documentation makes a reasonable point clear: not everything should enter hot memory, not all learning is global, and not every pattern deserves to become a stable rule. That is why it uses a tiered structure with HOT, WARM, and COLD memory, along with rules for promotion, demotion, and archiving.

On paper, it is well thought out.

What Real Advantages It Brings

It is worth acknowledging this plainly: the advantages are obvious.

An agent with continuous improvement can:

learn from real user corrections,
remember confirmed preferences,
identify workflows that consistently perform well,
detect operational or editorial improvements the human does not always put into words,
and accumulate judgment over time.

That has serious value. For anyone who works with agents daily, a system that flags improvement opportunities on its own can save a significant amount of work. Humans often fail to notice certain patterns, or do not formalize them in time, so having the agent surface them is useful.

The correct position, therefore, is not "disable learning." The correct position is different: keep learning active, but do not hand it control over the agent's core.

Where the Underlying Problem Was

The problem emerged in the promotion logic for learning.

Originally, the general idea was that when a pattern repeated enough times, it could move up a level. In itself, that makes sense. If a correction appears multiple times, it is probably no longer an anomaly but a pattern. The risk was that this logic could become too aggressive when the final destination of the promotion was not an auxiliary note but something structural, like SOUL.md or AGENTS.md.

And that changes everything.

Because SOUL.md and AGENTS.md are not simple notepads. They are files that shape the agent's core behavior: how it responds, decides, prioritizes, and operates. Touching them without strong human validation is equivalent to allowing the system to rewrite its own editorial stance or operational framework by inertia.

Why Editing SOUL.md or AGENTS.md Is Never a Minor Change

In most deployments, those files function as the most sensitive layer of the system. They do not just contain instructions. They contain operational identity.

If an erroneous or overly localized rule ends up inside them, it stops behaving like a contextual observation and becomes a first-level directive. The agent no longer treats it as a hint. It treats it as a standing order.

That distinction is crucial.

A temporary preference, an instruction given in haste, or a reaction born of frustration should not be able to climb on their own to the agent's core. If that happens, the system is no longer learning with judgment. It is crystallizing noise.

The Clearest Example: Repeating Commands Out of Frustration

This risk deserves particular care, and it makes for a strong example.

Sometimes, when an agent does not respond on the first try or does not do exactly what is expected, the human loops and repeats the instruction several times. Not because they have discovered a great general rule, but because they are trying to course-correct in the moment. In that context, repetition does not necessarily mean "this should become stable doctrine." Most of the time it just means "please just do it."

If the system counts three such repetitions as a pattern worthy of promotion, it can make a serious error: turning an impulsive order into a permanent rule. And if that rule ends up in SOUL.md, the damage is greater, because it no longer affects just that one case. It affects how the agent operates across all future tasks.

Put differently: not every repetition is wisdom. Sometimes it is just friction.
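To make the friction problem concrete, here is a minimal sketch of a purely frequency-based check. It is not the skill's actual code: the function name reaches_threshold and the constants are hypothetical, chosen only to mirror a "three repetitions in seven days" style of rule.

```python
from datetime import datetime, timedelta

# Hypothetical values for illustration; not taken from the skill's code.
THRESHOLD = 3
WINDOW = timedelta(days=7)


def reaches_threshold(timestamps, now):
    """Return True if at least THRESHOLD events fall within WINDOW before now."""
    recent = [t for t in timestamps if timedelta(0) <= now - t <= WINDOW]
    return len(recent) >= THRESHOLD


now = datetime(2024, 1, 10, 12, 0)

# The same correction given on three separate days: plausibly a real pattern.
spread_out = [now - timedelta(days=d) for d in (5, 3, 1)]

# The same instruction repeated three times in two minutes: likely frustration.
burst = [now - timedelta(seconds=s) for s in (120, 60, 10)]

print(reaches_threshold(spread_out, now))  # True
print(reaches_threshold(burst, now))       # True
```

Both cases cross the threshold. A counter alone cannot tell a considered preference from a frustrated loop, which is why something other than frequency has to decide what reaches the core.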

The Second Risk: Degrading Vital but Rarely Used Functions

There is another danger that is less obvious but equally important.

A self-improvement system does not only watch what repeats. It can also reorganize, downgrade, or reinterpret things that have not appeared for a long time. On paper, this helps keep memory clean. In practice, it can become a problem when the subject is a critical but infrequent function.

Imagine an important agent capability that barely gets used for weeks or months. Not because it no longer matters, but simply because it has not been triggered during that period. If the human has also forgotten that continuous improvement is still running, that function can end up exposed to degradation from disuse, misclassification, or a loss of relative weight against more recent but less important patterns.

This is a classic failure mode in systems that over-optimize for frequency. They confuse "not appearing right now" with "no longer relevant."

In a complex agent, that is dangerous.
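A small sketch shows how this failure mode can arise. It assumes a hypothetical demotion rule based only on how long an entry has been idle; the Entry class, the MAX_IDLE cutoff, and the example entries are invented for illustration and are not the skill's real criteria.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical cutoff; the skill's real demotion criteria are not shown here.
MAX_IDLE = timedelta(days=30)


@dataclass
class Entry:
    name: str
    last_used: date


def demotion_candidates(entries, today):
    """Return names of entries idle longer than MAX_IDLE, judged by recency alone."""
    return [e.name for e in entries if today - e.last_used > MAX_IDLE]


today = date(2024, 6, 1)
entries = [
    Entry("preferred greeting style", date(2024, 5, 30)),
    Entry("obsolete formatting note", date(2024, 3, 1)),
    Entry("disaster recovery procedure", date(2024, 2, 15)),
]

print(demotion_candidates(entries, today))
# ['obsolete formatting note', 'disaster recovery procedure']
```

The stale note and the critical procedure look identical to a rule that only measures time since last use. Importance is simply not part of the signal.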

What the Current Mechanics Actually Say

Here is the important part: the practical solution has already been built into the system's logic.

In learning.md, the current promotion rule makes it clear that when a pattern reaches 3 repetitions in 7 days, it does not get promoted automatically. Instead, this happens:

1. A message is sent via Telegram.

2. That message explains the detected pattern, its source, and the proposed destination.

3. The system waits for an explicit response.

4. Only if the response is a clear "ok" for that specific destination does the promotion execute.

5. The pattern's state is then updated to promoted.

This completely changes the nature of the system.

We are no longer dealing with an agent that rewrites itself without asking permission. We are dealing with an agent that proposes structural learning, but does not impose it.

And that difference makes it far healthier.
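The five steps listed earlier in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of learning.md: the Proposal class, the function names, and the "ok <destination>" reply format are all hypothetical, and the actual Telegram delivery is left out.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    pattern: str
    source: str
    destination: str
    state: str = "pending"


def build_message(p):
    """Step 2: describe the pattern, its source, and the proposed destination."""
    return (
        f"Pattern detected: {p.pattern}\n"
        f"Source: {p.source}\n"
        f"Proposed destination: {p.destination}\n"
        f"Reply 'ok {p.destination}' to approve."
    )


def is_approved(reply, p):
    """Step 4: accept only an explicit 'ok' naming this exact destination."""
    # The 'ok <destination>' format is an assumption made for this sketch.
    return reply.strip().lower() == f"ok {p.destination}".lower()


def handle_reply(reply, p, promote):
    """Steps 4 and 5: promote and mark as promoted only on explicit approval."""
    if not is_approved(reply, p):
        return False
    promote(p)
    p.state = "promoted"
    return True


promoted = []
p = Proposal("always answer in short paragraphs", "corrections.md", "SOUL.md")

print(handle_reply("sure, whatever", p, promoted.append), p.state)  # False pending
print(handle_reply("ok AGENTS.md", p, promoted.append), p.state)    # False pending
print(handle_reply("OK SOUL.md", p, promoted.append), p.state)      # True promoted
```

The point of the strict match is that a vague reply, or an approval aimed at a different file, leaves the proposal pending instead of being read as consent.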

The Real Practical Solution

The position that follows from all this is reasonable, and useful for anyone building something similar.

The solution is not to strip the agent of its ability to learn. The solution is to put a gate before the core.

That means:

let it detect patterns,
let it suggest improvements,
let it accumulate judgment,
but before writing to SOUL.md or AGENTS.md, require explicit approval.

That is the right fix. It preserves the best of both worlds: a system that learns and proposes, with human review on every structural change.
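One way to enforce the gate at the lowest level is to route every file write through a single guarded function. The sketch below assumes a hypothetical guarded_append helper and ApprovalRequired exception; neither comes from the skill itself, and how the approved flag gets set is left to whatever approval flow is in place.

```python
import tempfile
from pathlib import Path

# Files the article treats as the agent's core.
PROTECTED = {"SOUL.md", "AGENTS.md"}


class ApprovalRequired(Exception):
    """Raised when a protected file is written without explicit approval."""


def guarded_append(path, text, approved=False):
    """Append text to path, refusing protected files unless approved is True."""
    path = Path(path)
    if path.name in PROTECTED and not approved:
        raise ApprovalRequired(f"{path.name} needs explicit human approval")
    with path.open("a", encoding="utf-8") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    # Ordinary memory files are written freely.
    guarded_append(Path(tmp) / "corrections.md", "lesson\n")
    try:
        guarded_append(Path(tmp) / "SOUL.md", "new rule\n")
        blocked = False
    except ApprovalRequired:
        blocked = True
    # The same write succeeds once approval has been given.
    guarded_append(Path(tmp) / "SOUL.md", "new rule\n", approved=True)
    soul_text = (Path(tmp) / "SOUL.md").read_text(encoding="utf-8")

print(blocked, repr(soul_text))  # True 'new rule\n'
```

Learning stays unrestricted for ordinary memory files, while the two core files fail closed by default.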

Practical Recommendation

If you are going to enable a continuous improvement system, the recommendation is simple:

put the approval gate in from day one.

Do not wait for the agent to accumulate enough momentum to start touching sensitive things. Do not assume every repetition equals a good rule. And do not trust that you will always remember which rarely used functions still matter.

Let the system learn. But do not let it modify the agent's constitution without first passing through human review.

Because learning is great.

Self-rewriting without asking, not so much.