The maintenance problem nobody talks about when they sell you on AI agents
Static instructions in a world that keeps changing.
Every team using AI agents has a version of the same file. Maybe it's called SKILL.md, maybe CLAUDE.md, maybe a prompt template stored in a database. Whatever the name, it contains the instructions that tell an agent how to do a specific task: deploy a service, write a migration, triage a bug report.
These files work well on the day you write them. They encode your current codebase structure, your current tool versions, your current conventions. The problem is that everything around them keeps moving. The codebase changes. The model gets updated. Users start asking for things you didn't anticipate. And the skill file sits there, frozen in time, slowly becoming wrong.
We've seen this pattern across every project we run. A skill that worked perfectly in January starts failing quietly by March. Not dramatically. Not with errors. It just produces slightly worse output, takes slightly longer, needs slightly more human correction. The degradation is invisible until someone notices the agent isn't pulling its weight anymore.
Why skills degrade silently
The core issue is a mismatch between two rates of change. Your environment changes continuously: new commits land every day, dependencies get updated, API contracts shift, team conventions evolve. But your skill definitions change only when a human remembers to update them. And humans forget, because the skill still appears to work.
This is different from code breaking. When code breaks, you get an error. When a skill degrades, you get output that looks plausible but is subtly off. The agent uses a directory structure that was renamed two weeks ago. It follows a convention that the team has since abandoned. It calls an API endpoint that now expects different parameters.
The result: humans start compensating. They manually correct the agent's output. They add extra review steps. They lose trust in the skill and start doing the work themselves. The whole point of the skill was to save time, and now it's costing time instead.
According to LangChain's State of Agent Engineering report, 57% of organizations now have agents in production. But 32% cite quality as their top barrier. That quality problem isn't about the model being dumb. It's about the instructions being stale.
The five-step improvement loop
The fix isn't to write better skills once. It's to build a system where skills improve themselves continuously. We use a five-step loop: Observe, Inspect, Amend, Evaluate, Update. Each step feeds the next, and the cycle repeats after every batch of executions.
1. Observe: log every execution
Every time a skill runs, record what happened. Not just success or failure, but the full context: what the input was, what the agent produced, how long it took, whether a human corrected the output, and what that correction looked like.
This is the raw material for everything else. Without observation data, you're guessing about what's going wrong. With it, you have evidence.
In practice, this means wrapping your skill execution in a thin logging layer. Capture the prompt that was sent, the completion that came back, and any post-execution edits. Store it in a structured format so you can query it later.
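In Python, that logging layer can be as thin as the sketch below. The `run_skill` callable, the log path, and the field names are all illustrative, not tied to any particular framework; swap in your own execution function and store.

```python
import json
import time
import uuid
from datetime import datetime, timezone

LOG_PATH = "skill_executions.jsonl"  # hypothetical location for the log

def logged_run(skill_name, run_skill, prompt, **params):
    """Run a skill and append a structured record of the execution."""
    start = time.monotonic()
    output = run_skill(prompt, **params)
    record = {
        "id": str(uuid.uuid4()),
        "skill": skill_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "params": params,
        "output": output,
        "latency_s": round(time.monotonic() - start, 3),
        "correction": None,    # filled in later if a human edits the output
        "outcome": "pending",  # later: accepted / rejected / modified
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

The append-only JSON-lines format keeps writes cheap; the `correction` and `outcome` fields start empty and get filled in when a human reviews the output.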
2. Inspect: find patterns in failures
Look at the logs periodically. Not every individual execution, but the aggregate. What percentage of runs needed human correction? Are the corrections clustered around specific inputs or specific parts of the output? Did the failure rate change after a particular date?
Pattern detection is where you separate signal from noise. A single failure might be a fluke. Ten failures with the same correction suggest a systemic issue. Maybe the agent keeps importing from a module that was refactored. Maybe it uses a deprecated API pattern. Maybe it generates test files in the wrong directory.
You can do this manually by reviewing logs once a week, or you can automate it by having an agent analyze the execution history. Either way, the goal is a short list of specific, recurring problems.
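A minimal sketch of the automated version, assuming each log record tags its human correction with a category (the function and field names are illustrative, matching the logging layer above rather than any particular tool):

```python
import json
from collections import Counter

def inspect_log(path, min_count=3):
    """Summarize correction patterns from a JSON-lines execution log."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    corrected = [r for r in records if r.get("correction")]
    rate = len(corrected) / len(records) if records else 0.0
    # Cluster corrections by a human- or agent-assigned category tag.
    clusters = Counter(r["correction"]["category"] for r in corrected)
    # Only clusters above the threshold count as systemic, not flukes.
    recurring = {cat: n for cat, n in clusters.items() if n >= min_count}
    return {"correction_rate": rate, "recurring": recurring}
```

The `min_count` threshold encodes the signal-versus-noise rule from above: a single failure is ignored, repeated corrections of the same kind surface as candidates for amendment.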
3. Amend: propose instruction changes
For each pattern you identified, draft a change to the skill instructions. Be specific. Don't add vague guidance like “be more careful with imports.” Instead, add a concrete rule: “Import database utilities from src/lib/db, not src/utils/database. The latter was removed in February 2026.”
This is where the SCOPE paper's research is relevant. Their work on prompt evolution showed that iterative, targeted amendments to instructions improved task success rates from 14.23% to 38.64%. The key wasn't rewriting everything from scratch. It was making precise, small changes based on observed failure patterns.
Keep amendments minimal. Each change should address exactly one observed problem. This makes it easy to evaluate whether the change helped and easy to roll back if it didn't.
4. Evaluate: measure improvement
Before deploying an amended skill, test it. Run the updated instructions against the same inputs that previously failed. Compare the output to the corrected versions from your logs. Did the amendment fix the problem without introducing new ones?
This is the step most teams skip, and it's the one that prevents regressions. A change that fixes import paths might accidentally break something else if the new instructions are too broad. Testing catches that.
The evaluation doesn't need to be elaborate. A handful of representative test cases is enough. The goal isn't exhaustive coverage. It's a sanity check that the amendment makes things better, not worse.
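One way to sketch this in Python: represent each golden case as an input plus a predicate, so harmless variation in model output doesn't count as a failure. The names here are illustrative.

```python
def evaluate_amendment(run_skill, golden_cases):
    """Run a skill against golden cases; return the cases that failed.

    golden_cases: list of (input, check) pairs, where check(output) -> bool.
    Predicate checks tolerate harmless variation in model output better
    than exact string matches.
    """
    failures = []
    for case_input, check in golden_cases:
        output = run_skill(case_input)
        if not check(output):
            failures.append((case_input, output))
    return failures

def is_safe(run_skill, golden_cases):
    """An amendment is safe to deploy only if no golden case regressed."""
    return not evaluate_amendment(run_skill, golden_cases)
```

With a handful of cases per skill this runs in well under a minute, which keeps it cheap enough to gate every deployment.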
5. Update: deploy or roll back
If the evaluation passes, deploy the updated skill. If it doesn't, discard the amendment and investigate further. Keep the previous version available so you can roll back quickly if the new version causes problems in production.
Version your skill files. This is the same principle as versioning code: you want a history of what changed, when, and why. When something breaks, the diff tells you exactly what to undo.
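As a sketch, the deploy-or-roll-back decision reduces to a few lines. In practice the version history lives in version control; the explicit backup copy here just makes the rollback path visible. Paths and names are illustrative.

```python
import shutil
from pathlib import Path

def deploy_amendment(skill_path, amended_text, passed_eval):
    """Deploy an amended skill file, keeping the previous version on disk.

    Returns True if deployed, False if the amendment was discarded.
    """
    skill = Path(skill_path)
    if not passed_eval:
        return False
    backup = skill.with_suffix(skill.suffix + ".prev")
    shutil.copy(skill, backup)       # keep the old version for rollback
    skill.write_text(amended_text)   # deploy the new instructions
    return True

def rollback(skill_path):
    """Restore the previous version of a skill file."""
    skill = Path(skill_path)
    shutil.copy(skill.with_suffix(skill.suffix + ".prev"), skill)
```

With git in place, `deploy_amendment` becomes a merge and `rollback` becomes a revert, but the contract is the same: a failed evaluation never touches the live skill, and the last known-good version is always one step away.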
Building the observation layer
The entire loop depends on good observation data. Here's what we capture for every skill execution:
- Input context: the task description, relevant files, and any parameters passed to the skill
- Full prompt: the assembled prompt after template expansion, including any dynamic context
- Raw output: exactly what the model returned, before any post-processing
- Human corrections: any edits the human made to the output before accepting it
- Execution metadata: timestamp, model version, latency, token counts
- Outcome signal: did the human accept, reject, or modify the output?
The most valuable field is human corrections. Every time a person edits agent output, they're providing a training signal: “Here is what you should have produced.” Aggregating these corrections reveals exactly where the skill instructions are falling short.
Store this data in a simple, queryable format. A database table works. A structured JSON log works. The format matters less than consistency. Every execution, every correction, every outcome.
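As one concrete option, a single SQLite table covers every field in the list above. The schema and the example query are a sketch, not a prescription; any queryable store works.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS executions (
    id          TEXT PRIMARY KEY,
    skill       TEXT NOT NULL,
    ts          TEXT NOT NULL,   -- timestamp of the execution
    model       TEXT,            -- model version
    prompt      TEXT,            -- full assembled prompt
    output      TEXT,            -- raw model output
    correction  TEXT,            -- NULL when the output was accepted as-is
    outcome     TEXT CHECK (outcome IN ('accepted', 'rejected', 'modified')),
    latency_ms  INTEGER,
    tokens      INTEGER
);
"""

conn = sqlite3.connect("skill_log.db")
conn.executescript(SCHEMA)

# Example inspection query: correction rate per skill, last 30 days.
QUERY = """
SELECT skill,
       AVG(correction IS NOT NULL) AS correction_rate
FROM executions
WHERE ts >= datetime('now', '-30 days')
GROUP BY skill;
"""
```

`correction IS NOT NULL` evaluates to 0 or 1 in SQLite, so averaging it gives the correction rate directly; that one query is most of the weekly inspection step.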
From observation to amendment
Once you have a few weeks of execution logs, the inspection step becomes straightforward. Group corrections by type. Look for clusters. The most common corrections point directly to the most impactful amendments.
Here's an example from a real project. We had a skill that generated API endpoint implementations. Over two weeks, we noticed that 40% of the time, the human corrected the error handling pattern. The agent was using try/catch blocks with generic error messages. The team convention was to use a custom error class with structured error codes.
The fix was one line added to the skill instructions: “Always use AppError from src/lib/errors with a specific error code. Never use generic Error or unstructured error messages.” After that amendment, the correction rate for error handling dropped to under 5%.
This is the kind of precise, targeted change that compounds. Each amendment removes one source of friction. Over months, the skill accumulates dozens of these micro-corrections and becomes dramatically more accurate than the original version.
Verification and rollback
Every amendment carries risk. A change that fixes one failure pattern might introduce another. That's why the evaluate step exists, and why versioning matters.
We keep every version of every skill file in version control. When an amendment is proposed, it goes through the same process as a code change: create a branch, make the edit, run evaluation, merge if it passes. If a deployed amendment causes problems, we revert to the previous version immediately.
The evaluation itself is simple. We maintain a small set of “golden” test cases for each skill: inputs where we know exactly what good output looks like. Run the amended skill against these cases. If all golden cases pass, the amendment is safe. If any regress, investigate before deploying.
This approach is inspired by the EvoAgentX project, an open-source framework for self-evolving agent systems. Their core principle is the same: treat every prompt change as a hypothesis, test it before deploying, and keep the ability to roll back.
The compounding effect
A skill that improves by 5% per month doesn't seem like much. But compounding works in your favor. After six months, that skill is roughly 34% better. After a year, it's nearly 80% better. And because each improvement reduces the amount of human correction needed, the team reclaims time that compounds too.
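The arithmetic behind those figures is plain compound growth:

```python
def compounded_gain(monthly_rate, months):
    """Cumulative improvement from a steady monthly gain."""
    return (1 + monthly_rate) ** months - 1

# 5% per month compounds to ~34% over six months and ~80% over a year.
print(f"{compounded_gain(0.05, 6):.0%}")
print(f"{compounded_gain(0.05, 12):.0%}")
```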
Compare this to the alternative: a static skill that degrades by a few percent per month as the environment changes. After six months, the team is doing significant manual work to compensate. After a year, someone rewrites the skill from scratch, losing all the accumulated context.
The self-improving skill avoids the rewrite cycle entirely. It evolves with the codebase, the model, and the team's needs. The knowledge about “how to do this task well” accumulates instead of decaying.
This is the same principle behind the SCOPE paper's results. Their prompt evolution experiments didn't find one perfect prompt. They found that iterative refinement, guided by execution data, reliably converges on better instructions over time. The improvement isn't a one-time jump. It's a steady climb.
How we implement this at Buildway
We run this loop across every project we manage. Here's the practical setup:
- Skills live in the repo. Each project has a directory of skill files, version controlled alongside the code they reference. When the code changes, the skills are right there to update.
- Execution logs go to a shared store. Every agent session records its inputs, outputs, and corrections. We query this weekly to look for degradation patterns.
- Amendments are pull requests. When we identify a needed change, it goes through code review. The reviewer checks that the amendment is specific, targeted, and backed by data from the logs.
- Golden tests gate deployment. Each skill has a handful of test cases. The amended skill must pass all of them before merging.
- Rollback is one revert away. Because amendments are individual commits, reverting a bad change takes seconds.
The overhead of this process is small. The weekly inspection takes about 30 minutes per project. Writing an amendment takes 5 to 10 minutes. Running golden tests takes under a minute. In return, our skills stay accurate and the team trusts them enough to use them for real work.
Skills that evolve with your product
The best skill file isn't the one that's written perfectly on day one. It's the one that gets a little better every week. The difference between a team that maintains static skills and a team that runs a self-improvement loop is the difference between a tool that slowly rusts and a tool that sharpens itself with use.
The technology for this exists today. You don't need a specialized framework. You need logging, a review process, versioned files, and the discipline to look at the data regularly. The five step loop is simple enough to start this week and powerful enough to transform how your agents perform over the next quarter.
Your codebase will keep changing. Your models will keep updating. Your users will keep asking for new things. The question is whether your agent skills will change with them, or whether they'll sit frozen in time, slowly drifting from the reality they're supposed to operate in.
Build the loop. Start observing. Let your skills earn their name.