Eslam Helmy
11 min read

Build Your Own Dev Agent — Lesson 9: Skill Self-Evaluation

Your agent has 7+ skills running on cron schedules. Some run daily, some run every 2 hours. But are they all still useful? Are any broken? Is the heartbeat report getting bloated? Without evaluation, skills accumulate like dead code -- they run, consume context, and produce output nobody reads. The skill evaluator is a meta-skill that audits every other skill in the system.


Where You Are

your-project/
  CLAUDE.md
  .claude/
    preferences.md
    tasks-active.md
    tasks-completed.md
    progress.txt
    error-log.md
    learnings.md
    auto-resolver.md
    priority-map.md
    cron-jobs.json                   # 6+ jobs
    failed-jobs.log
    settings.local.json
    hooks/
      stop-telegram.sh
      permission-gate.sh
    skills/
      daily-planner/
        SKILL.md
      pr-reviewer/
        SKILL.md
      git-reviewer/
        SKILL.md
      standup-generator/
        SKILL.md
      meeting-ingest/
        SKILL.md
      heartbeat/
        SKILL.md

See It: The Problem

Three things happen to skills over time:

1. Skills go stale. The content-creator skill runs every 5 hours, but your content pipeline has been empty for two weeks. It fires, checks pipeline.json, finds nothing, and exits. Every run wastes context and adds noise to progress.txt.

2. Skills break silently. The log-monitor skill checks Azure workspace logs every weekday morning. But your az login token expired three weeks ago. The skill fails, writes to failed-jobs.log, and nobody notices because the heartbeat only checks if the job ran -- not if it succeeded meaningfully.

3. Skills get noisy. The heartbeat started at 120 words of output. After adding 8 more skills to its verification list, it now produces 380 words per run. Every 2 hours. That is 4,560 words per day of health reports. The signal-to-noise ratio dropped.

Without a dedicated evaluation pass, you are flying blind on skill quality.

See It: The Skill Evaluator Pattern

The skill evaluator is a meta-skill. It does not do any work itself -- it audits the skills that do work. It runs nightly (3:03 AM), after all daily skills have completed, and reads everything:

| Input File | What It Reveals |
|---|---|
| learnings.md | Which skills triggered learning events (good or bad) |
| error-log.md | Which skills caused errors |
| progress.txt | Which skills ran, how often, what they produced |
| failed-jobs.log | Which skills failed and how many times |
| Every SKILL.md | Current skill definitions, schedules, dependencies |

The evaluator cross-references these files to build a picture of each skill's health across five dimensions.
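The evidence-collection step can be sketched in a few lines. This is a minimal sketch, not the evaluator itself: it assumes each log line mentions the skill it concerns by name, and it skips the 7-day date filtering the real skill would apply.

```python
from collections import Counter
from pathlib import Path

def count_mentions(log_path, skill_names):
    """Count log lines that mention each skill by name."""
    path = Path(log_path)
    counts = Counter()
    if not path.exists():
        return counts
    for line in path.read_text().splitlines():
        for name in skill_names:
            if name in line:
                counts[name] += 1
    return counts

# Discover skills from the directory layout, then cross-reference the logs.
skills = sorted(p.parent.name for p in Path(".claude/skills").glob("*/SKILL.md"))
runs = count_mentions(".claude/progress.txt", skills)
failures = count_mentions(".claude/failed-jobs.log", skills)
for name in skills:
    print(f"{name}: {runs[name]} run entries, {failures[name]} failure entries")
```

A skill with many run entries and zero failure entries is probably healthy; many failures, or no runs at all, is exactly what the five dimensions below are built to catch.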


See It: The 5 Dimensions

Each skill is scored 1-5 on five dimensions:

1. Effectiveness (Is it producing useful output?)

Does the skill produce output that gets used? A standup generator that writes standups you actually paste into Slack scores 5. A content-creator that runs on an empty pipeline scores 1.

Real example: the content-creator scored 2/5 because the pipeline had no active topics. It ran every 5 hours but produced nothing. Recommendation: add an idle-skip check -- if pipeline.json has no items in in_progress, exit without logging.
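The idle-skip check could look like this. The `pipeline.json` schema here (an array of items with a `status` field) is an assumption; adapt the key names to whatever your pipeline actually stores.

```python
import json
from pathlib import Path

def should_run(pipeline_path: str = "pipeline.json") -> bool:
    """Idle-skip gate: True only when the pipeline has work in progress.

    Assumes pipeline.json is a JSON array of items with a "status" field.
    """
    path = Path(pipeline_path)
    if not path.exists():
        return False  # no pipeline file at all -> nothing to do
    items = json.loads(path.read_text())
    return any(item.get("status") == "in_progress" for item in items)

# The skill's first step would then be:
#   if not should_run():
#       sys.exit(0)  # exit quietly -- no progress.txt entry, no noise
```

The important design choice is exiting *before* logging: an idle run that still writes to progress.txt just moves the noise problem around.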

2. Approach (Is the method still right?)

Is the skill using the best approach for the job? Maybe you originally built the PR reviewer to check 3 repos, but you have moved to a monorepo. The approach is outdated.

Real example: the link-checker scored 3/5 because it was doing deep validation on every link, including internal anchors. A tiered approach -- quick check for known domains, deep check for external URLs -- cut runtime by 60%.
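A tiered classifier is straightforward to sketch. The allowlist of known-good domains below is hypothetical; the real skill would build it from links that have passed deep checks before.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of domains that have never failed a deep check.
KNOWN_GOOD = {"github.com", "docs.python.org"}

def check_tier(url: str) -> str:
    """Pick how much validation a link needs: skip, quick, or deep."""
    netloc = urlparse(url).netloc
    if not netloc:
        return "skip"   # internal anchor like "#setup" -- no network call
    if netloc in KNOWN_GOOD:
        return "quick"  # HEAD request, status code only
    return "deep"       # full GET, follow redirects, verify content
```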

3. Relevance (Should this skill still exist?)

Some skills are created for a specific project or season. A "conference-prep" skill has zero relevance after the conference ends. A brother-transfer-check that runs on the 5th of every month is relevant 1 day per month and irrelevant the other 29.

Real example: the english-coach scored 5/5 on relevance because the user had active English classes. If classes ended, relevance would drop to 1.

4. Reliability (Does it run without failing?)

Check failed-jobs.log for entries matching this skill. Zero failures = 5. Occasional failures with auto-recovery = 4. Repeated failures = 2. Never ran successfully = 1.

Real example: the log-monitor scored 1/5 because az login had expired. Every run failed. The fix was adding a Step 0 capability pre-check: verify az account show succeeds before running the actual log query.
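A Step 0 pre-check can be as small as one subprocess call. This is a sketch: it treats a missing or hanging `az` CLI the same as an expired login, since either way the skill cannot do its job.

```python
import subprocess

def azure_login_ok() -> bool:
    """Step 0 capability pre-check: is there a live az CLI session?"""
    try:
        result = subprocess.run(
            ["az", "account", "show"],
            capture_output=True,
            timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False  # CLI missing or hanging counts as "not capable"
    return result.returncode == 0

# The log-monitor skill would start with:
#   if not azure_login_ok():
#       append one line to .claude/failed-jobs.log and exit early,
#       instead of letting every downstream query fail one by one
```

Failing fast here also makes the failure legible: one "login expired" entry in failed-jobs.log beats three weeks of cryptic query errors.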

5. Efficiency (Is the cost reasonable for the value?)

Does the skill use excessive context, produce overly long output, or run more frequently than necessary? The heartbeat growing from 120 to 380 words is an efficiency problem. A skill that runs every hour but only has new data once a day is wasteful.

Real example: the heartbeat scored 3/5 on efficiency. Recommendation: cap the report to 200 words maximum, only list items that need attention, skip healthy checks.
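The cap can be enforced mechanically. The sketch below assumes healthy checks are prefixed `OK:`, which is a convention you would have to adopt, not one the heartbeat already follows.

```python
def trim_report(lines, word_limit=200):
    """Drop healthy checks, then enforce a hard word budget.

    Assumes healthy lines are prefixed "OK:" -- a convention, not a
    format the heartbeat already guarantees.
    """
    needs_attention = [l for l in lines if not l.strip().lower().startswith("ok:")]
    kept, words = [], 0
    for line in needs_attention:
        n = len(line.split())
        if words + n > word_limit:
            kept.append("(truncated to stay under the word cap)")
            break
        kept.append(line)
        words += n
    return kept
```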


See It: Scoring and Recommendations

Scoring scale:

| Score | Meaning |
|---|---|
| 5 | Excellent -- no changes needed |
| 4 | Good -- minor improvements possible |
| 3 | Adequate -- specific improvements recommended |
| 2 | Needs work -- significant issues |
| 1 | Critical -- broken, stale, or irrelevant |

Recommendation triggers: any dimension scoring 3 or below generates a recommendation, and a score of 4 with a clear improvement path generates a low-priority suggestion. Recommendations have priority levels:

| Priority | When |
|---|---|
| HIGH | Any dimension scores 1-2 |
| MEDIUM | Any dimension scores 3 |
| LOW | Optimization suggestions (score 4 with room for improvement) |
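The mapping from score to priority is simple enough to write down. A sketch, with the score-4 case gated on whether there is actually something actionable to recommend:

```python
def priority_for(score: int, has_improvement_path: bool = False):
    """Map one dimension score to a recommendation priority (or None)."""
    if score <= 2:
        return "HIGH"
    if score == 3:
        return "MEDIUM"
    if score == 4 and has_improvement_path:
        return "LOW"
    return None  # a 4 with nothing actionable, or a clean 5
```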

The evaluator produces a report like this:

## Skill Evaluation Report -- 2026-04-11 03:03 AM
 
### Summary
- Skills evaluated: 14
- Average score: 3.9/5
- Recommendations: 6 (2 HIGH, 3 MEDIUM, 1 LOW)
 
### Scores
 
| Skill | Effect. | Approach | Relev. | Reliab. | Effic. | Avg |
|-------|---------|----------|--------|---------|--------|-----|
| daily-planner | 5 | 5 | 5 | 5 | 4 | 4.8 |
| heartbeat | 4 | 4 | 5 | 5 | 3 | 4.2 |
| log-monitor | 2 | 4 | 5 | 1 | 3 | 3.0 |
| content-creator | 2 | 3 | 4 | 5 | 2 | 3.2 |
 
### Recommendations
 
1. **[HIGH] log-monitor** -- Reliability 1/5. az login expired.
   Fix: Add Step 0 pre-check for az account show.
2. **[HIGH] content-creator** -- Effectiveness 2/5. Empty pipeline.
   Fix: Add idle-skip when pipeline.json has no in_progress items.
3. **[MEDIUM] heartbeat** -- Efficiency 3/5. Output grew to 380 words.
   Fix: Cap report to 200 words, skip healthy checks.

See It: Missing Skill Detection

The evaluator does not just audit existing skills. It also identifies repeated manual tasks that should become skills.

It does this by scanning:

  • tasks-active.md -- are there recurring tasks that are always present?
  • progress.txt -- are there manual actions the user performs weekly?
  • tasks-completed.md -- are the same types of tasks being completed repeatedly?

Real example: the task list grew from 5 items to 53 items over 3 weeks. The evaluator detected this pattern and proposed a task-triage skill: run weekly, prune completed items, archive stale tasks, flag items that have been active for more than 14 days.

Another example: the user manually ran "find me 3 new content ideas" every Wednesday. The evaluator proposed a topic-discovery skill to automate it.

Missing skill detection is how the agent system grows itself. The evaluator proposes, the user approves, and the new skill gets built and scheduled.
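A rough version of the recurring-pattern scan can be sketched as below. It assumes progress.txt lines follow a `[timestamp] -- action` shape; adjust the separator for your actual log format. Normalizing digits is a cheap trick so "find me 3 new content ideas" and "find me 5 new content ideas" count as the same pattern.

```python
import re
from collections import Counter

def recurring_candidates(progress_text: str, min_count: int = 3):
    """Surface actions that repeat often enough to deserve their own skill.

    Assumes progress.txt lines look like "[timestamp] -- action".
    """
    actions = Counter()
    for line in progress_text.splitlines():
        if " -- " not in line:
            continue
        action = line.split(" -- ", 1)[1].strip().lower()
        # Collapse numbers so "find 3 ideas" and "find 5 ideas" match.
        actions[re.sub(r"\d+", "N", action)] += 1
    return [action for action, n in actions.items() if n >= min_count]
```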


See It: The Course and Content Audit

The evaluator has one more capability: it checks if any improvements are worth sharing with others. This is the meta-meta level.

When the evaluator discovers a pattern that improved multiple skills (like Step 0 pre-checks or idle-skip), it flags it as a potential course update or blog post:

### Shareable Patterns Detected
 
- **Step 0 Pre-Check** applied to 3 skills (log-monitor, browser-verify,
  link-checker). Universal pattern. Consider adding to course.
- **Idle Skip** applied to 2 skills (content-creator, topic-discovery).
  Reduces noise. Consider adding to architecture docs.

The skill evaluator recommended adding itself to the course. That is how this lesson exists.


Build It: Skill Evaluator

Intent: Create the meta-skill that audits all other skills nightly.

Prompt for Claude Code:

Create the directory .claude/skills/skill-evaluator/ and then create
.claude/skills/skill-evaluator/SKILL.md with this content:

# Skill Evaluator -- Nightly Audit

Schedule: 3:03 AM daily

## Input

Read ALL of these files:
- .claude/learnings.md -- learning events by skill
- .claude/error-log.md -- errors by skill
- .claude/progress.txt -- skill execution history
- .claude/failed-jobs.log -- failure counts by skill
- .claude/cron-jobs.json -- current schedule and status

Read every SKILL.md file in .claude/skills/*/SKILL.md.

## Process

### Step 0: Capability Pre-Check
- Verify all state files exist and are readable
- If any critical file is missing, log to failed-jobs.log and exit

### Step 1: Collect Evidence
For each skill found in .claude/skills/:
- Count runs in the last 7 days (from progress.txt)
- Count failures in the last 7 days (from failed-jobs.log)
- Check if any errors reference this skill (from error-log.md)
- Check if any learnings reference this skill (from learnings.md)
- Read the SKILL.md to understand intent and schedule

### Step 2: Score Each Skill (1-5 on each dimension)
- Effectiveness: Is it producing output that gets used?
- Approach: Is the method still the best approach?
- Relevance: Is this skill still needed?
- Reliability: Does it run without failing?
- Efficiency: Is the context/output cost reasonable?

### Step 3: Generate Recommendations
For any dimension scoring below 5:
- HIGH priority for scores 1-2
- MEDIUM priority for score 3
- LOW priority for score 4 with a clear improvement path

### Step 4: Detect Missing Skills
Scan tasks-active.md and progress.txt for:
- Recurring manual tasks (same pattern 3+ times in 14 days)
- Growing lists (task count increasing without pruning)
- Repeated user actions that follow a predictable pattern
Propose new skills for any detected patterns.

### Step 5: Check for Shareable Patterns
If any improvement was applied to 2+ skills, flag it as a
potential course/blog/architecture update.

## Output

Write report to .claude/reports/skill-eval-[date].md with:
- Summary (skills count, average score, recommendation count)
- Score table (all skills, all 5 dimensions, average)
- Recommendations list (priority, skill, issue, fix)
- Missing skill proposals (if any)
- Shareable patterns (if any)

Keep the report under 400 words. Be specific in recommendations.

## State Update

- Append to progress.txt: "[timestamp] -- Skill Evaluator:
  {skills_count} skills, avg {score}/5, {rec_count} recommendations"
- If any skill scores below 2.0 average: flag in next heartbeat
- If missing skills detected: add proposal to tasks-active.md as P3

Expected output: A complete skill-evaluator SKILL.md file.


Build It: Add Cron Entry

Intent: Schedule the skill evaluator to run nightly at 3:03 AM.

Prompt for Claude Code:

Add a new entry to .claude/cron-jobs.json:

{
  "id": "skill-evaluator",
  "skill": ".claude/skills/skill-evaluator/SKILL.md",
  "schedule": "3 3 * * *",
  "description": "Nightly audit of all skills: score, recommend, detect missing",
  "enabled": true,
  "expires": "7d",
  "last_run": null
}

Keep all existing entries.

Expected output: cron-jobs.json with the skill-evaluator entry added.
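After editing cron-jobs.json by hand or by prompt, a quick lint pass catches typos before 3 AM does. This sketch assumes the file holds a JSON array of entries shaped like the one above; only `id`, `skill`, `schedule`, and `enabled` are treated as required, which is an assumption about your scheduler.

```python
import json

REQUIRED_KEYS = {"id", "skill", "schedule", "enabled"}

def validate_jobs(jobs):
    """Lint a list of cron entries; returns a list of problem strings.

    Assumes cron-jobs.json holds a JSON array of entries like the one above.
    """
    problems, seen_ids = [], set()
    for job in jobs:
        job_id = job.get("id", "?")
        missing = REQUIRED_KEYS - job.keys()
        if missing:
            problems.append(f"{job_id}: missing keys {sorted(missing)}")
        if job_id in seen_ids:
            problems.append(f"duplicate id: {job_id}")
        seen_ids.add(job_id)
        if len(str(job.get("schedule", "")).split()) != 5:
            problems.append(f"{job_id}: schedule is not 5-field cron syntax")
    return problems

# Usage: problems = validate_jobs(json.load(open(".claude/cron-jobs.json")))
```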


Build It: Register with Heartbeat

Intent: Add the skill-evaluator to the heartbeat's verification list.

Prompt for Claude Code:

Update .claude/skills/heartbeat/SKILL.md. Add this line to the
"Also verify these skill files exist" list:

- .claude/skills/skill-evaluator/SKILL.md

The heartbeat should now verify the skill-evaluator exists and
is healthy, just like it verifies every other skill.

Expected output: Updated heartbeat skill with the skill-evaluator in its verification list.


Checkpoint

After this lesson, your project should contain:

your-project/
  CLAUDE.md
  .claude/
    preferences.md
    tasks-active.md
    tasks-completed.md
    progress.txt
    error-log.md
    learnings.md
    auto-resolver.md
    priority-map.md
    cron-jobs.json                   # 7 jobs (added skill-evaluator)
    failed-jobs.log
    settings.local.json
    hooks/
      stop-telegram.sh
      permission-gate.sh
    skills/
      daily-planner/
        SKILL.md
      pr-reviewer/
        SKILL.md
      git-reviewer/
        SKILL.md
      standup-generator/
        SKILL.md
      meeting-ingest/
        SKILL.md
      heartbeat/
        SKILL.md                     # Updated -- now verifies skill-evaluator
      skill-evaluator/
        SKILL.md                     # NEW

The skill-evaluator runs at 3:03 AM, after all daily skills have finished. It reads everything, scores every skill on 5 dimensions, generates recommendations, and detects manual tasks that should become automated skills. The heartbeat verifies it exists. The learning loop captures any patterns it discovers. The system monitors itself.


Fork It

  • Different schedule? Run weekly instead of nightly if you have fewer than 5 skills. Nightly evaluation makes sense at 7+ skills.
  • Different dimensions? Add "User Satisfaction" if you track whether you actually read skill output. Add "Cost" if you want to track token usage per skill.
  • Auto-fix mode? For LOW priority recommendations, let the evaluator apply the fix autonomously (e.g., adding idle-skip to a skill). Require approval for MEDIUM and HIGH changes.
  • Team evaluation? If multiple people share skills, add a "Coverage" dimension -- is the skill useful for the whole team or just one person?
  • Evaluation of the evaluator? The skill evaluator can evaluate itself. If it consistently produces 0 recommendations, either your skills are perfect or the evaluator is not looking hard enough.

Next: the evaluator finds patterns. The patterns improve skills. The improved skills produce better data. The evaluator finds new patterns. This is the loop that makes the agent better without you touching it.


This is part of the Build Your Own Dev Agent course.
