Agent Skills Series (4): Save three rounds—freeze what you keep correcting the model into a Skill
📌 About this series
If you already agree that “conventions shouldn’t be re-explained every turn” (part 1), and you know how to write and install SKILL.md (parts 2 and 3), the next question is: what content is worth turning into a Skill.
💸 Why this ties straight to “saving tokens”
When you work with a model, the hidden cost is often turns, not the length of a single answer: you correct once, it half-fixes, you add constraints, it drifts again—each round reloads the same preferences and the same team rules, and tokens snowball.
A Skill moves stable, reusable information that should ideally take effect in round one into what the agent can treat as known state. A good Skill has a clear goal: one fewer “no—that’s not how we do it here.”
Example: without a Skill, round one might be npm install && axios.get(...); with a Skill, round one is pnpm install && import { request } from '@/api/client'—you drop the next two or three correction turns entirely.
Below are practical “should this be a Skill?” criteria you can apply by checklist.
✅ Gate first: all three, then commit a Skill slot
| Criterion | Meaning | Tiny example |
|---|---|---|
| Repeats | In the last two weeks you said the same thing to the model more than three times across different tasks | Every new page: “use useRequest, not axios.” |
| Stable | Unlikely to churn every week for the next six months (what changes belongs in a version number or CHANGELOG, not oral chat) | “JWT in an httpOnly cookie” vs “this sprint’s goal.” |
| Executable | Can be written as checklists, bans, or copy-pastable snippets the model can follow directly | “Must have tests” is vague; “*.test.ts next to the source file, vitest” is executable. |
If only one is true, it probably belongs in docs, issue templates, or CI. All three justify occupying a Skill slot.
Before you write, list 2–3 concrete scenarios in one line each: how the user asks and what “done” looks like. Keep the root body tight; park big examples and reference material under references/, and in the root file only say when to open which file. Many Skills coexist—write description narrowly and specifically; prefer several small Skills over one broad Skill whose triggers collide.
🎓 Crosswalk: DeepLearning.AI “Agent Skills with Anthropic”
The short course on DeepLearning.AI (about an hour, beginner-friendly) has several lines you can map directly to “what deserves a Skill.” Here we only keep what matters strongly for topic choice and execution.
📚 Three buckets the course uses to “place” Skills
- Domain expertise: brand guidelines, legal workflows, analytics definitions—team consensus on “how we do it,” often living in docs and people’s heads.
- Repeatable workflows: weekly reviews, customer prep, quarterly retros—fixed steps you write once and reuse.
- New capabilities: via `scripts/`, templates, Office assets, etc.—so the model can reliably produce spreadsheets, documents, PDFs, slides, and other deliverables that otherwise drift.
If a pack of content also satisfies repeat / stable / executable from the table above, it is an especially good candidate to freeze as a Skill.
🧩 “Standardized domain cognition” and progressive loading
A useful summary from the course: a Skill is standardized domain cognition—also a digital asset you can reuse across teams and projects: experience is not tied to one chat thread, but to a directory you can version and cite. That is the same claim as “fewer correction turns” at the top.
Mechanically, progressive loading (metadata for each Skill at startup, full SKILL.md when a task matches, references/ and scripts only when needed) is an advantage over “stuff every rule into system prompt or paste the same wall of text every message”: less context, less confusion—the more conventions you have, the wider the gap.
🎨 Three design-forward anchors (stack with “8. Design specs and handoff” below)
- Design QA Skill: a pre-ship “what did we miss” checklist; turn a senior colleague’s mental checklist into clauses, each with bad / good examples where possible. `description` must spell out capability, scenario, and typical user phrases, or the model will not select it at catalog time. YAML and how to split the body live in Concepts in practice.
- Component / system Skill: tokens, grid, component inventory, page skeletons—long material in `references/`; the root file only states selection boundaries and when to read which file. With multiple maintainers, use Git and explicit owners so rules do not fork. If the front end exposes the library via MCP as a queryable interface, the Skill can say how to connect, reducing guessed props and variants.
- Business context Skill: personas, main flows, terms, line-of-business differences—do not try to cover everything in v1; start with the highest-frequency chunks that most often make outputs detach from reality.
🚀 How to start: install official packs, then write your own
Anthropic’s official repository already ships many ready-to-install Skills. For office formats, packs like xlsx / docx / pdf / pptx noticeably improve how the model handles those file types once installed. For custom Skills, you can use Anthropic’s Skill Creator to scaffold and debug locally. Ship a concise first version (about 50–100 lines), watch real usage for a few days, find where execution falls short, and iterate three to five rounds. When drafting, you can also have an AI produce a first pass from your requirements, then you focus on tightening triggers and bans.
🎯 Especially good Skill material (by scenario)
1️⃣ Team defaults—the kind that go wrong if you say nothing
General, cross-repo conventions (same rules for greenfield and legacy) fit Skills well: change the Skill once and it applies broadly; if the same rules live in every repo’s README, they drift and versions fall out of sync.
Typical content and examples:
- Package manager, Node version, script commands. Example: the Skill states “pnpm only; install with `pnpm install`; never `npm install`,” so the model does not emit `npm i lodash`.
- Directory conventions. Example: “pages only under `src/views/`; HTTP only via `import` from `@/api/client`,” so the model does not drop new pages under `pages/` or call `fetch` at random.
- Error handling. Example: “business errors throw `AppError`; a global filter turns them into JSON; never swallow exceptions in controllers with `console.log`.”
- Tests. Example: “unit tests live next to sources: `foo.ts` ↔ `foo.test.ts`; vitest; mocks under `__mocks__/`.”
Where models slip: they default to generic stacks (axios, common layouts) that do not match your wrappers—those mistakes cost you every turn. They are best encoded once in a Skill.
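The error-handling convention above can be sketched in a few lines. `AppError` and the JSON body shape are the fictional conventions from this section, not a definitive implementation; adapt the names to your own codebase.

```typescript
// Hypothetical AppError convention from the bullets above (a sketch, not
// a definitive implementation); adapt names to your codebase.
export class AppError extends Error {
  constructor(
    public readonly code: string,
    message: string,
    public readonly status: number = 400,
  ) {
    super(message);
    this.name = "AppError";
  }
}

// The "global filter": map any thrown value to the JSON error body, so
// controllers never swallow exceptions with console.log themselves.
export function toErrorResponse(err: unknown, requestId: string) {
  if (err instanceof AppError) {
    return {
      status: err.status,
      body: { code: err.code, message: err.message, request_id: requestId },
    };
  }
  return {
    status: 500,
    body: { code: "INTERNAL", message: "Unexpected error", request_id: requestId },
  };
}
```

A Skill that states this once (throw `AppError`, let the filter format it) saves the recurring "do not catch-and-log in controllers" correction.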
2️⃣ High-frequency scaffolds—the kind that cost the most words to describe orally
Typical content and examples:
- New API skeleton. Example: the Skill states “list endpoints use `GET /resources?cursor=&limit=`; responses wrap `{ data, next_cursor }`; headers carry `X-Request-Id`,” so the user can say “add an order list API” and still align.
- New page skeleton. Example: “list pages default to `<PageLayout>` + `<ProTable>` + `useListQuery`; empty state uses `<EmptyState type="list" />`,” so the model does not start from a bare `<table>`.
- PR / migration checklists. Example: “the PR description must cover: schema change? need `pnpm migrate`? mobile impact?”—the model fills the template after edits and you skip a round of “please add PR notes.”
Value: the user only says “add a user list page”; the Skill fills in what “a list page here” looks like, instead of pulling from a blank component.
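The `{ data, next_cursor }` wrapper above can be sketched as a pure helper. The field names follow this section's convention; the slicing logic itself is illustrative and assumes an already-ordered collection.

```typescript
// Illustrative cursor pagination; the { data, next_cursor } shape follows
// the convention above, the rest is a sketch.
export interface ListResponse<T> {
  data: T[];
  next_cursor: string | null;
}

export function paginate<T extends { id: string }>(
  items: T[],
  cursor: string | null,
  limit: number,
): ListResponse<T> {
  // An unknown cursor falls back to the start (findIndex returns -1, so -1 + 1 = 0).
  const start = cursor ? items.findIndex((item) => item.id === cursor) + 1 : 0;
  const page = items.slice(start, start + limit);
  const hasMore = start + limit < items.length && page.length > 0;
  return {
    data: page,
    next_cursor: hasMore ? page[page.length - 1].id : null,
  };
}
```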
3️⃣ Backend and microservices—use Skills for an architecture map and stubs
Vibe-coding tools often make front ends look better for two reasons: public training has tons of single-repo SPA and component-library examples, and front-end code tends to live in one repo and one UI layer, so context is easier to “see at a glance.” Backends are often many services, many repos, many configs, many middleware pieces: who sits behind the gateway, who is called synchronously, what the async topics are called, which table belongs to which team, how staging differs from prod—none of that is in the model’s default assumptions, so you get plausible three-layer CRUD that looks right and fails at runtime.
Worth splitting into one or a few narrow backend Skills:
- Service inventory and boundaries. Example: a table for `order-service` / `payment-service` / `inventory-service`: repo paths, public path prefixes, forbidden direct payment-table access from the order service.
- Call graph and stubs. Example: in dev, payment calls `https://pay-stub.corp.internal` (Wiremock or your own stub); never hit the real acquiring gateway locally; when joint debugging fails, first check whether prod was hit by mistake.
- Database and schema ownership. Example: orders only read/write the `orders` schema; amount hold state is authoritative in the payment service—the order service must not `JOIN payment.transactions` across services.
- Messaging and async. Example: `OrderPaid` is emitted by payment and consumed by order; topic names, retry behavior, and idempotency field names are fixed so the model does not invent topics.
- Config layers. Example: feature flags live in ConfigMap `feature-xxx`; secrets only in Vault / env injection, not in the repo; no plaintext password placeholders in each service’s `application.yml`.
Style tip: prefer a small C4-style map / dependency table + bans over a wall of architecture PNGs (hard for the model to read, hard to control triggers). When architecture moves, bump the Skill version or CHANGELOG—a stale map hurts worse than no map.
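The "call graph and stubs" rule above can even be enforced in code rather than prose. A minimal sketch, assuming the fictional stub hostname from this section (the prod host and helper name are invented for illustration):

```typescript
// Sketch of the stub rule: non-prod environments must never resolve to a
// real acquiring gateway. Hostnames besides the stub are hypothetical.
type Env = "local" | "dev" | "prod";

const PAYMENT_BASE_URL: Record<Env, string> = {
  local: "https://pay-stub.corp.internal", // Wiremock or your own stub
  dev: "https://pay-stub.corp.internal",
  prod: "https://payments.example.com", // stand-in for the real gateway
};

// Fail loudly if a non-prod environment would ever hit a non-stub host.
export function paymentBaseUrl(env: Env): string {
  const url = PAYMENT_BASE_URL[env];
  if (env !== "prod" && !url.includes("stub")) {
    throw new Error(`non-prod env "${env}" must point at a stub, got ${url}`);
  }
  return url;
}
```

The Skill then only needs to say "resolve payment hosts through `paymentBaseUrl`" instead of re-listing hostnames every chat.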
4️⃣ Pitfall lists—the kind human colleagues rehearse in onboarding
Good as short “if X, check Y” tables; each row can carry one counterexample line:
- Async and races. Example: ban “`await`ing an API inside `watch` with no reentrancy guard”; recommend “`watch` + monotonic `requestId`, drop stale responses.”
- i18n. Example: “user-visible copy only in `locales/zh-CN.json`, keys like `module.feature.label`; no hard-coded Chinese in `<template>`.”
- Security. Example: “never interpolate `${userInput}` into SQL; `.env` not in git; internal hosts only `*.corp.internal`.”
- Performance. Example: “lists with `length > 200` must use virtual scrolling; no `JSON.parse` of huge strings inside `computed`.”
Where models slip: they suggest code that runs but violates your red lines—bans + counterexamples in a Skill often beat a long architecture essay.
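The "monotonic `requestId`, drop stale responses" recommendation above can be sketched framework-agnostically; `createLatestOnly` is a hypothetical helper name, and wiring it into `watch` is left to your framework.

```typescript
// Sketch of the race guard: each call gets a monotonically increasing id,
// and only the newest in-flight request is allowed to deliver a result.
export function createLatestOnly<T>(fetcher: (query: string) => Promise<T>) {
  let requestId = 0;
  return async (query: string, onResult: (result: T) => void): Promise<void> => {
    const id = ++requestId; // id for this request
    const result = await fetcher(query);
    if (id === requestId) onResult(result); // drop stale responses
  };
}
```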
5️⃣ Code review rubric—turn “nits” into citeable clauses
Collect your most common review comments as bullets; one sentence each is often enough for the model to align:
- Naming and readability. Example: “verbs for functions, booleans prefixed `is`/`has`; more than three levels of nesting must be extracted.”
- Test boundaries. Example: “pure functions under `src/pure/**` can skip tests; changes under `payment/` or `auth/` must include unit tests.”
- Logging and observability. Example: “log `request_id` at public entry points; never `logger.info` an entire PII payload.”
Value: the first draft looks closer to “would pass our PR,” with fewer “fix it again after review” round trips.
6️⃣ External system “incantations”—docs scattered, you search every time
Each bullet should capture the stable half and give the model a sentence it can reuse:
- Gateway and error codes. Example: “public prefix `/gw/v1`; `40102` means token expired—client should silently refresh once.”
- Release and rollback. Example: “prod goes through the `deploy-prod` job; if red, `kubectl rollout undo deployment/api -n prod` first, then page the release owner.”
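The "40102 means token expired, silently refresh once" rule can be sketched as a tiny wrapper. The code 40102 comes from the example above; `call` and `refreshToken` are hypothetical callbacks.

```typescript
// Sketch of the silent-refresh rule: on 40102, refresh the token once
// and retry once; any other code passes through untouched.
export async function gwFetch(
  call: () => Promise<{ code: number; data?: unknown }>,
  refreshToken: () => Promise<void>,
) {
  let res = await call();
  if (res.code === 40102) { // token expired per the gateway convention above
    await refreshToken();
    res = await call();
  }
  return res;
}
```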
Note: Skills carry stable subsets; if information depends on live state (who is on call tonight, current rate limits), use links or MCP—do not treat the Skill as a live database. Example: write “on-call roster: fixed wiki page,” not “tonight is Alice.”
7️⃣ Redaction, security boundaries, and analytics events (growth / data folks ask often)
All three are high-frequency, easy to get wrong, and risky for compliance or public sentiment—good candidates for standalone Skills (optionally separate from engineering Skills so description fields do not fight for triggers).
- Redaction (logs, screenshots, model context, support scripts). Example: mask phone numbers as `138****8000`; ID cards keep last six digits only; logs must never contain full tokens / cookies / card numbers; list which fields count as PII so generated sample data always uses fake identities and numbers.
- Security boundaries. Example: production connection strings, signing private keys, KMS passphrases never pasted into chat or the Skill body (the Skill only says which class of secret store and which env var names); internal admin URLs never in externally visible copy; generated code must not use `eval` / dynamic `Function` on user input.
- Analytics / events. Example: event names like `module_page_action` (e.g. `order_list_expose`); common properties fixed as `app_version`, `channel`—never put phone numbers or `openid` into event properties; table of first-screen funnel events and optional params; how to compose A/B fields with your experiment platform (if any).
Where models slip: invented event names, plaintext in properties, or treating redaction as “roughly mask”—tables + counterexamples work best.
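A minimal sketch of the masking rules above, assuming mainland-style 11-digit phone numbers as in the `138****8000` example; the function names are invented for illustration.

```typescript
// Sketch of the redaction rules above; patterns assume 11-digit phone
// numbers and fixed-length ID card strings as in this section's examples.
export function maskPhone(phone: string): string {
  // 138****8000: keep the first 3 and last 4 digits.
  return phone.replace(/^(\d{3})\d{4}(\d{4})$/, "$1****$2");
}

export function maskIdCard(id: string): string {
  // Keep only the last six characters; pad the rest with asterisks.
  return id.length > 6 ? "*".repeat(id.length - 6) + id.slice(-6) : id;
}
```

Putting the exact masks in the Skill (with these counterexamples) beats "roughly mask" every time.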
8️⃣ Design specs and handoff—for designers and for people using AI for UI / copy
The point of a design-system Skill is to keep AI-generated drafts consistent with what you ship in Figma and your component library, not “pretty but wrong brand.” Usually you use three layers together: pre-ship QA checklist, component and token cheat sheet (aligned with code), and business context and common terms (grounded in real flows).
Typical content and examples:
- Tokens and spacing. Example: “corner radius only `sm`/`md`/`lg` mapped to design tokens; do not hand-type `#1a1a1a` as primary text—use `color.text.primary`.”
- Components and states. Example: “only one primary button height; empty, loading, and error states each use the designated components—do not invent placeholder blocks.”
- Exports and annotations. Example: “export `@2x` PNG or SVG; icons on a 24px grid; asset names `ic_feature_state`.”
- Copy and tone. Example: “errors use a ‘please try again later’ tone, not blaming the user; titles at most two lines.”
- Accessibility floor. Example: “body text vs background at least WCAG AA contrast; hit targets at least 44×44 (or your adopted minimum).”
Designers can treat the Skill as a briefing for Copilot / chat models: when writing specs, revising drafts, or producing multilingual notes, bring grid and token constraints in round one instead of pixel-pushing color values back and forth.
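A component/token cheat sheet "aligned with code" could be as small as this sketch; the token names and values here are invented for illustration, and your real design tokens are the source of truth.

```typescript
// Illustrative token map for the rules above; generated code should
// reference tokens, never hand-typed hex values or pixel radii.
export const tokens = {
  color: { text: { primary: "#1a1a1a", secondary: "#666666" } },
  radius: { sm: 4, md: 8, lg: 16 }, // the only allowed corner radii
} as const;

export function radius(size: keyof typeof tokens.radius): number {
  return tokens.radius[size];
}
```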
See the table below for acme-privacy-redaction, acme-analytics-events, and acme-design-system-handoff.
9️⃣ When it must be steady and exact—let AI orchestrate; put truth in the next layer
SKILL.md is a good home for process, constraints, and entry commands; it is a poor home for “must be computed or looked up correctly inside the chat.” Big-number sums, complex aggregates, reconciliation diffs, regex edge cases, timezone math, authorization outcomes—the model can look plausible yet be occasionally wrong, and every correction costs more tokens.
Steadier pattern:
- Skill body: when to run which script / HTTP call / MCP, how to pass arguments, how to read the returned JSON or exit code. The model chooses the path, assembles arguments, and summarizes output; it does not do the math by hand.
- `scripts/` (or an existing repo CLI): put deterministic logic in Node / Python / Shell—testable, versioned; same input, same output.
- Internal APIs / read-only query services: for exact numbers or live data, use controlled calls and treat the response as fact; do not let the model “estimate from training memory.”
Example: user asks “how many lines in this CSV”—the Skill requires wc -l or your wrapper node scripts/count-lines.mjs first; the number in the answer must come from stdout, not mental math. Example: reconciliation—must run scripts/reconcile.mjs --input ledger.json; the model only explains diff_cents from script output, never recomputes the ledger in chat.
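For the CSV example above, the deterministic part could live in a tiny pure function inside the hypothetical `scripts/count-lines.mjs`; the model only relays its output.

```typescript
// Sketch of the counting logic a scripts/count-lines.mjs wrapper could
// expose. Matches `wc -l` for newline-terminated input; a trailing
// fragment without a final newline still counts as one line here.
export function countLines(text: string): number {
  if (text.length === 0) return 0;
  const parts = text.split("\n").length;
  return text.endsWith("\n") ? parts - 1 : parts;
}
```

Same input, same output: the number in the chat comes from this function's return value, never from mental arithmetic.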
The LLM orchestrates and explains; precision comes from programs or the database. Official Skill layouts already support scripts/, references/, etc.—precisely to move the “easy to get wrong” parts out of the chat.
🚫 Usually not worth a dedicated Skill (common misjudgments)
| Type | Why | Better home | Example |
|---|---|---|---|
| One-off | Disposable | Say it in the current thread | “Turn this JSON into CSV” once—no Skill. |
| Whole architecture book | Too long, triggers hard to control | Split into narrow Skills, or repo docs/ | Fifty pages of domain model → split into orders-api-conventions and billing-events. |
| Schedule that changes every minute | Not stable knowledge | Calendar / ticket system | “Who owns story 3721 this week?”—ask Jira/calendar, not a Skill. |
| Pure data lookup | Hallucination risk | MCP, internal API, read-only query tools | “East China GMV last quarter”—use BI or MCP, not training-memory guesses. (Same idea as “9. When it must be steady and exact…” above.) |
📊 After you write it: how to tell you really saved turns
You can judge subjectively or add light metrics; each bullet should be observable:
- Before. Example: “add CRUD API” averages four turns—turn two fixes pagination, turn three fixes error shape, turn four adds `request_id`.
- After. Example: similar tasks show `cursor`/`limit` and a unified error body in round one; corrections drop to 0–1 turns.
- Trigger probes. Write a few real user phrases: scenarios that should load the Skill, and confusing ones that should not; re-test with paraphrases (same spirit as the official “should / should not trigger” guidance).
- Bad signals. Example: `design-tokens` loads on “how do I optimize this SQL”—`description` is too broad or overlaps another Skill, wasting tokens.
Fewer turns means less repeated loading of the same “correction paragraph”—less wasted context and retry cost. A Skill is not better because it is longer; a good Skill reliably cuts correction count.
📐 Mini structure template
The main file must be named SKILL.md; the skill directory uses kebab-case; do not put README.md inside the pack (put notes in SKILL.md or references/). Front matter description has a length cap; do not wrap content in angle brackets; do not prefix skill names with claude- or anthropic- (platform reserved).
Keep the body short and decisive: triggers in front matter; body is must / forbid / example. If there are deterministic steps (stats, reconciliation, formatting huge files), the body must name the scripts/ or CLI entry—do not let the model do the work by hand in chat.
---
name: acme-api-new-endpoint
description: Add HTTP endpoints on the Acme service. Use when the user asks for REST routes, controllers, handlers, or “expose a new API.”
---
# Acme new-endpoint checklist
## Must
- Route prefix: `/api/v2`
- Pagination: `limit` + `cursor`; ban `offset` pagination
- Error body: `{ "code", "message", "request_id" }`
## Forbidden
- Raw SQL string-built in handlers
- Exposing internal ids without auth
## Example (snippet)
```typescript
import { randomUUID } from "node:crypto";

// POST /api/v2/orders — 422 on body validation + { code, message, request_id }
export async function createOrder(req: Request) {
  const requestId = req.headers.get("x-request-id") ?? randomUUID();
  // ...
}
```
Keep description narrow: prefer several small Skills over one “full-stack omnibus” that fights other Skills for triggers. Bad: “use when touching backend, database, performance, or security.” Good: “use when adding or changing database schema (add/drop columns, table structure).”
📦 Fictional team examples
Below is a fictional “Acme” team and a sample set of Agent Skills they might maintain—for reference only.
| Theme | Typical folder name |
|---|---|
| Team defaults | acme-team-defaults |
| List-endpoint scaffold | acme-rest-list-endpoint |
| Endpoint implementation guide | acme-api-endpoint-guide |
| Front-end team conventions | acme-frontend-team-guide |
| Code review rubric | acme-pr-review-rubric |
| Platform / release stable subset | acme-platform-runbook |
| Cron and queue best practices | acme-scheduled-tasks-guide |
| Redaction and log safety | acme-privacy-redaction |
| Analytics naming and properties | acme-analytics-events |
| Design specs and handoff (designers) | acme-design-system-handoff |
| Backend map and stubs | acme-backend-landscape |
🧭 Series map (sibling posts)
| Part | Post | What it covers |
|---|---|---|
| 1 | Pain points and motivation | Why Skills |
| 2 | Concepts in practice | SKILL.md structure and progressive disclosure |
| 3 | Tooling | skill-base / skb distribution and install |
| 5 | Admin query page case study | A project-level Skill refactor example |
| 6 | Operating playbook | Triggers, conflicts, and version governance |
| Extra | Fragmentation | Trigger governance when many tools coexist |
📖 Further reading
- DeepLearning.AI: Agent Skills with Anthropic—short course entry (pairs with the crosswalk above).
- Anthropic: Agent Skills overview—official notes on skill directories, `description`, testing, and troubleshooting; if another channel disagrees, follow the official docs.
- Concepts in practice—`SKILL.md` format and progressive disclosure; read this when you are unsure how to structure the body.
- Fragmentation—boundaries and triggers when many Skills coexist; the more you write, the more you need narrow `description` fields aligned with this post.
✍️ Try keeping a scratchpad for a week: how many times you repeated the same correction to the model. After two weeks, the highest-frequency lines are usually your next Skill candidates—for example, if you keep saying “the PR is missing migration notes,” consider a narrow Skill like pr-checklist-db.