Two months ago, Anthropic introduced Agent Skills: folders of instructions that Claude loads dynamically to perform specialized tasks. Last week, OpenAI quietly adopted the same format in ChatGPT and Codex CLI. Skills are becoming the standard for runtime specialization.
But Skills only work at inference. They are instructions injected into the context window, sophisticated prompt engineering with file system access. The model reads them, hopefully follows them, and you observe what happens.
This raises two questions:
- How do you verify a model is actually following a skill?
- How do you train a model to execute a skill reliably?
Skills answer neither. CAPE answers both.
The Gap
Today, if you want a model to be reliably good at a skill, your options are limited:
Prompt harder. Write better SKILL.md files. This is literally what Skills are. Works sometimes. Fails unpredictably. No verification.
Fine-tune on examples. Collect demonstrations, run SFT, hope the model generalizes. But you have no specification of what "correct" means, just examples you hope capture it.
RLHF with skill-specific feedback. Have humans rate outputs. But preferences are not correctness. Raters cannot verify complex constraints, and you are paying per example, per skill, forever.
None of these let you say: "Here is what this skill requires. Verify every output. Correct violations. Train until the model satisfies these requirements by default."
The Conversion
We took all 16 skills from Anthropic's official repository and converted them to CAPE policy packs.
The result: executable specifications that define what correct skill execution looks like, not as instructions the model might follow, but as predicates the model must satisfy.
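To make the idea concrete, here is a minimal Python sketch of a policy as a set of named, executable predicates. The structure and names are illustrative assumptions, not the CPL syntax used in the actual policy packs.

```python
# Illustrative sketch only: a policy is a set of named, executable predicates over a model output.
# This is not the CPL format; names and structure are assumptions made for the example.
import os
from dataclasses import dataclass
from typing import Callable

@dataclass
class Predicate:
    name: str
    check: Callable[[str], bool]  # receives a path to the produced artifact

def verify(artifact_path: str, policy: list[Predicate]) -> list[str]:
    """Return the names of every predicate the artifact violates."""
    return [p.name for p in policy if not p.check(artifact_path)]

# Two slack-gif-creator style constraints expressed as predicates.
gif_policy = [
    Predicate("file_size_le_64kb", lambda p: os.path.getsize(p) <= 65536),
    Predicate("has_gif_extension", lambda p: p.lower().endswith(".gif")),
]
```

Calling `verify` on an artifact returns an empty list when every constraint holds; a non-empty list names exactly what failed.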
| Anthropic Skills | CAPE Policies |
|---|---|
| Instructions at inference | Specifications verified at inference or training |
| Hope model follows | Verify every output, correct violations |
| Observe what happens | Know exactly what failed and why |
| Prompt engineering | Capability engineering |
What We Found
About 44% of Anthropic's skills have clear, verifiable constraints out of the box:
| Skill | Verifiable Constraints |
|---|---|
| pdf | Valid PDF structure, form fields typed correctly, merge page count preserved |
| docx | Valid OOXML, tracked changes have metadata, content preserved on edit |
| xlsx | Formulas parse, data types match declarations, schema compliant |
| pptx | Valid structure, layouts applied, media references resolve |
| slack-gif-creator | File size ≤ 64KB, dimensions 128x128, frame count > 1 |
| webapp-testing | Valid syntax, assertions present, async methods awaited |
| artifacts-builder | Valid JSX, imports resolve, hooks rules followed |
These map almost directly to predicates. The slack-gif-creator skill even includes size and dimension requirements in the text. We just formalized them.
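As a sketch of how directly these formalize, the slack-gif-creator constraints can be checked statically with nothing more than a GIF parser (Pillow here, which is an assumption; any parser that exposes dimensions and frame count works):

```python
# Sketch of the slack-gif-creator predicates; Pillow is an assumed dependency.
import os
from PIL import Image

def gif_violations(path: str) -> list[str]:
    """Statically check a GIF artifact against the skill's stated constraints."""
    problems = []
    if os.path.getsize(path) > 65536:           # file_size_bytes() <= 65536
        problems.append("file size exceeds 64KB")
    with Image.open(path) as im:
        if im.size != (128, 128):               # width() == 128, height() == 128
            problems.append(f"dimensions {im.size} are not 128x128")
        if getattr(im, "n_frames", 1) <= 1:     # frame_count() > 1
            problems.append("not animated: frame count <= 1")
    return problems
```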
The other 56% contain subjective guidance: "avoid generic AI slop," "pick a BOLD aesthetic direction," "write clear communications." These seem unverifiable.
But as we know from our CAPE research paper: there are no unverifiable skills, only underspecified ones.
Making the Subjective Objective
Every vague requirement becomes verifiable once you fix your context and name your assumptions. The question is not "is this verifiable?" but "what assumptions make this verifiable?"
Take frontend-design. The skill says to create "distinctive, production-grade interfaces" and "avoid generic AI slop." Sounds subjective. But what does "AI slop" actually look like structurally?
We fixed it:
| Original Guidance | Our Assumption | Predicate |
|---|---|---|
| "avoid generic AI slop" | Slop = uniformity; good design = variety | unique_border_radius_count(css) >= 3 |
| "avoid generic AI slop" | Intentional palette, not too few or too many | unique_color_count(css) in [4, 12] |
| "avoid generic AI slop" | Typographic restraint | font_family_count(css) <= 3 |
| "production-grade" | Accessible by default | min_color_contrast(output) >= 4.5 |
| "production-grade" | Works on multiple devices | breakpoint_count(output) >= 2 |
All verification is static: parsing CSS, counting unique values, calculating contrast ratios. No browser, no Lighthouse, no execution.
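For illustration, the three counting predicates above can be approximated with plain regular expressions over the generated CSS. The regexes and file path are deliberately simplified assumptions; a real implementation would likely use a proper CSS parser.

```python
# Simplified sketch of the frontend-design counting predicates; regexes are approximations.
import re

def unique_border_radius_count(css: str) -> int:
    return len({v.strip() for v in re.findall(r"border-radius\s*:\s*([^;}]+)", css)})

def unique_color_count(css: str) -> int:
    # Counts distinct hex colors only; rgb()/hsl() handling is omitted for brevity.
    return len({c.lower() for c in re.findall(r"#[0-9a-fA-F]{3,8}\b", css)})

def font_family_count(css: str) -> int:
    return len({v.strip().lower() for v in re.findall(r"font-family\s*:\s*([^;}]+)", css)})

css = open("output/styles.css").read()          # hypothetical path to the generated stylesheet
passes = (
    unique_border_radius_count(css) >= 3
    and 4 <= unique_color_count(css) <= 12
    and font_family_count(css) <= 3
)
```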
Are these the right assumptions? That depends on your context. But they are explicit assumptions, visible, adjustable, verifiable. And they are configurable:
High-end agency work:
- min_border_radius_variety: 4
- min_color_count: 6
- max_font_families: 2
Rapid prototyping:
- min_border_radius_variety: 1
- min_color_count: 3
- max_font_families: 4
The policy does not define quality. You do. The predicates verify your definition objectively.
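A sketch of how the same checks bend to different profiles: the measurements would come from counting helpers like the ones above, and the profile values mirror the two configurations just listed.

```python
# The predicates stay fixed; the profile defines what "good" means in your context.
AGENCY = {"min_border_radius_variety": 4, "min_color_count": 6, "max_font_families": 2}
PROTOTYPE = {"min_border_radius_variety": 1, "min_color_count": 3, "max_font_families": 4}

def satisfies(measurements: dict, profile: dict) -> bool:
    return (
        measurements["border_radius_variety"] >= profile["min_border_radius_variety"]
        and measurements["color_count"] >= profile["min_color_count"]
        and measurements["font_families"] <= profile["max_font_families"]
    )
```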
With this approach, 100% of skills become specifiable.
The Full Conversion
| Skill | Key Predicates | Assumptions |
|---|---|---|
| docx | is_zip_archive(), xml_schema_valid(), text_content_preserved() | None |
| pdf | starts_with_pdf_header(), pdf_parser_succeeds(), form_field_mismatch_count() == 0 | None |
| pptx | xml_schema_valid(), malformed_slide_count() == 0, missing_media_count() == 0 | None |
| xlsx | xml_schema_valid(), formula_syntax_error_count() == 0, type_mismatch_count() == 0 | None |
| slack-gif-creator | file_size_bytes() <= 65536, width() == 128, frame_count() > 1 | None |
| webapp-testing | syntax_valid(), tests_without_assertions_count() == 0, unawaited_async_count() == 0 | None |
| artifacts-builder | valid_jsx_syntax(), hooks_rules_violations() == 0, has_default_export() | None |
| mcp-builder | manifest_schema_valid(), tools_without_handlers_count() == 0, hardcoded_secret_count() == 0 | Minimal |
| skill-creator | has_yaml_frontmatter(), frontmatter_name() == directory_name(), description_length() >= 50 | Light |
| theme-factory | min_text_contrast_ratio() >= 4.5, has_dark_mode_variant(), spacing_scale_adherence() >= 0.8 | Light |
| brand-guidelines | unauthorized_color_count() == 0, unauthorized_font_count() == 0, heading_sizes_descending() | Light |
| frontend-design | unique_border_radius_count() >= 3, min_color_contrast() >= 4.5, font_family_count() <= 3 | Medium |
| internal-comms | flesch_kincaid_grade() <= 10, slang_match_count() == 0, sentiment_score() >= 0 | Medium |
| canvas-design | format() in allowed_formats, content_coverage() >= 0.05, dominant_color_ratio() >= 0.1 | Medium |
| algorithmic-art | valid_p5_syntax(), calls_random_seed(), color_function_count() >= 3 | Medium |
| doc-coauthoring | word_count() >= 100, section_count() >= 2, heading_hierarchy_valid() | Medium |
Every policy pack includes an ASSUMPTIONS.md documenting what we fixed. Adopt them as-is, or adjust to your context.
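As one more sketch, the structural docx predicates reduce to checks any OOXML-aware tool can do statically: a .docx file is a ZIP archive of XML parts. Full schema validation needs the OOXML schemas, so well-formedness stands in for it here.

```python
# Sketch of docx structural checks using only the standard library.
# Well-formedness is a weaker stand-in for xml_schema_valid(), which needs the OOXML schemas.
import zipfile
from xml.etree import ElementTree

def is_zip_archive(path: str) -> bool:
    return zipfile.is_zipfile(path)

def xml_parts_well_formed(path: str) -> bool:
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                try:
                    ElementTree.fromstring(zf.read(name))
                except ElementTree.ParseError:
                    return False
    return True
```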
Static Verification Only
A key principle: all predicates use static analysis of model outputs. No execution, no rendering, no external services.
| Static (what we use) | Execution (what we avoid) |
|---|---|
| Parse file structure | Run code and check output |
| Validate XML schema | Open file in application |
| Check AST for patterns | Execute tests |
| Count and measure properties | Call external APIs |
| Calculate contrast ratios | Render and screenshot |
This matters because CAPE verifies outputs, not behaviors. The model produces artifacts (code, documents, images). We verify those artifacts are structurally correct. Whether they behave correctly when executed is downstream, and often depends on context outside the model's control.
For webapp-testing, we verify the test code is well-formed: syntax valid, assertions present, async methods awaited. Whether those tests pass depends on the application under test. That is outside CAPE's scope.
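A sketch of what that looks like, assuming the generated tests are Python (for example Playwright for Python); the shipped policy may target other languages, and the unawaited-async check is omitted here because it requires knowing which APIs return coroutines.

```python
# Sketch of two webapp-testing predicates over Python test source, using the ast module.
import ast

def syntax_valid(source: str) -> bool:
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def tests_without_assertions_count(source: str) -> int:
    """Count test functions that contain no assert statement at all."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name.startswith("test"):
            if not any(isinstance(n, ast.Assert) for n in ast.walk(node)):
                count += 1
    return count
```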
For artifacts-builder, we verify valid JSX, correct hook usage, resolved imports. Whether the component renders beautifully depends on runtime context. But if our static checks pass, the structural reasons it would fail to render are ruled out.
Why This Matters
Skills are portable in theory, just markdown files. But they require each platform to implement runtime loading, and they provide no verification. Anthropic has skill loading. OpenAI just added it. But neither verifies that the model actually followed the skill correctly.
CAPE policies work differently. They define what the skill requires as executable predicates. This enables two things Skills cannot do:
Verification at inference. Run any model's output through the policy. Know immediately whether it satisfies the skill requirements. Correct violations before returning to the user.
Training synthesis. Use verified outputs as training data. The model learns what correct skill execution looks like through direct supervision, not preference optimization.
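A sketch of the first of these, the inference-time loop: generate, verify against the policy, ask for a correction of the named violations, and re-verify before returning. The `generate`, `verify`, and `correct` callables are placeholders for your model calls and policy pack.

```python
# Sketch of inference-time enforcement; all three callables are placeholders.
from typing import Callable

def enforce(
    prompt: str,
    generate: Callable[[str], str],
    verify: Callable[[str], list[str]],               # returns names of violated predicates
    correct: Callable[[str, str, list[str]], str],    # (prompt, output, violations) -> new output
    max_attempts: int = 3,
) -> str:
    output = generate(prompt)
    for _ in range(max_attempts):
        violations = verify(output)
        if not violations:
            return output                             # satisfies the skill's requirements
        output = correct(prompt, output, violations)  # targeted fix, then re-verify
    raise RuntimeError(f"policy not satisfied after {max_attempts} attempts: {violations}")
```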
| | Skills | CAPE Policies |
|---|---|---|
| Layer | Runtime only | Inference + Training |
| Mechanism | Instructions in context | Executable specifications |
| Verification | None | Deterministic predicates |
| Correction | None | Automatic, re-verified |
| Training path | Collect examples, hope | Generate verified data, supervise |
The Flywheel
CAPE creates a compounding loop that Skills cannot match:
- Deploy with CAPE policies executing at inference on a frontier model
- Verify every output against your specifications
- Correct violations automatically
- Collect verified outputs as training data (no annotation cost)
- Train an owned model on verified-correct examples
- Deploy the owned model, continue verifying
- Repeat: the model improves, verification catches edge cases, training data grows
Your inference deployment funds your training dataset. Your policies work at both layers. The capability compounds.
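A sketch of step 4 of the loop, the data-collection half; the record shape and file name are illustrative.

```python
# Sketch: only verified-correct outputs are appended to the training set, so no annotation is needed.
import json

def collect_if_verified(prompt: str, output: str, violations: list[str],
                        dataset_path: str = "verified_skill_data.jsonl") -> bool:
    if violations:
        return False
    with open(dataset_path, "a") as f:
        f.write(json.dumps({"prompt": prompt, "completion": output}) + "\n")
    return True
```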
Get the Policies
The full policy pack for all 16 Anthropic skills is available in our repository.
Each skill includes:
- policy.cpl: The executable specification
- MAPPING.md: Audit trail from skill guidance to predicates
- ASSUMPTIONS.md: What we fixed to make it verifiable, with configuration profiles for different contexts