Turning Anthropic's Skills Into Executable Specifications

December 2025

Two months ago, Anthropic introduced Agent Skills: folders of instructions that Claude loads dynamically to perform specialized tasks. Last week, OpenAI quietly adopted the same format in ChatGPT and Codex CLI. Skills are becoming the standard for runtime specialization.

But Skills only work at inference. They are instructions injected into the context window, sophisticated prompt engineering with file system access. The model reads them, hopefully follows them, and you observe what happens.

This raises two questions:

  1. How do you verify a model is actually following a skill?
  2. How do you train a model to execute a skill reliably?

Skills answer neither. CAPE answers both.

The Gap

Today, if you want a model to be reliably good at a skill, your options are limited:

Prompt harder. Write better SKILL.md files. This is literally what Skills are. Works sometimes. Fails unpredictably. No verification.

Fine-tune on examples. Collect demonstrations, run SFT, hope the model generalizes. But you have no specification of what "correct" means, just examples you hope capture it.

RLHF with skill-specific feedback. Have humans rate outputs. But preferences are not correctness. Raters cannot verify complex constraints, and you are paying per example, per skill, forever.

None of these let you say: "Here is what this skill requires. Verify every output. Correct violations. Train until the model satisfies these requirements by default."

The Conversion

We took all 16 skills from Anthropic's official repository and converted them to CAPE policy packs.

The result: executable specifications that define what correct skill execution looks like, not as instructions the model might follow, but as predicates the model must satisfy.

Anthropic Skills          | CAPE Policies
--------------------------|--------------------------------------------------
Instructions at inference | Specifications verified at inference or training
Hope model follows        | Verify every output, correct violations
Observe what happens      | Know exactly what failed and why
Prompt engineering        | Capability engineering

What We Found

Seven of Anthropic's 16 skills, about 44%, have clear, verifiable constraints out of the box:

Skill             | Verifiable Constraints
------------------|----------------------------------------------------------------------------
pdf               | Valid PDF structure, form fields typed correctly, merge page count preserved
docx              | Valid OOXML, tracked changes have metadata, content preserved on edit
xlsx              | Formulas parse, data types match declarations, schema compliant
pptx              | Valid structure, layouts applied, media references resolve
slack-gif-creator | File size ≤ 64KB, dimensions 128x128, frame count > 1
webapp-testing    | Valid syntax, assertions present, async methods awaited
artifacts-builder | Valid JSX, imports resolve, hooks rules followed

These map almost directly to predicates. The slack-gif-creator skill even includes size and dimension requirements in the text. We just formalized them.
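For illustration, here is roughly what those three slack-gif-creator predicates could look like as static checks in Python, assuming Pillow is used to inspect the GIF. The function names mirror the policy table but are illustrative, not the shipped implementation.

```python
# Illustrative sketch of the slack-gif-creator predicates, assuming Pillow
# is available for static GIF inspection. Not the actual CAPE implementation.
import os
from PIL import Image

MAX_BYTES = 64 * 1024  # the 64KB limit stated in the skill text

def file_size_ok(path: str) -> bool:
    # Static check: read the size from the filesystem, never upload anywhere.
    return os.path.getsize(path) <= MAX_BYTES

def dimensions_ok(path: str) -> bool:
    with Image.open(path) as im:
        return im.size == (128, 128)

def is_animated(path: str) -> bool:
    # Non-animated images have no n_frames attribute; treat them as 1 frame.
    with Image.open(path) as im:
        return getattr(im, "n_frames", 1) > 1

def verify_gif(path: str) -> dict:
    # Per-predicate results, so a violation report can say exactly what failed.
    return {
        "file_size_bytes <= 65536": file_size_ok(path),
        "dimensions == 128x128": dimensions_ok(path),
        "frame_count > 1": is_animated(path),
    }
```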

The other nine, about 56%, contain subjective guidance: "avoid generic AI slop," "pick a BOLD aesthetic direction," "write clear communications." These seem unverifiable.

But as we argue in our CAPE research paper, there are no unverifiable skills, only underspecified ones.

Making the Subjective Objective

Every vague requirement becomes verifiable once you fix your context and name your assumptions. The question is not "is this verifiable?" but "what assumptions make this verifiable?"

Take frontend-design. The skill says to create "distinctive, production-grade interfaces" and "avoid generic AI slop." Sounds subjective. But what does "AI slop" actually look like structurally?

We fixed it:

Original Guidance       | Our Assumption                                | Predicate
------------------------|-----------------------------------------------|--------------------------------------
"avoid generic AI slop" | Slop = uniformity; good design = variety      | unique_border_radius_count(css) >= 3
"avoid generic AI slop" | Intentional palette, not too few or too many  | unique_color_count(css) in [4, 12]
"avoid generic AI slop" | Typographic restraint                         | font_family_count(css) <= 3
"production-grade"      | Accessible by default                         | min_color_contrast(output) >= 4.5
"production-grade"      | Works on multiple devices                     | breakpoint_count(output) >= 2

All verification is static: parsing CSS, counting unique values, calculating contrast ratios. No browser, no Lighthouse, no execution.
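To make that concrete, here is a rough Python sketch of how counting and contrast predicates like these could be computed from raw CSS text. The regexes and the hex-only color handling are simplifying assumptions for illustration, not the shipped policy code.

```python
# Rough sketch of count-based CSS predicates using only regular expressions.
# Real CSS has more edge cases (shorthands, variables, rgb()/hsl() functions);
# this is an illustrative approximation.
import re

def unique_border_radius_count(css: str) -> int:
    values = re.findall(r"border-radius\s*:\s*([^;}]+)", css, re.IGNORECASE)
    return len({v.strip() for v in values})

def unique_color_count(css: str) -> int:
    # Counts hex colors only; named colors and rgb()/hsl() are ignored here.
    hexes = re.findall(r"#[0-9a-fA-F]{3,8}\b", css)
    return len({h.lower() for h in hexes})

def font_family_count(css: str) -> int:
    families = set()
    for decl in re.findall(r"font-family\s*:\s*([^;}]+)", css, re.IGNORECASE):
        # Keep the primary family, ignore the fallback stack.
        families.add(decl.split(",")[0].strip().strip("'\""))
    return len(families)

def breakpoint_count(css: str) -> int:
    return len(set(re.findall(r"@media[^{]+", css, re.IGNORECASE)))

def contrast_ratio(hex_fg: str, hex_bg: str) -> float:
    # WCAG 2.x contrast ratio between two 6-digit hex colors.
    def luminance(hex_color: str) -> float:
        h = hex_color.lstrip("#")
        r, g, b = (int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))
        def lin(c: float) -> float:
            return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b)
    hi, lo = sorted((luminance(hex_fg), luminance(hex_bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)
```

Each function is pure and runs on the text of the stylesheet, which is what makes the checks cheap enough to run on every output.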

Are these the right assumptions? That depends on your context. But they are explicit assumptions, visible, adjustable, verifiable. And they are configurable:

High-end agency work:

  • min_border_radius_variety: 4
  • min_color_count: 6
  • max_font_families: 2

Rapid prototyping:

  • min_border_radius_variety: 1
  • min_color_count: 3
  • max_font_families: 4

The policy does not define quality. You do. The predicates verify your definition objectively.
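One way to express that configurability, sketched in Python with the two profiles above. The metric names come from the bullets; the measurement functions are assumed to exist (for example, the counting sketch earlier in this post).

```python
# Hypothetical threshold profiles matching the bullets above. The same
# predicate logic runs everywhere; only the thresholds change per context.
PROFILES = {
    "high_end_agency":   {"min_border_radius_variety": 4, "min_color_count": 6, "max_font_families": 2},
    "rapid_prototyping": {"min_border_radius_variety": 1, "min_color_count": 3, "max_font_families": 4},
}

def check_design(metrics: dict[str, int], profile: str) -> list[str]:
    # `metrics` holds measured values, e.g.
    # {"border_radius_variety": 2, "color_count": 5, "font_families": 3}
    t = PROFILES[profile]
    violations = []
    if metrics["border_radius_variety"] < t["min_border_radius_variety"]:
        violations.append("border-radius variety below threshold")
    if metrics["color_count"] < t["min_color_count"]:
        violations.append("color palette smaller than required")
    if metrics["font_families"] > t["max_font_families"]:
        violations.append("too many font families")
    return violations
```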

With this approach, 100% of skills become specifiable.

The Full Conversion

Skill             | Key Predicates                                                                               | Assumptions
------------------|----------------------------------------------------------------------------------------------|------------
docx              | is_zip_archive(), xml_schema_valid(), text_content_preserved()                               | None
pdf               | starts_with_pdf_header(), pdf_parser_succeeds(), form_field_mismatch_count() == 0            | None
pptx              | xml_schema_valid(), malformed_slide_count() == 0, missing_media_count() == 0                 | None
xlsx              | xml_schema_valid(), formula_syntax_error_count() == 0, type_mismatch_count() == 0            | None
slack-gif-creator | file_size_bytes() <= 65536, width() == 128, frame_count() > 1                                | None
webapp-testing    | syntax_valid(), tests_without_assertions_count() == 0, unawaited_async_count() == 0          | None
artifacts-builder | valid_jsx_syntax(), hooks_rules_violations() == 0, has_default_export()                      | None
mcp-builder       | manifest_schema_valid(), tools_without_handlers_count() == 0, hardcoded_secret_count() == 0  | Minimal
skill-creator     | has_yaml_frontmatter(), frontmatter_name() == directory_name(), description_length() >= 50   | Light
theme-factory     | min_text_contrast_ratio() >= 4.5, has_dark_mode_variant(), spacing_scale_adherence() >= 0.8  | Light
brand-guidelines  | unauthorized_color_count() == 0, unauthorized_font_count() == 0, heading_sizes_descending()  | Light
frontend-design   | unique_border_radius_count() >= 3, min_color_contrast() >= 4.5, font_family_count() <= 3     | Medium
internal-comms    | flesch_kincaid_grade() <= 10, slang_match_count() == 0, sentiment_score() >= 0               | Medium
canvas-design     | format() in allowed_formats, content_coverage() >= 0.05, dominant_color_ratio() >= 0.1       | Medium
algorithmic-art   | valid_p5_syntax(), calls_random_seed(), color_function_count() >= 3                          | Medium
doc-coauthoring   | word_count() >= 100, section_count() >= 2, heading_hierarchy_valid()                         | Medium

Every policy pack includes an ASSUMPTIONS.md documenting what we fixed. Adopt them as-is, or adjust to your context.
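As a concrete example of a light-assumption row, here is a sketch of the skill-creator checks run against a SKILL.md file. It assumes PyYAML for frontmatter parsing and is illustrative rather than the shipped policy.

```python
# Illustrative check of the skill-creator predicates against a SKILL.md file.
# Assumes PyYAML for frontmatter parsing; treat this as a sketch, not the
# shipped policy.
from pathlib import Path
import yaml

def load_frontmatter(skill_md: Path) -> dict | None:
    text = skill_md.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return None
    parts = text.split("---", 2)
    if len(parts) < 3:
        return None
    return yaml.safe_load(parts[1]) or {}

def verify_skill_dir(skill_dir: Path) -> dict:
    fm = load_frontmatter(skill_dir / "SKILL.md")
    return {
        "has_yaml_frontmatter": fm is not None,
        "frontmatter_name == directory_name":
            bool(fm) and fm.get("name") == skill_dir.name,
        "description_length >= 50":
            bool(fm) and len(fm.get("description", "")) >= 50,
    }
```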

Static Verification Only

A key principle: all predicates use static analysis of model outputs. No execution, no rendering, no external services.

Static (what we use)         | Execution (what we avoid)
-----------------------------|--------------------------
Parse file structure         | Run code and check output
Validate XML schema          | Open file in application
Check AST for patterns       | Execute tests
Count and measure properties | Call external APIs
Calculate contrast ratios    | Render and screenshot

This matters because CAPE verifies outputs, not behaviors. The model produces artifacts (code, documents, images). We verify those artifacts are structurally correct. Whether they behave correctly when executed is downstream, and often depends on context outside the model's control.
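For the document formats, "parse file structure" and "validate XML" need nothing beyond the standard library. A sketch for an OOXML container such as .docx, checking well-formedness rather than full schema validity:

```python
# Sketch of structural checks for an OOXML container such as .docx, using
# only the standard library. Checks well-formedness, not the full OOXML schema.
import zipfile
import xml.etree.ElementTree as ET

def is_zip_archive(path: str) -> bool:
    return zipfile.is_zipfile(path)

def xml_parts_well_formed(path: str) -> bool:
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                try:
                    ET.fromstring(zf.read(name))
                except ET.ParseError:
                    return False
    return True

def has_main_document_part(path: str) -> bool:
    # word/document.xml is the main content part of a .docx package.
    with zipfile.ZipFile(path) as zf:
        return "word/document.xml" in zf.namelist()
```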

For webapp-testing, we verify the test code is well-formed: syntax valid, assertions present, async methods awaited. Whether those tests pass depends on the application under test. That is outside CAPE's scope.
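If the test file happens to be Python, the first two checks fall straight out of the ast module. The unawaited-async check below is a name-based heuristic and purely illustrative, since a real policy would need to know which calls actually return coroutines.

```python
# Sketch of static checks on a Python test file using the ast module.
# syntax_valid and the assertion check are direct; the unawaited-async check
# is a heuristic over an assumed set of method names.
import ast

ASYNC_SUSPECTS = {"click", "goto", "fill", "wait_for_selector"}  # assumed names

def syntax_valid(source: str) -> bool:
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def tests_without_assertions(source: str) -> list[str]:
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) \
                and node.name.startswith("test_"):
            if not any(isinstance(n, ast.Assert) for n in ast.walk(node)):
                missing.append(node.name)
    return missing

def unawaited_async_calls(source: str) -> int:
    count = 0
    for node in ast.walk(ast.parse(source)):
        # A call used as a bare statement whose method name looks async and
        # which is not wrapped in `await`.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            if name in ASYNC_SUSPECTS:
                count += 1
    return count
```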

For artifacts-builder, we verify valid JSX, correct hook usage, resolved imports. Whether the component renders beautifully depends on runtime context. But if our static checks pass, it will render.

Why This Matters

Skills are portable in theory: they are just markdown files. But they require each platform to implement runtime loading, and they provide no verification. Anthropic has skill loading. OpenAI just added it. Neither verifies that the model actually followed the skill correctly.

CAPE policies work differently. They define what the skill requires as executable predicates. This enables two things Skills cannot do:

Verification at inference. Run any model's output through the policy. Know immediately whether it satisfies the skill requirements. Correct violations before returning to the user.

Training synthesis. Use verified outputs as training data. The model learns what correct skill execution looks like through direct supervision, not preference optimization.
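At inference time this can be as simple as a gate that runs every predicate over the candidate output and either returns it or feeds the violations back for another attempt. A hypothetical sketch; the names are illustrative, not the CAPE API:

```python
# Hypothetical inference-time gate: run an output through a policy's
# predicates, report violations, and retry with the report as feedback.
from typing import Callable

# A predicate is a (name, check) pair; the check runs statically on the output text.
Predicate = tuple[str, Callable[[str], bool]]

def verify(output: str, predicates: list[Predicate]) -> list[str]:
    # Names of every predicate the output violates.
    return [name for name, check in predicates if not check(output)]

def generate_verified(prompt: str,
                      generate: Callable[[str], str],
                      predicates: list[Predicate],
                      max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        output = generate(prompt + feedback)
        violations = verify(output, predicates)
        if not violations:
            return output
        # Hand the exact failed predicates back so the next attempt can correct them.
        feedback = "\n\nFix these violations: " + "; ".join(violations)
    raise RuntimeError("output still violates: " + "; ".join(violations))
```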

              | Skills                  | CAPE Policies
--------------|-------------------------|----------------------------------
Layer         | Runtime only            | Inference + Training
Mechanism     | Instructions in context | Executable specifications
Verification  | None                    | Deterministic predicates
Correction    | None                    | Automatic, re-verified
Training path | Collect examples, hope  | Generate verified data, supervise

The Flywheel

CAPE creates a compounding loop that Skills cannot replicate:

  1. Deploy with CAPE policies executing at inference on a frontier model
  2. Verify every output against your specifications
  3. Correct violations automatically
  4. Collect verified outputs as training data (no annotation cost)
  5. Train an owned model on verified-correct examples
  6. Deploy the owned model, continue verifying
  7. Repeat: the model improves, verification catches edge cases, training data grows

Your inference deployment funds your training dataset. Your policies work at both layers. The capability compounds.
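Step 4 of the loop is mechanical: anything that passes verification is appended to the training set. A minimal sketch, assuming prompt/output pairs stored as JSONL; the record format is an assumption, not a prescribed schema.

```python
# Minimal sketch of step 4: collect verified outputs as supervised training
# data. Assumes a JSONL file of prompt/output pairs; format is illustrative.
import json

def collect_verified(prompt: str, output: str, violations: list[str],
                     dataset_path: str = "verified_skill_data.jsonl") -> bool:
    # Only verified-correct outputs become training examples.
    if violations:
        return False
    record = {"prompt": prompt, "completion": output}
    with open(dataset_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```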

Get the Policies

The full policy pack for all 16 Anthropic skills is available in our repository.

Each skill includes:

  • policy.cpl: The executable specification
  • MAPPING.md: Audit trail from skill guidance to predicates
  • ASSUMPTIONS.md: What we fixed to make it verifiable, with configuration profiles for different contexts