Turning Anthropic's Skills Into Executable Specifications

December 2025

Two months ago, Anthropic introduced Agent Skills: folders of instructions that Claude loads dynamically to perform specialized tasks. Last week, OpenAI quietly adopted the same format in ChatGPT and Codex CLI. Skills are becoming the standard for runtime specialization.

But Skills only work at inference. They are instructions injected into the context window, sophisticated prompt engineering with file system access. The model reads them, hopefully follows them, and you observe what happens.

This raises two questions:

  1. How do you verify a model is actually following a skill?
  2. How do you train a model to execute a skill reliably?

Skills answer neither. CAPE answers both.

The Gap

Today, if you want a model to be reliably good at a skill, your options are limited:

Prompt harder. Write better SKILL.md files. This is literally what Skills are. Works sometimes. Fails unpredictably. No verification.

Fine-tune on examples. Collect demonstrations, run SFT, hope the model generalizes. But you have no specification of what "correct" means, just examples you hope capture it.

RLHF with skill-specific feedback. Have humans rate outputs. But preferences are not correctness. Raters cannot verify complex constraints, and you are paying per example, per skill, forever.

None of these let you say: "Here is what this skill requires. Verify every output. Correct violations. Train until the model satisfies these requirements by default."

The Conversion

We took all 16 skills from Anthropic's official repository and converted them to CAPE policy packs.

The result: executable specifications that define what correct skill execution looks like, not as instructions the model might follow, but as predicates the model must satisfy.

Anthropic Skills          | CAPE Policies
--------------------------|--------------------------------------------------
Instructions at inference | Specifications verified at inference or training
Hope model follows        | Verify every output, correct violations
Observe what happens      | Know exactly what failed and why
Prompt engineering        | Capability engineering

What We Found

Seven of Anthropic's 16 skills, about 44%, have clear, verifiable constraints out of the box:

Skill             | Verifiable Constraints
------------------|----------------------------------------------------------------------------
pdf               | Valid PDF structure, form fields typed correctly, merge page count preserved
docx              | Valid OOXML, tracked changes have metadata, content preserved on edit
xlsx              | Formulas parse, data types match declarations, schema compliant
pptx              | Valid structure, layouts applied, media references resolve
slack-gif-creator | File size ≤ 64KB, dimensions 128x128, frame count > 1
webapp-testing    | Valid syntax, assertions present, async methods awaited
artifacts-builder | Valid JSX, imports resolve, hooks rules followed

These map almost directly to predicates. The slack-gif-creator skill even includes size and dimension requirements in the text. We just formalized them.
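For illustration, here is roughly what those three slack-gif-creator predicates could look like as static checks in Python, assuming Pillow is used to inspect the GIF. The function names mirror the policy table but are illustrative, not the shipped implementation.

```python
# Illustrative sketch of the slack-gif-creator predicates, assuming Pillow
# is available for static GIF inspection. Not the actual CAPE implementation.
import os
from PIL import Image

MAX_BYTES = 64 * 1024  # the 64KB limit stated in the skill text

def file_size_ok(path: str) -> bool:
    # Static check: read the size from the filesystem, never upload anywhere.
    return os.path.getsize(path) <= MAX_BYTES

def dimensions_ok(path: str) -> bool:
    with Image.open(path) as im:
        return im.size == (128, 128)

def is_animated(path: str) -> bool:
    # Non-animated images have no n_frames attribute; treat them as 1 frame.
    with Image.open(path) as im:
        return getattr(im, "n_frames", 1) > 1

def verify_gif(path: str) -> dict:
    # Per-predicate results, so a violation report can say exactly what failed.
    return {
        "file_size_bytes <= 65536": file_size_ok(path),
        "dimensions == 128x128": dimensions_ok(path),
        "frame_count > 1": is_animated(path),
    }
```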

The other nine, about 56%, contain subjective guidance: "avoid generic AI slop," "pick a BOLD aesthetic direction," "write clear communications." These seem unverifiable.

But as we argue in our CAPE research paper, there are no unverifiable skills, only underspecified ones.

Making the Subjective Objective

Every vague requirement becomes verifiable once you fix your context and name your assumptions. The question is not "is this verifiable?" but "what assumptions make this verifiable?"

Take frontend-design. The skill says to create "distinctive, production-grade interfaces" and "avoid generic AI slop." Sounds subjective. But what does "AI slop" actually look like structurally?

We fixed it:

Original Guidance       | Our Assumption                                | Predicate
------------------------|-----------------------------------------------|--------------------------------------
"avoid generic AI slop" | Slop = uniformity; good design = variety      | unique_border_radius_count(css) >= 3
"avoid generic AI slop" | Intentional palette, not too few or too many  | unique_color_count(css) in [4, 12]
"avoid generic AI slop" | Typographic restraint                         | font_family_count(css) <= 3
"production-grade"      | Accessible by default                         | min_color_contrast(output) >= 4.5
"production-grade"      | Works on multiple devices                     | breakpoint_count(output) >= 2

All verification is static: parsing CSS, counting unique values, calculating contrast ratios. No browser, no Lighthouse, no execution.
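To make that concrete, here is a rough Python sketch of how counting and contrast predicates like these could be computed from raw CSS text. The regexes and the hex-only color handling are simplifying assumptions for illustration, not the shipped policy code.

```python
# Rough sketch of count-based CSS predicates using only regular expressions.
# Real CSS has more edge cases (shorthands, variables, rgb()/hsl() functions);
# this is an illustrative approximation.
import re

def unique_border_radius_count(css: str) -> int:
    values = re.findall(r"border-radius\s*:\s*([^;}]+)", css, re.IGNORECASE)
    return len({v.strip() for v in values})

def unique_color_count(css: str) -> int:
    # Counts hex colors only; named colors and rgb()/hsl() are ignored here.
    hexes = re.findall(r"#[0-9a-fA-F]{3,8}\b", css)
    return len({h.lower() for h in hexes})

def font_family_count(css: str) -> int:
    families = set()
    for decl in re.findall(r"font-family\s*:\s*([^;}]+)", css, re.IGNORECASE):
        # Keep the primary family, ignore the fallback stack.
        families.add(decl.split(",")[0].strip().strip("'\""))
    return len(families)

def breakpoint_count(css: str) -> int:
    return len(set(re.findall(r"@media[^{]+", css, re.IGNORECASE)))

def contrast_ratio(hex_fg: str, hex_bg: str) -> float:
    # WCAG 2.x contrast ratio between two 6-digit hex colors.
    def luminance(hex_color: str) -> float:
        h = hex_color.lstrip("#")
        r, g, b = (int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))
        def lin(c: float) -> float:
            return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b)
    hi, lo = sorted((luminance(hex_fg), luminance(hex_bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)
```

Each function is pure and runs on the text of the stylesheet, which is what makes the checks cheap enough to run on every output.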

Are these the right assumptions? That depends on your context. But they are explicit assumptions, visible, adjustable, verifiable. And they are configurable:

High-end agency work:

  • min_border_radius_variety: 4
  • min_color_count: 6
  • max_font_families: 2

Rapid prototyping:

  • min_border_radius_variety: 1
  • min_color_count: 3
  • max_font_families: 4

The policy does not define quality. You do. The predicates verify your definition objectively.
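One way to express that configurability, sketched in Python with the two profiles above. The metric names come from the bullets; the measurement functions are assumed to exist (for example, the counting sketch earlier in this post).

```python
# Hypothetical threshold profiles matching the bullets above. The same
# predicate logic runs everywhere; only the thresholds change per context.
PROFILES = {
    "high_end_agency":   {"min_border_radius_variety": 4, "min_color_count": 6, "max_font_families": 2},
    "rapid_prototyping": {"min_border_radius_variety": 1, "min_color_count": 3, "max_font_families": 4},
}

def check_design(metrics: dict[str, int], profile: str) -> list[str]:
    # `metrics` holds measured values, e.g.
    # {"border_radius_variety": 2, "color_count": 5, "font_families": 3}
    t = PROFILES[profile]
    violations = []
    if metrics["border_radius_variety"] < t["min_border_radius_variety"]:
        violations.append("border-radius variety below threshold")
    if metrics["color_count"] < t["min_color_count"]:
        violations.append("color palette smaller than required")
    if metrics["font_families"] > t["max_font_families"]:
        violations.append("too many font families")
    return violations
```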

With this approach, 100% of skills become specifiable.

The Full Conversion

Skill             | Key Predicates                                                                               | Assumptions
------------------|----------------------------------------------------------------------------------------------|------------
docx              | is_zip_archive(), xml_schema_valid(), text_content_preserved()                               | None
pdf               | starts_with_pdf_header(), pdf_parser_succeeds(), form_field_mismatch_count() == 0            | None
pptx              | xml_schema_valid(), malformed_slide_count() == 0, missing_media_count() == 0                 | None
xlsx              | xml_schema_valid(), formula_syntax_error_count() == 0, type_mismatch_count() == 0            | None
slack-gif-creator | file_size_bytes() <= 65536, width() == 128, frame_count() > 1                                | None
webapp-testing    | syntax_valid(), tests_without_assertions_count() == 0, unawaited_async_count() == 0          | None
artifacts-builder | valid_jsx_syntax(), hooks_rules_violations() == 0, has_default_export()                      | None
mcp-builder       | manifest_schema_valid(), tools_without_handlers_count() == 0, hardcoded_secret_count() == 0  | Minimal
skill-creator     | has_yaml_frontmatter(), frontmatter_name() == directory_name(), description_length() >= 50   | Light
theme-factory     | min_text_contrast_ratio() >= 4.5, has_dark_mode_variant(), spacing_scale_adherence() >= 0.8  | Light
brand-guidelines  | unauthorized_color_count() == 0, unauthorized_font_count() == 0, heading_sizes_descending()  | Light
frontend-design   | unique_border_radius_count() >= 3, min_color_contrast() >= 4.5, font_family_count() <= 3     | Medium
internal-comms    | flesch_kincaid_grade() <= 10, slang_match_count() == 0, sentiment_score() >= 0               | Medium
canvas-design     | format() in allowed_formats, content_coverage() >= 0.05, dominant_color_ratio() >= 0.1       | Medium
algorithmic-art   | valid_p5_syntax(), calls_random_seed(), color_function_count() >= 3                          | Medium
doc-coauthoring   | word_count() >= 100, section_count() >= 2, heading_hierarchy_valid()                         | Medium

Every policy pack includes an ASSUMPTIONS.md documenting what we fixed. Adopt them as-is, or adjust to your context.
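As a concrete example of a light-assumption row, here is a sketch of the skill-creator checks run against a SKILL.md file. It assumes PyYAML for frontmatter parsing and is illustrative rather than the shipped policy.

```python
# Illustrative check of the skill-creator predicates against a SKILL.md file.
# Assumes PyYAML for frontmatter parsing; treat this as a sketch, not the
# shipped policy.
from pathlib import Path
import yaml

def load_frontmatter(skill_md: Path) -> dict | None:
    text = skill_md.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return None
    parts = text.split("---", 2)
    if len(parts) < 3:
        return None
    return yaml.safe_load(parts[1]) or {}

def verify_skill_dir(skill_dir: Path) -> dict:
    fm = load_frontmatter(skill_dir / "SKILL.md")
    return {
        "has_yaml_frontmatter": fm is not None,
        "frontmatter_name == directory_name":
            bool(fm) and fm.get("name") == skill_dir.name,
        "description_length >= 50":
            bool(fm) and len(fm.get("description", "")) >= 50,
    }
```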

Static Verification Only

A key principle: all predicates use static analysis of model outputs. No execution, no rendering, no external services.

Static (what we use)         | Execution (what we avoid)
-----------------------------|--------------------------
Parse file structure         | Run code and check output
Validate XML schema          | Open file in application
Check AST for patterns       | Execute tests
Count and measure properties | Call external APIs
Calculate contrast ratios    | Render and screenshot

This matters because CAPE verifies outputs, not behaviors. The model produces artifacts (code, documents, images). We verify those artifacts are structurally correct. Whether they behave correctly when executed is downstream, and often depends on context outside the model's control.
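For the document formats, "parse file structure" and "validate XML" need nothing beyond the standard library. A sketch for an OOXML container such as .docx, checking well-formedness rather than full schema validity:

```python
# Sketch of structural checks for an OOXML container such as .docx, using
# only the standard library. Checks well-formedness, not the full OOXML schema.
import zipfile
import xml.etree.ElementTree as ET

def is_zip_archive(path: str) -> bool:
    return zipfile.is_zipfile(path)

def xml_parts_well_formed(path: str) -> bool:
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                try:
                    ET.fromstring(zf.read(name))
                except ET.ParseError:
                    return False
    return True

def has_main_document_part(path: str) -> bool:
    # word/document.xml is the main content part of a .docx package.
    with zipfile.ZipFile(path) as zf:
        return "word/document.xml" in zf.namelist()
```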

For webapp-testing, we verify the test code is well-formed: syntax valid, assertions present, async methods awaited. Whether those tests pass depends on the application under test. That is outside CAPE's scope.
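If the test file happens to be Python, the first two checks fall straight out of the ast module. The unawaited-async check below is a name-based heuristic and purely illustrative, since a real policy would need to know which calls actually return coroutines.

```python
# Sketch of static checks on a Python test file using the ast module.
# syntax_valid and the assertion check are direct; the unawaited-async check
# is a heuristic over an assumed set of method names.
import ast

ASYNC_SUSPECTS = {"click", "goto", "fill", "wait_for_selector"}  # assumed names

def syntax_valid(source: str) -> bool:
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def tests_without_assertions(source: str) -> list[str]:
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) \
                and node.name.startswith("test_"):
            if not any(isinstance(n, ast.Assert) for n in ast.walk(node)):
                missing.append(node.name)
    return missing

def unawaited_async_calls(source: str) -> int:
    count = 0
    for node in ast.walk(ast.parse(source)):
        # A call used as a bare statement whose method name looks async and
        # which is not wrapped in `await`.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            if name in ASYNC_SUSPECTS:
                count += 1
    return count
```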

For artifacts-builder, we verify valid JSX, correct hook usage, resolved imports. Whether the component renders beautifully depends on runtime context. But if our static checks pass, it will render.

Why This Matters

Skills are portable in theory: they are just markdown files. But they require each platform to implement runtime loading, and they provide no verification. Anthropic has skill loading. OpenAI just added it. Neither verifies that the model actually followed the skill correctly.

CAPE policies work differently. They define what the skill requires as executable predicates. This enables two things Skills cannot do:

Verification at inference. Run any model's output through the policy. Know immediately whether it satisfies the skill requirements. Correct violations before returning to the user.

Training synthesis. Use verified outputs as training data. The model learns what correct skill execution looks like through direct supervision, not preference optimization.
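At inference time this can be as simple as a gate that runs every predicate over the candidate output and either returns it or feeds the violations back for another attempt. A hypothetical sketch; the names are illustrative, not the CAPE API:

```python
# Hypothetical inference-time gate: run an output through a policy's
# predicates, report violations, and retry with the report as feedback.
from typing import Callable

# A predicate is a (name, check) pair; the check runs statically on the output text.
Predicate = tuple[str, Callable[[str], bool]]

def verify(output: str, predicates: list[Predicate]) -> list[str]:
    # Names of every predicate the output violates.
    return [name for name, check in predicates if not check(output)]

def generate_verified(prompt: str,
                      generate: Callable[[str], str],
                      predicates: list[Predicate],
                      max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        output = generate(prompt + feedback)
        violations = verify(output, predicates)
        if not violations:
            return output
        # Hand the exact failed predicates back so the next attempt can correct them.
        feedback = "\n\nFix these violations: " + "; ".join(violations)
    raise RuntimeError("output still violates: " + "; ".join(violations))
```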

              | Skills                  | CAPE Policies
--------------|-------------------------|----------------------------------
Layer         | Runtime only            | Inference + Training
Mechanism     | Instructions in context | Executable specifications
Verification  | None                    | Deterministic predicates
Correction    | None                    | Automatic, re-verified
Training path | Collect examples, hope  | Generate verified data, supervise

The Flywheel

CAPE creates a compounding loop that Skills cannot replicate:

  1. Deploy with CAPE policies executing at inference on a frontier model
  2. Verify every output against your specifications
  3. Correct violations automatically
  4. Collect verified outputs as training data (no annotation cost)
  5. Train an owned model on verified-correct examples
  6. Deploy the owned model, continue verifying
  7. Repeat: the model improves, verification catches edge cases, training data grows

Your inference deployment funds your training dataset. Your policies work at both layers. The capability compounds.
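Step 4 of the loop is mechanical: anything that passes verification is appended to the training set. A minimal sketch, assuming prompt/output pairs stored as JSONL; the record format is an assumption, not a prescribed schema.

```python
# Minimal sketch of step 4: collect verified outputs as supervised training
# data. Assumes a JSONL file of prompt/output pairs; format is illustrative.
import json

def collect_verified(prompt: str, output: str, violations: list[str],
                     dataset_path: str = "verified_skill_data.jsonl") -> bool:
    # Only verified-correct outputs become training examples.
    if violations:
        return False
    record = {"prompt": prompt, "completion": output}
    with open(dataset_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```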

Get the Policies

The full policy pack for all 16 Anthropic skills is available in our repository.

Each skill includes:

  • policy.cpl: The executable specification
  • MAPPING.md: Audit trail from skill guidance to predicates
  • ASSUMPTIONS.md: What we fixed to make it verifiable, with configuration profiles for different contexts