ARES, explained properly
Agentic Readiness Evaluation Score · what the repo actually does, where the cleverness is, and where the heuristics stop being magic
ARES is a codebase-readiness evaluator for AI coding agents. It tries to answer one practical question: if you drop Claude Code into this repository, how likely is it to understand the system, find the right commands, stay inside safe change boundaries, and finish normal engineering work without constant human rescue?
Primary surface: a Claude Code skill installed as /ares (at ~/.claude/skills/ares). Secondary surface: a deterministic local scanner for baseline structural scoring.

Most repo health tools measure generic engineering hygiene: linting, tests, CI, maybe code smells. ARES shifts the question. It measures whether a codebase is operable by an AI agent.
That makes it less about elegance and more about friction: can the agent find the right files, understand the boundaries, discover safe commands, validate a change, and recover when things go wrong?
So the repo is really building an AI-operations rubric for software projects, not another lint pass wearing a trench coat.
Two artifacts per review
- a short in-chat verdict inside Claude Code
- a full markdown report written into the target repo, with scored categories, risks, likely failure modes, and first fixes (sketched below)
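For a concrete picture, here is a hypothetical skeleton of that report as a small Node snippet. The section names follow the bullets above; the repo's actual layout may differ.

```js
// Hypothetical skeleton of the report ARES writes into the target repo.
// Section names follow the bullets above; the real layout may differ.
const reportSkeleton = (repo, score) => `# ARES Report: ${repo}

**Overall:** ${score}/100 (category weights vary by repo type)

## Category scores
| Code | Score | Evidence |
|---|---|---|
| ENV | 5/10 | README documents install but not test |

## Risks and likely failure modes
- Agent guesses npm scripts and edits the wrong entrypoint

## First fixes
- Add a CLAUDE.md with install/test/run commands
`;

console.log(reportSkeleton("example-app", 62));
```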
ARES is strongest when you think of it as skill-driven judgment with a scanner-backed baseline, not as a single monolithic scoring engine.
Judgment engine
skills/ares/SKILL.md is the product surface. It tells Claude not to rely on filenames alone, not to run repo-controlled commands, and not to inspect secret-bearing files. It forces evidence-backed scoring and report writing.
Structural baseline
src/scanner.mjs + src/analyzers/*.mjs do deterministic heuristics: file counts, test ratios, lockfiles, CI markers, TS strict mode, giant files, catch-all dirs, agent docs, validator libs, and so on.
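To make "deterministic heuristics" concrete, here is a minimal sketch in the spirit of those analyzers. The function name and scoring weights are invented for illustration; only filesystem presence checks are involved, never execution.

```js
import { existsSync } from "node:fs";
import { join } from "node:path";

// Illustrative analyzer in the spirit of src/analyzers/*.mjs:
// purely deterministic, filesystem-only, no code execution.
export function analyzeOperability(repoRoot) {
  const has = (p) => existsSync(join(repoRoot, p));
  const signals = {
    lockfile: ["package-lock.json", "pnpm-lock.yaml", "yarn.lock"].some(has),
    ci: [".github/workflows", ".gitlab-ci.yml", ".circleci"].some(has),
    agentDocs: ["CLAUDE.md", "AGENTS.md"].some(has),
  };
  // Each marker contributes equally here; the real weighting is the repo's.
  const score = Object.values(signals).filter(Boolean).length / 3;
  return { signals, score };
}
```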
The split is honest: the scanner detects signals (including the presence of agent docs like CLAUDE.md), and the skill does the judging. Good. If they had tried to make heuristics pretend to be judgment, the whole thing would collapse into score theater.

| Code | Category | What it asks | Why it matters for agents |
|---|---|---|---|
| MRC | Context & Intent | Can the agent form a correct mental model from docs and guidance? | High leverage |
| NAV | Navigability & Discoverability | Can it find the right files without thrashing? | structure, naming, file size, entrypoints |
| TSC | Contracts & Explicitness | Are interfaces typed and explicit enough to change safely? | schemas, types, validators, fewer implicit assumptions |
| TEST | Validation Infrastructure | Can changes be checked credibly? | High leverage |
| ENV | Local Operability | Can the agent discover how to run and verify the repo locally? | High leverage |
| MOD | Change Boundaries & Modularity | Can it make focused changes without huge blast radius? | High leverage |
| CON | Conventions & Example Density | Are there stable patterns the agent can imitate? | Lower leverage |
| ERR | Diagnostics & Recoverability | When things fail, does the repo tell you why? | logs, unhappy-path tests, observability |
| CICD | Automated Feedback Loops | Is there reliable non-local validation? | CI gates and automated review pressure |
| AGT | Agent Guidance & Guardrails | Was the repo prepared for AI agents on purpose? | High leverage |
ARES changes category weights by repo type: CLI, library, app, service, monorepo. That means a tiny CLI is not judged like a sprawling monorepo. A surprisingly sane choice.
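Mechanically, that amounts to a weight table per repo type, keyed by the category codes above. A sketch with invented numbers; the real weights live in the repo, not here:

```js
// Invented weights for illustration; ARES's real tables will differ.
const WEIGHTS = {
  cli:      { MRC: 1.0, NAV: 0.8, TSC: 0.8, TEST: 1.2, ENV: 1.5, MOD: 0.6, CON: 0.6, ERR: 1.0, CICD: 0.8, AGT: 1.2 },
  monorepo: { MRC: 1.2, NAV: 1.5, TSC: 1.0, TEST: 1.2, ENV: 1.0, MOD: 1.5, CON: 0.8, ERR: 1.0, CICD: 1.2, AGT: 1.2 },
};

// Weighted average of per-category scores (0–10 each), normalized to 0–100.
function weightedScore(type, scores) {
  const w = WEIGHTS[type];
  let total = 0, mass = 0;
  for (const [code, weight] of Object.entries(w)) {
    total += (scores[code] ?? 0) * weight;
    mass += weight;
  }
  return (total / mass) * 10;
}
```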
If key conditions are weak — no install/test path, weak validation loop, no CI in a non-trivial repo, missing agent guidance, bad change boundaries — the overall score gets capped. That prevents fake-high scores from superficial polish.
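Conceptually the cap is a ceiling applied after aggregation; a sketch under assumed conditions and thresholds (the skill defines the real ones):

```js
// Invented cap rules for illustration.
function applyCaps(score, repo) {
  const caps = [];
  if (!repo.hasInstallAndTestPath) caps.push(40);
  if (!repo.hasCi && !repo.isTrivial) caps.push(60);
  if (!repo.hasAgentGuidance) caps.push(70);
  // The overall score can never exceed the lowest triggered ceiling,
  // so superficial polish cannot buy a fake-high score.
  return Math.min(score, ...caps, 100);
}
```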
Deterministic scanner
- README exists? nice.
- Tests exist? maybe.
- CI exists? check.
- Score comes from visible repo markers.
- Fast and repeatable, but shallow.
Claude skill workflow
- Claude must inspect actual files, not just filenames.
- Mandatory evidence set prevents lazy vibes-based scoring (see the sketch after this list).
- Secret-bearing files are intentionally excluded.
- Rubric anchors + caps constrain the model.
- Result is slower, but much closer to the real question: could an agent survive here?
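A hypothetical shape of one such evidence-backed category judgment (not the skill's literal output format):

```js
// Hypothetical shape of one category judgment in the report pipeline.
const envJudgment = {
  code: "ENV",
  score: 4,
  evidence: [
    { file: "package.json", note: "test script exists but references a missing config" },
    { file: "README.md", note: "setup section stops after `npm install`" },
  ],
  // Rubric anchors force the score to map to observed facts, not vibes.
  anchor: "Commands are discoverable but not verifiably complete.",
};
```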
Still pattern-based
Many analyzers are smart grep plus repo-shape inference: giant files, utils/, TS strict mode, lockfiles, validator deps, test-to-source ratios. Useful, but nowhere near deep semantic understanding.
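For instance, the test-to-source ratio and giant-file checks reduce to a directory walk plus counting. A sketch, with the size threshold invented here:

```js
import { readdirSync, statSync } from "node:fs";
import { join, extname } from "node:path";

// Recursively collect files, skipping the usual noise directories.
function walk(dir, out = []) {
  for (const name of readdirSync(dir)) {
    if (name === "node_modules" || name === ".git") continue;
    const p = join(dir, name);
    if (statSync(p).isDirectory()) walk(p, out);
    else out.push(p);
  }
  return out;
}

// Repo-shape heuristics: no parsing, no semantics, just structure.
function shapeSignals(root) {
  const files = walk(root).filter((f) => [".js", ".mjs", ".ts"].includes(extname(f)));
  const tests = files.filter((f) => /\.(test|spec)\./.test(f) || f.includes("/tests/"));
  const giants = files.filter((f) => statSync(f).size > 50_000); // invented threshold
  return {
    testToSourceRatio: tests.length / Math.max(1, files.length - tests.length),
    giantFiles: giants.length,
  };
}
```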
Read-first means no runtime truth
/ares intentionally avoids running the repo. Safer, yes. But that also means it cannot verify whether the documented commands actually work. It measures discoverability and apparent operability, not actual runtime health.
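The gap is easy to see in code: reading package.json scripts tells you a test path is advertised, not that it works. A sketch of such a discoverability check (function name assumed, not from the repo):

```js
import { readFileSync } from "node:fs";

// Discoverability check: does the repo *advertise* an operability path?
// Whether `npm test` actually passes is runtime truth, which read-first
// evaluation deliberately never observes.
function apparentOperability(pkgPath) {
  const pkg = JSON.parse(readFileSync(pkgPath, "utf8"));
  const scripts = pkg.scripts ?? {};
  return {
    hasTest: "test" in scripts,
    hasBuild: "build" in scripts,
    hasStart: "start" in scripts || "dev" in scripts,
  };
}
```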
Examples of where the scanner can over- or under-estimate reality
- a repo can have a beautiful README and still be operationally broken
- a messy repo can work fine if the maintainers know its traps but never documented them
- a high test count can still hide brittle or irrelevant coverage
- modularity inference from file boundaries is useful, but not the same as understanding coupling
ARES is not a linter, not a traditional static analyzer, and not generic repo scoring. It is a practical attempt to measure whether AI coding agents can operate safely and effectively in a codebase.
Its best design decision is the split:
- Deterministic scanner: consistent, cheap, explainable structural signals
- Claude skill workflow: evidence-backed reasoning with explicit rules, caps, and reporting discipline
That architecture is the reason the repo is interesting. Without the scanner, the workflow is vague. Without the judgment layer, the score is fake precision. Together, it’s actually useful.