ARES,
explained properly

Agentic Readiness Evaluation Score · what the repo actually does, where the cleverness is, and where the heuristics stop being magic

ARES is a codebase-readiness evaluator for AI coding agents. It tries to answer one practical question: if you drop Claude Code into this repository, how likely is it to understand the system, find the right commands, stay inside safe change boundaries, and finish normal engineering work without constant human rescue?

Primary surface: a Claude Code skill installed as /ares. Secondary surface: a deterministic local scanner for baseline structural scoring.
At a glance:

  • 10 readiness categories — docs, tests, setup, modularity, guidance, and more
  • 2 assessment modes — judgment-based Claude skill + deterministic scanner
  • Main interface: /ares — installs into ~/.claude/skills/ares
  • Default posture: read-first — avoid running repo code; avoid leaking secrets
1 — What it is

Most repo health tools measure generic engineering hygiene: linting, tests, CI, maybe code smells. ARES shifts the question. It measures whether a codebase is operable by an AI agent.

That makes it less about elegance and more about friction: can the agent find the right files, understand the boundaries, discover safe commands, validate a change, and recover when things go wrong?

So the repo is really building an AI-operations rubric for software projects, not another lint pass wearing a trench coat.

Core output

Two artifacts per review:

  • a short in-chat verdict inside Claude Code
  • a full markdown report written into the target repo, containing scored categories, risks, likely failure modes, and first fixes
2 — Architecture
System shape

ARES is strongest when you think of it as skill-driven judgment with a scanner-backed baseline, not as a single monolithic scoring engine.

[Architecture diagram: Claude judgment layer → deterministic scanner layer → rubric gates / score caps → final output]
Claude skill layer

Judgment engine

skills/ares/SKILL.md is the product surface. It tells Claude not to rely on filenames alone, not to run repo-controlled commands, and not to inspect secret-bearing files. It forces evidence-backed scoring and report writing.

Scanner layer

Structural baseline

src/scanner.mjs + src/analyzers/*.mjs do deterministic heuristics: file counts, test ratios, lockfiles, CI markers, TS strict mode, giant files, catch-all dirs, agent docs, validator libs, and so on.
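To make the flavor of these heuristics concrete, here is a minimal sketch of what one analyzer could look like. This is an illustration only: the function name `analyzeValidation`, the input shape (a flat list of file paths), and the specific regexes are assumptions, not the repo's actual API.

```javascript
import path from "node:path";

// Hypothetical analyzer sketch: derives cheap structural signals
// (test ratio, lockfile, CI markers) from a list of repo file paths.
function analyzeValidation(files) {
  const isTest = (f) =>
    /\.(test|spec)\.[cm]?js$/.test(f) || f.includes("__tests__/");

  const sourceFiles = files.filter((f) => /\.[cm]?js$/.test(f) && !isTest(f));
  const testFiles = files.filter(isTest);

  const hasLockfile = files.some((f) =>
    ["package-lock.json", "pnpm-lock.yaml", "yarn.lock"].includes(path.basename(f))
  );
  const hasCI = files.some((f) => f.startsWith(".github/workflows/"));

  // The ratio is a signal, not a verdict: it says tests exist,
  // not that they make future edits safe.
  const testRatio = sourceFiles.length ? testFiles.length / sourceFiles.length : 0;

  return { testRatio, hasLockfile, hasCI };
}
```

Everything here is "smart grep" over repo shape, which is exactly why the repo treats it as a baseline rather than a verdict.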

The important design choice: the scanner is treated as a support tool, not the final truth. The repo says this explicitly in its own CLAUDE.md. Good. If they had tried to make heuristics pretend to be judgment, the whole thing would collapse into score theater.
3 — The rubric
| Code | Category | What it asks | Why it matters for agents |
|------|----------|--------------|---------------------------|
| MRC | Context & Intent | Can the agent form a correct mental model from docs and guidance? | High leverage |
| NAV | Navigability & Discoverability | Can it find the right files without thrashing? | Structure, naming, file size, entrypoints |
| TSC | Contracts & Explicitness | Are interfaces typed and explicit enough to change safely? | Schemas, types, validators, fewer implicit assumptions |
| TEST | Validation Infrastructure | Can changes be checked credibly? | High leverage |
| ENV | Local Operability | Can the agent discover how to run and verify the repo locally? | High leverage |
| MOD | Change Boundaries & Modularity | Can it make focused changes without huge blast radius? | High leverage |
| CON | Conventions & Example Density | Are there stable patterns the agent can imitate? | Lower leverage |
| ERR | Diagnostics & Recoverability | When things fail, does the repo tell you why? | Logs, unhappy-path tests, observability |
| CICD | Automated Feedback Loops | Is there reliable non-local validation? | CI gates and automated review pressure |
| AGT | Agent Guidance & Guardrails | Was the repo prepared for AI agents on purpose? | High leverage |
Profile-aware weighting

ARES changes category weights by repo type: CLI, library, app, service, monorepo. That means a tiny CLI is not judged like a sprawling monorepo. A surprisingly sane choice.
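Profile-aware weighting is simple to sketch. The profile names below match the article, but the weight values and the weighted-mean math are invented for illustration — ARES's actual numbers are not stated here.

```javascript
// Hypothetical per-profile weight tables (values are assumptions).
const WEIGHTS = {
  cli:      { MRC: 1.0, NAV: 0.8, TEST: 1.0, ENV: 1.2, MOD: 0.6, AGT: 1.0 },
  monorepo: { MRC: 1.0, NAV: 1.4, TEST: 1.2, ENV: 1.0, MOD: 1.4, AGT: 1.0 },
};

// Weighted mean of category scores; unknown categories default to weight 1.
function weightedScore(profile, categoryScores) {
  const weights = WEIGHTS[profile];
  let total = 0, weightSum = 0;
  for (const [code, score] of Object.entries(categoryScores)) {
    const w = weights[code] ?? 1.0;
    total += score * w;
    weightSum += w;
  }
  return total / weightSum; // same 0–10 scale as the inputs
}
```

Under this scheme a CLI with weak modularity (MOD weighted at 0.6) is penalized far less than a monorepo with the same weakness (MOD at 1.4), which is the whole point of profiling.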

Score caps and gates

If key conditions are weak — no install/test path, weak validation loop, no CI in a non-trivial repo, missing agent guidance, bad change boundaries — the overall score gets capped. That prevents fake-high scores from superficial polish.
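The gate mechanism can be sketched as a chain of ceilings applied after scoring. The specific cap values below are invented — the article only says that weak conditions cap the overall score, not by how much.

```javascript
// Hedged sketch of score-cap gates on a 0–10 scale (cap values assumed).
function applyGates(rawScore, signals) {
  let cap = 10;
  if (!signals.hasInstallAndTestPath) cap = Math.min(cap, 4);
  if (!signals.hasCI && signals.nonTrivialRepo) cap = Math.min(cap, 6);
  if (!signals.hasAgentGuidance) cap = Math.min(cap, 7);
  if (signals.poorChangeBoundaries) cap = Math.min(cap, 6);
  // Polish can raise rawScore, but it can never punch through a failed gate.
  return Math.min(rawScore, cap);
}
```

The design choice worth noting: caps are applied after weighting, so no amount of README polish can compensate for a missing validation loop.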

4 — Why this is better than a dumb scanner
Deterministic scanner only:

  • README exists? nice.
  • Tests exist? maybe.
  • CI exists? check.
  • Score comes from visible repo markers.
  • Fast and repeatable, but shallow.

ARES actual design:

  • Claude must inspect actual files, not just filenames.
  • Mandatory evidence set prevents lazy vibes-based scoring.
  • Secret-bearing files are intentionally excluded.
  • Rubric anchors + caps constrain the model.
  • Result is slower, but much closer to the real question: could an agent survive here?
Best part of the repo: it knows heuristics are not enough. File counts can tell you that tests exist; they cannot tell you whether those tests make future edits safe. ARES uses heuristics for signal gathering, then hands the final judgment to a constrained review workflow.
5 — Limits and blind spots
Heuristic limits

Still pattern-based

Many analyzers are smart grep plus repo-shape inference: giant files, utils/, TS strict mode, lockfiles, validator deps, test-to-source ratios. Useful, but nowhere near deep semantic understanding.

Execution limits

Read-first means no runtime truth

/ares intentionally avoids running the repo. Safer, yes. But that also means it cannot verify whether the documented commands actually work. It measures discoverability and apparent operability, not actual runtime health.

Examples of where the scanner can over- or under-estimate reality:
  • a repo can have a beautiful README and still be operationally broken
  • a messy repo can work fine if the maintainers know its traps but never documented them
  • a high test count can still hide brittle or irrelevant coverage
  • modularity inference from file boundaries is useful, but not the same as understanding coupling
6 — Bottom line

ARES is not a linter, not a traditional static analyzer, and not generic repo scoring. It is a practical attempt to measure whether AI coding agents can operate safely and effectively in a codebase.

Its best design decision is the split:

  • Baseline — deterministic scanner: consistent, cheap, explainable structural signals
  • Final judgment — Claude skill workflow: evidence-backed reasoning with explicit rules, caps, and reporting discipline

That architecture is the reason the repo is interesting. Without the scanner, the workflow is vague. Without the judgment layer, the score is fake precision. Together, it’s actually useful.