Transparency Tiers
Every listed agent shows a transparency tier from 1 to 5. The tier is computed from the evidence the author has attached to the listing — it is not a quality rating, an accuracy score, or an endorsement. It tells readers how much of the agent's behavior is inspectable.
The registry's bias is toward making transparency visible rather than required. A Tier 1 listing is welcome; a Tier 4 listing is more useful to a careful reader; a Tier 5 listing has been independently reproduced.
The tiers
| Tier | Label | Earned by |
|---|---|---|
| 1 | Conformant | Required disclosures present (intended use, out-of-scope, known failures, declared guardrails, validation with a real caveat, worked example). Default for every accepted listing. |
| 2 | Benchmarked | Tier 1, plus the agent links to a benchmark run on a recognized platform (AKD Labs, LangSmith, Weights & Biases, Weave, an OTLP export hosted on S3 or another HTTPS endpoint) or to a published dataset, for example on Zenodo. |
| 3 | Reproducible eval | Tier 2, plus the run's per-query results are reachable at the link — anyone can audit which queries passed and which failed. |
| 4 | Traceable | Tier 3, plus ≥ 1 sample trace is attached. The trace must show the actual tool calls and the model interaction end-to-end. |
| 5 | Independently reproduced | Tier 4, plus a recorded third-party reproduction by a different ORCID. The reproducer's identity, date, and outcome are visible. |
Tiers 2–4 are machine-checkable: the registry pings the linked platform's public API at view time and verifies that the linked resources exist and return the expected shape. A broken link drops the listing's tier until the author fixes it.
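The cumulative rule in the table (each tier requires everything below it) can be sketched as follows. This is a minimal illustration, not the registry's implementation: the `Evidence` class, its field names, and the use of `0` for a listing that has not yet been accepted are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of what a listing has attached."""
    disclosures_complete: bool = False      # Tier 1 requirement
    benchmark_link_ok: bool = False         # Tier 2 requirement
    per_query_results_ok: bool = False      # Tier 3 requirement
    sample_traces: int = 0                  # Tier 4 needs at least one
    independent_reproduction: bool = False  # Tier 5 requirement


def compute_tier(ev: Evidence) -> int:
    """Return the highest tier whose requirements, and all lower ones, hold.

    Returns 0 when even the Tier 1 disclosures are missing (assumed to mean
    the listing would not be accepted).
    """
    checks = [
        ev.disclosures_complete,
        ev.benchmark_link_ok,
        ev.per_query_results_ok,
        ev.sample_traces >= 1,
        ev.independent_reproduction,
    ]
    tier = 0
    for ok in checks:
        if not ok:
            break  # a gap at any level stops the climb
        tier += 1
    return tier
```

Because the loop stops at the first unmet requirement, a broken benchmark link holds a listing at Tier 1 even if traces and a reproduction are attached, which mirrors the tier drop described above.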
What "recognized platform" means
For Tier 2 and above, the linked evidence must live on a platform the registry can interpret:
- AKD Labs — UAH-hosted; first-class integration. Run / trace / dataset summaries render inline on the agent page.
- LangSmith, Weights & Biases, Weave — recognized by ID + URL; the registry shows a link-out with the platform name and a "self-hosted evidence" badge.
- OTLP / OpenTelemetry — accepted as an exported JSON blob hosted on S3 or any HTTPS endpoint.
- Zenodo — accepted for benchmark datasets and reproduction records.
- Other — accepted as a URL only; the listing's tier is capped at 2 because the registry cannot verify the structure.
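The cap for unrecognized platforms can be pictured as a clamp applied after the tier is computed. The sketch below is illustrative only: the platform keys, the `effective_tier` name, and the idea that the platform arrives as a declared string are assumptions, not the registry's actual schema.

```python
# Hypothetical lowercase keys for the platforms named in the list above.
RECOGNIZED_PLATFORMS = {
    "akd labs",
    "langsmith",
    "weights & biases",
    "weave",
    "otlp",
    "zenodo",
}

OTHER_PLATFORM_CAP = 2  # from the "Other" rule: tier is capped at 2


def effective_tier(earned_tier: int, platform: str) -> int:
    """Clamp the tier when the evidence lives on an unrecognized platform."""
    if platform.strip().lower() in RECOGNIZED_PLATFORMS:
        return earned_tier
    return min(earned_tier, OTHER_PLATFORM_CAP)
```

For example, `effective_tier(4, "LangSmith")` stays at 4, while `effective_tier(4, "my-own-dashboard")` is clamped to 2.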
What the tier does NOT certify
The tier is about inspectability, not quality:
- A Tier 4 listing can still be a bad agent. The validation numbers can be embarrassing. The tool trace can show the agent making poor decisions. That's the point — you can see for yourself.
- A Tier 1 listing can be a great agent whose author hasn't published evaluation evidence (yet). The registry doesn't penalize working agents for missing eval infrastructure.
- The tier system is not a competition between agents. Each listing is measured only against its own earlier versions: authors upgrade their listings as they accumulate evidence.
Earning Tier 5
Independent reproduction is the hardest tier to earn because it requires someone else to run the agent and post a result. The registry supports this via:
- Reproduction submissions — a separate workflow where an ORCID-verified third party submits a record (date, outcome, optional URL to their own run on any platform).
- Public proof of independence — the reproducer's ORCID is not the same as any author's ORCID; the registry checks this automatically.
- Outcome can be negative — a reproduction that fails to match the original is still recorded. Negative reproductions don't increase the tier but they do appear on the listing for readers.
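The independence check and the handling of negative outcomes can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, the ORCID values in the usage note are placeholders, and the sketch only compares identifiers; it does not verify them against ORCID.

```python
def normalize_orcid(value: str) -> str:
    """Reduce 'https://orcid.org/0000-...' or a bare ID to the bare ID."""
    return value.strip().rstrip("/").rsplit("/", 1)[-1].upper()


def is_independent(reproducer: str, authors: list[str]) -> bool:
    """True when the reproducer's ORCID matches none of the authors'."""
    author_ids = {normalize_orcid(a) for a in authors}
    return normalize_orcid(reproducer) not in author_ids


def tier_after_reproduction(
    current_tier: int,
    reproducer: str,
    authors: list[str],
    outcome_positive: bool,
) -> int:
    """Raise a Tier 4 listing to 5 only for an independent, positive result.

    Negative or non-independent submissions leave the tier unchanged; the
    record itself would still be shown on the listing.
    """
    if current_tier == 4 and outcome_positive and is_independent(reproducer, authors):
        return 5
    return current_tier
```

With `authors = ["https://orcid.org/0000-0002-1825-0097"]`, a positive reproduction from `0000-0001-5109-3700` moves a Tier 4 listing to 5, while the same submission with a negative outcome, or one from the author's own ORCID, leaves it at 4.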
Why this exists
A registry of scientific agents that doesn't expose evidence becomes a marketplace of claims. The tier system tries to make the shape of evidence legible without forcing every author to produce it. Authors who care about transparency get visible recognition; readers who care about transparency can filter and sort by tier.
This pattern is adapted from the ML Reproducibility Checklist and OpenReview's reproducibility badges, but anchored to the agent-specific artifacts (runs, traces, tool-usage profiles) that make agent behavior auditable in a way papers about agents typically cannot.
How tier changes happen
- Author publishes a new version with more evidence attached → tier recomputes on submission.
- A linked platform's public URL stops resolving → tier drops the next time the registry verifies (within a day). Author is notified and has 30 days to fix before any further action.
- A reproduction that matches the original is submitted and accepted for a Tier 4 listing → tier rises to 5.
- A reproduction record is withdrawn → tier drops back to the highest tier the remaining evidence supports.
All tier changes are logged in the agent's public audit trail (/api/v1/audit/agent/{id}).
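As a small sketch of working with these rules from client code, the helpers below build the audit-trail URL for an agent and compute the end of the 30-day fix window. The base URL `https://registry.example` in the usage note is a placeholder, and both function names are hypothetical; only the path template and the 30-day figure come from the text above.

```python
from datetime import date, timedelta
from urllib.parse import quote, urljoin

AUDIT_PATH = "/api/v1/audit/agent/{id}"  # path template from the text above
FIX_WINDOW_DAYS = 30                     # author's window after a broken link


def audit_url(base_url: str, agent_id: str) -> str:
    """Build the audit-trail URL, percent-encoding the agent ID."""
    return urljoin(base_url, AUDIT_PATH.format(id=quote(agent_id, safe="")))


def fix_deadline(notified_on: date) -> date:
    """Date on which the fix window ends (assumes it starts at notification)."""
    return notified_on + timedelta(days=FIX_WINDOW_DAYS)
```

For instance, `audit_url("https://registry.example", "agent-42")` yields `https://registry.example/api/v1/audit/agent/agent-42`, and an author notified on 1 January 2025 would have until 31 January 2025.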