Hosted and managed by the University of Alabama in Huntsville

Agentarium

scientific agent registry

Transparency Tiers

Every listed agent shows a transparency tier from 1 to 5. The tier is computed from the evidence the author has attached to the listing — it is not a quality rating, an accuracy score, or an endorsement. It tells readers how much of the agent's behavior is inspectable.

The registry's bias is toward making transparency visible rather than required. A Tier 1 listing is welcome; a Tier 4 listing is more useful to a careful reader; a Tier 5 listing has been independently reproduced.

The tiers

Tier | Label | Earned by
1 | Conformant | Required disclosures present (intended use, out-of-scope, known failures, declared guardrails, validation with a real caveat, worked example). Default for every accepted listing.
2 | Benchmarked | Tier 1, plus the agent links to a benchmark run on a recognized platform (AKD Labs, LangSmith, Weights & Biases, Weave, an OTLP export hosted on S3 or another HTTPS endpoint, or a published dataset such as one on Zenodo).
3 | Reproducible eval | Tier 2, plus the run's per-query results are reachable at the link, so anyone can audit which queries passed and which failed.
4 | Traceable | Tier 3, plus at least one sample trace is attached. The trace must show the actual tool calls and the model interaction end-to-end.
5 | Independently reproduced | Tier 4, plus a recorded third-party reproduction by a different ORCID. The reproducer's identity, date, and outcome are visible.

Tiers 2–4 are machine-checkable: the registry pings the linked platform's public API at view time and verifies that the linked resources exist and return the expected shape. A broken link lowers the listing's tier until the author fixes it.
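
The cumulative ladder described above can be sketched as a small function. This is a minimal illustration, not the registry's implementation: the `Evidence` class, its field names, and `compute_tier` are hypothetical, and each flag is assumed to be true only when the evidence is attached and its link currently verifies. Under that assumption a broken link simply lowers the tier to the highest rung that still holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence flags for one listing (names are illustrative)."""
    disclosures_complete: bool         # Tier 1: required disclosures present
    benchmark_run_linked: bool         # Tier 2: benchmark run on a recognized platform
    per_query_results_reachable: bool  # Tier 3: per-query results reachable at the link
    sample_traces: int                 # Tier 4: needs at least one attached trace
    independent_reproduction: bool     # Tier 5: recorded third-party reproduction


def compute_tier(e: Evidence) -> int:
    """Return the highest tier whose requirement, and every lower one, holds.

    Returns 0 when even the Tier 1 disclosures are missing, i.e. the
    listing would not be accepted at all.
    """
    checks = [
        e.disclosures_complete,
        e.benchmark_run_linked,
        e.per_query_results_reachable,
        e.sample_traces >= 1,
        e.independent_reproduction,
    ]
    tier = 0
    for ok in checks:
        if not ok:
            break  # tiers are cumulative: stop at the first unmet requirement
        tier += 1
    return tier


traceable = Evidence(True, True, True, 1, False)
broken_benchmark_link = Evidence(True, False, True, 1, False)
print(compute_tier(traceable))              # 4
print(compute_tier(broken_benchmark_link))  # 1
```

Because each tier builds on the one below it, a listing with traces but an unreachable benchmark run falls back to Tier 1 in this sketch until the link is repaired.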

What "recognized platform" means

For Tier 2 and above, the linked evidence must live on a platform the registry can interpret:

  • AKD Labs — UAH-hosted; first-class integration. Run / trace / dataset summaries render inline on the agent page.
  • LangSmith, Weights & Biases, Weave — recognized by ID + URL; the registry shows a link-out with the platform name and a "self-hosted evidence" badge.
  • OTLP / OpenTelemetry — accepted as an exported JSON blob hosted on S3 or any HTTPS endpoint.
  • Zenodo — accepted for benchmark datasets and reproduction records.
  • Other — accepted as a URL only; the listing's tier is capped at 2 because the registry cannot verify the structure.
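
The cap for unrecognized platforms can be expressed in a few lines. The sketch below assumes a simple name lookup; the `RECOGNIZED_PLATFORMS` set, the function names, and the idea that recognized platforms carry no cap are illustrative assumptions rather than the registry's actual rules.

```python
# Hypothetical lookup of platform names the registry can interpret.
RECOGNIZED_PLATFORMS = {
    "akd labs",
    "langsmith",
    "weights & biases",
    "weave",
    "otlp",
    "opentelemetry",
    "zenodo",
}

MAX_TIER = 5
UNRECOGNIZED_CAP = 2  # "Other" platforms: structure cannot be verified


def tier_cap(platform: str) -> int:
    """Return the highest tier a listing can reach with evidence on this platform."""
    if platform.strip().lower() in RECOGNIZED_PLATFORMS:
        return MAX_TIER
    return UNRECOGNIZED_CAP


def capped_tier(computed_tier: int, platform: str) -> int:
    """Apply the platform cap to a tier computed from the attached evidence."""
    return min(computed_tier, tier_cap(platform))


print(capped_tier(4, "LangSmith"))         # 4
print(capped_tier(4, "my-own-dashboard"))  # 2
```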

What the tier does NOT certify

The tier is about inspectability, not quality:

  • A Tier 4 listing can still be a bad agent. The validation numbers can be embarrassing. The tool trace can show the agent making poor decisions. That's the point — you can see for yourself.
  • A Tier 1 listing can be a great agent whose author hasn't published evaluation evidence (yet). The registry doesn't penalize working agents for missing eval infrastructure.
  • The tier system is not a competition between agents. If anything, each listing competes with its own earlier versions: authors upgrade their listings as they accumulate evidence.

Earning Tier 5

Independent reproduction is the hardest tier to earn because it requires someone else to run the agent and post a result. The registry supports this via:

  1. Reproduction submissions — a separate workflow where an ORCID-verified third party submits a record (date, outcome, optional URL to their own run on any platform).
  2. Public proof of independence — the reproducer's ORCID is not the same as any author's ORCID; the registry checks this automatically.
  3. Outcome can be negative — a reproduction that fails to match the original is still recorded. Negative reproductions don't increase the tier but they do appear on the listing for readers.
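
The three rules above can be sketched as follows. The `Reproduction` record, its field names, and both functions are hypothetical. The sketch assumes that only a matching, independent reproduction of a Tier 4 listing raises the tier, and that negative outcomes are recorded without changing it. The ORCID iDs in the usage lines are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Reproduction:
    """Hypothetical reproduction record (field names are illustrative)."""
    reproducer_orcid: str
    performed_on: date
    matched_original: bool          # False records a negative reproduction
    run_url: Optional[str] = None   # optional link to the reproducer's own run


def normalize_orcid(value: str) -> str:
    """Reduce an ORCID URL or bare iD to its bare identifier form."""
    return value.strip().rsplit("/", 1)[-1].upper()


def is_independent(repro: Reproduction, author_orcids: Iterable[str]) -> bool:
    """True when the reproducer's ORCID differs from every author's ORCID."""
    authors = {normalize_orcid(a) for a in author_orcids}
    return normalize_orcid(repro.reproducer_orcid) not in authors


def tier_after_reproduction(current_tier: int, repro: Reproduction,
                            author_orcids: Iterable[str]) -> int:
    """Raise a Tier 4 listing to Tier 5 only for an independent, matching reproduction."""
    if current_tier != 4:
        return current_tier  # Tier 5 builds on Tier 4; other tiers are unchanged
    if is_independent(repro, author_orcids) and repro.matched_original:
        return 5
    return current_tier      # negative or non-independent: tier does not rise


authors = ["0000-0002-1825-0097"]
positive = Reproduction("https://orcid.org/0000-0001-5109-3700", date(2025, 1, 15), True)
negative = Reproduction("0000-0001-5109-3700", date(2025, 1, 15), False)
print(tier_after_reproduction(4, positive, authors))  # 5
print(tier_after_reproduction(4, negative, authors))  # 4
```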

Why this exists

A registry of scientific agents that doesn't expose evidence becomes a marketplace of claims. The tier system tries to make the shape of evidence legible without forcing every author to produce it. Authors who care about transparency get visible recognition; readers who care about transparency can filter and sort by tier.

This pattern is adapted from the ML Reproducibility Checklist and OpenReview's reproducibility badges, but anchored to the agent-specific artifacts (runs, traces, tool-usage profiles) that make agent behavior auditable in a way that papers about agents typically are not.

How tier changes happen

  • Author publishes a new version with more evidence attached → tier recomputes on submission.
  • A linked platform's public URL stops resolving → tier drops the next time the registry verifies (within a day). Author is notified and has 30 days to fix before any further action.
  • A matching reproduction is submitted and accepted for a Tier 4 listing → tier rises to 5. Negative reproductions are recorded but do not raise the tier.
  • A reproduction record is withdrawn → tier drops back accordingly.
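
The 30-day repair window for a broken link is simple date arithmetic. The sketch below is illustrative only: `GRACE_DAYS`, `fix_deadline`, and `grace_expired` are hypothetical names, and it assumes the window is counted in calendar days from the date the broken link was detected.

```python
from datetime import date, timedelta

GRACE_DAYS = 30  # repair window stated in the policy


def fix_deadline(detected_on: date) -> date:
    """Last day on which the author can repair the link before further action."""
    return detected_on + timedelta(days=GRACE_DAYS)


def grace_expired(detected_on: date, today: date) -> bool:
    """True once the repair window has passed without a fix."""
    return today > fix_deadline(detected_on)


detected = date(2025, 1, 1)
print(fix_deadline(detected))                     # 2025-01-31
print(grace_expired(detected, date(2025, 1, 31)))  # False
print(grace_expired(detected, date(2025, 2, 1)))   # True
```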

All tier changes are logged in the agent's public audit trail (/api/v1/audit/agent/{id}).
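
A reader could retrieve the audit trail with a short script along these lines. Only the path `/api/v1/audit/agent/{id}` comes from this page. The base URL, the agent id, and the assumption that the endpoint returns JSON are placeholders, so treat this as a sketch rather than a documented client.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def audit_url(base_url: str, agent_id: str) -> str:
    """Build the audit-trail URL for one agent from the documented path."""
    return f"{base_url.rstrip('/')}/api/v1/audit/agent/{quote(agent_id, safe='')}"


def fetch_audit_trail(base_url: str, agent_id: str, timeout: float = 10.0):
    """Fetch and decode the audit trail, assuming the endpoint returns JSON."""
    with urlopen(audit_url(base_url, agent_id), timeout=timeout) as resp:
        return json.load(resp)


# Placeholder host and id; substitute the registry's real base URL.
print(audit_url("https://registry.example", "agent-123"))
# https://registry.example/api/v1/audit/agent/agent-123
```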