Replay regression harness
Re-roll a candidate policy against historical episodes recorded by a baseline policy, then read the diff in the portal. Same shape the marketing landing page promises - actual numbers now, populated from the customer's own training runs.
The runner is customer-side: your policy_callable runs on your
machine, against weights we never see. The SDK only uploads the
per-episode metric blob and a metadata-only "replay" episode so the
portal can drill from the eval row back to its replay.
Compares neatly to log_episode's source="replay" flag - that's the episode source the harness stamps on every replay it mints. The two surfaces are designed to be used together.
When to reach for it
| You want to… | Reach for |
|---|---|
| Re-roll a new policy across yesterday's nightly batch | robotrace replay run (CLI) |
| Same thing from a CI pipeline / training script | robotrace.evals (Python) |
| One-off "does v13 beat v12?" check | robotrace replay run --dry-run |
| Promote a candidate to production after a sweep | Read complete_run(...)["summary"] |
The three verbs
The Python surface is intentionally small - three module-level functions, plus two dataclasses for the return values.
import robotrace as rt

run = rt.evals.create_run(
    candidate_policy_version="pap-v4.0.0-rc1",
    baseline_episode_ids=["ep-…", "ep-…", "ep-…"],
    baseline_policy_version="pap-v3.2.1",
    name="nightly v4 sweep",
)
rt.evals.run_against(run, policy_callable=my_policy)
summary = rt.evals.complete_run(run)
print(summary["summary"]["success_rate"]["delta"])

That's the whole loop. Each verb is documented in detail below.
create_run(...)
Opens a new campaign on the server (status="pending") and seeds
one eval_results row per baseline episode. Returns an EvalRun
handle you carry through the loop.
def create_run(
    *,
    candidate_policy_version: str,  # required
    baseline_episode_ids: Sequence[str],  # at least one
    baseline_policy_version: str | None = None,
    name: str | None = None,
    metadata: Mapping[str, Any] | None = None,
    client: Client | None = None,
) -> EvalRun

baseline_episode_ids must all belong to your client - the ingest
route runs a cross-tenant guard on every id and refuses the request
with a 404 if any one of them isn't yours. This is the same RLS
rule that protects the portal.
metadata is a free-form jsonb stamp on the run row. We recommend
stuffing the candidate weights hash, the runner machine, and any
CI build URL here - it shows up unmodified in the portal detail page.
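A minimal sketch of assembling such a metadata dict, assuming the candidate weights live in a local file and the CI system exposes its build URL in an environment variable. The helper name build_run_metadata, the key names, and the CI_BUILD_URL variable are illustrative assumptions, not part of the SDK; only the resulting dict would be passed as metadata= to create_run.

```python
import hashlib
import os
import platform
from pathlib import Path


def build_run_metadata(weights_path, env=None):
    """Collect a weights hash, the runner hostname, and a CI URL (if any).

    Key names and the CI_BUILD_URL variable are assumptions for illustration.
    """
    env = os.environ if env is None else env
    digest = hashlib.sha256()
    with Path(weights_path).open("rb") as fh:
        # Hash in 1 MiB chunks so large checkpoints never sit fully in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    metadata = {
        "weights_sha256": digest.hexdigest(),
        "runner_host": platform.node(),
    }
    ci_url = env.get("CI_BUILD_URL")
    if ci_url:
        metadata["ci_build_url"] = ci_url
    return metadata
```

Only the hash leaves the machine, which keeps the weights themselves out of the portal.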
run_against(run, *, policy_callable, ...)
Walks every baseline episode in run.baseline_episode_ids. For each
one:
- Downloads the baseline's actions.npz + sensors.npz via the artifact resolver (signed R2 GET, same auth guard as the portal).
- Iterates per-step observations through your policy_callable, collecting candidate actions.
- Computes the per-episode metrics (success / reward / collisions / time-to-goal / L2 / OOD share).
- Calls log_episode(source="replay", ...) so the portal has a real replay-episode row to drill into. The replay carries metadata.eval_run_id + metadata.baseline_episode_id so the episode detail page renders a "Part of eval run" pill.
- Uploads the metric blob via POST /api/ingest/eval-run/<id>/result.
def run_against(
    run: EvalRun,
    *,
    policy_callable: PolicyCallable,
    on_episode: Callable[[EvalResult], None] | None = None,
    dry_run: bool = False,
) -> list[EvalResult]

Failures inside policy_callable are caught per-episode - the
traceback is printed to your stderr (so you can debug locally) and
the result row lands as status="failed" with a truncated error
string. The loop keeps going so one bad observation doesn't sink the
whole campaign.
Pass on_episode= to get a callback after each baseline lands - the
CLI uses it to print live progress. Pass dry_run=True to skip the
log_episode + /result uploads entirely (still fetches baselines,
still computes metrics - useful for iterating on policy_callable
without polluting the portal).
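The per-episode failure isolation described above can be sketched as follows. This is not the SDK's implementation: replay_one is a hypothetical stand-in for the download, iterate, and score steps, the result dicts are illustrative rather than the real EvalResult, and the "ok" status name and the 200-character truncation limit are assumptions.

```python
import sys
import traceback


def replay_all(episode_ids, replay_one, max_error_len=200):
    """Run replay_one(episode_id) for each id, isolating failures per episode.

    replay_one and the dict shape are hypothetical, for illustration only.
    """
    results = []
    for episode_id in episode_ids:
        try:
            metrics = replay_one(episode_id)
            results.append(
                {"episode_id": episode_id, "status": "ok", "metrics": metrics}
            )
        except Exception as exc:  # one bad episode must not sink the campaign
            # Full traceback goes to stderr for local debugging.
            traceback.print_exc(file=sys.stderr)
            # Only a truncated error string is kept on the result row.
            error = f"{type(exc).__name__}: {exc}"[:max_error_len]
            results.append(
                {"episode_id": episode_id, "status": "failed", "error": error}
            )
    return results
```

Catching Exception rather than BaseException means Ctrl-C still stops the loop.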
complete_run(run)
Closes the campaign, triggers the server-side rollup, and returns the parsed response. CI scripts can read the rollup directly:
out = rt.evals.complete_run(run)
delta = out["summary"]["success_rate"]["delta"]
if delta is not None and delta < -0.05:
    raise SystemExit(f"v4 lost {-delta:.1%} success - not shipping.")

The shape of summary is documented under
Reading the rollup below.
The policy_callable contract
Observation = dict[str, Any]  # one step's worth of sensors
Action = dict[str, Any]  # one step's worth of actions
PolicyCallable = Callable[[Observation], Action]

Both Observation and Action are plain dicts keyed by the
namespace the baseline was recorded with. For a ROS 2 episode logged
via the adapter you get:
import numpy as np

def my_policy(obs):
    # obs == {
    #     "/joint_states/position": ndarray[7],
    #     "/joint_states/velocity": ndarray[7],
    #     "/camera/rgb": ndarray[480, 640, 3],
    # }
    ...
    return {
        "/cmd_vel/linear": np.array([0.2, 0.0, 0.0]),
        "/cmd_vel/angular": np.array([0.0, 0.0, 0.1]),
    }

You can return any subset of the baseline's action keys - the L2 distance metric pairs them up by key and ignores any that don't match. Keys you make up that the baseline never recorded are also fine; they are simply ignored when computing the L2.
The shape of the per-step arrays must match the baseline's shape at
the same key ((action_dim,), (H, W, 3), etc.) - see the
ROS 2 and LeRobot adapter
docs for the conventions each encoder follows.
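One plausible reading of the key-pairing rule, as a sketch: compute the distance over the keys both actions share and skip everything else. The function name step_l2 and the choice to concatenate all shared keys into a single Euclidean norm are assumptions; the runner's actual aggregation is not specified here.

```python
import numpy as np


def step_l2(baseline_action, candidate_action):
    """L2 distance over the keys both actions share; None if nothing overlaps.

    Illustrative only: how the real runner aggregates across keys is an assumption.
    """
    shared = sorted(set(baseline_action) & set(candidate_action))
    if not shared:
        return None
    squared = 0.0
    for key in shared:
        base = np.asarray(baseline_action[key], dtype=float)
        cand = np.asarray(candidate_action[key], dtype=float)
        # Shapes must agree at the same key, as the contract requires.
        if base.shape != cand.shape:
            raise ValueError(
                f"shape mismatch at {key!r}: {base.shape} vs {cand.shape}"
            )
        squared += float(np.sum((base - cand) ** 2))
    return squared ** 0.5
```

Keys present on only one side never enter the sum, which matches the "ignored" behaviour for both missing and made-up keys.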
The _outcome sentinel
The five marketing metrics include success, reward_total,
collision_count, and time_to_goal_s. The default behavior is to
read these from the baseline episode's metadata and assume the
candidate matches - fine for "did the candidate produce sensible
actions?" sweeps, useless for "did the candidate actually solve the
task better?" sweeps.
If your policy_callable can compute the outcome at the policy
level, return a _outcome key in the last action of each episode:
def my_policy(obs):
    if is_last_step(obs):
        return {
            "/cmd_vel/linear": np.array([0.0, 0.0, 0.0]),
            "_outcome": {
                "success": True,
                "reward_total": 14.1,
                "collision_count": 0,
                "time_to_goal_s": 16.3,
            },
        }
    return run_inference(obs)

The runner pulls those values into the candidate columns of the
metric blob - success_candidate, reward_total_candidate, etc. -
and the portal DiffCard renders the delta against the baseline.
Keys that aren't in _outcome fall through to the baseline's value
(so the delta reads 0). The sentinel is opt-in: omit it and
the run still completes; you just see no movement on those metrics.
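The fall-through behaviour can be sketched like this. The helper names are hypothetical, and the collision_count_candidate and time_to_goal_s_candidate column names are extrapolated from the two the text names; treat this as an illustration of the rule, not the runner's code.

```python
OUTCOME_KEYS = ("success", "reward_total", "collision_count", "time_to_goal_s")


def split_outcome(last_action):
    """Separate the optional _outcome dict from the real action keys."""
    action = {k: v for k, v in last_action.items() if k != "_outcome"}
    return action, dict(last_action.get("_outcome") or {})


def candidate_columns(baseline_outcome, reported_outcome):
    """Build *_candidate columns, falling back to the baseline value per key."""
    return {
        f"{key}_candidate": reported_outcome.get(key, baseline_outcome.get(key))
        for key in OUTCOME_KEYS
    }
```

A key missing from _outcome ends up with the baseline's value, so its delta is 0 by construction.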
CLI: robotrace replay run
The CLI verb is robotrace.evals plus argparse plus pretty
progress output. Same semantics, one fewer Python file to babysit.
$ robotrace replay run \
--policy my_pkg.policies:v4_rc1 \
--candidate-version pap-v4.0.0-rc1 \
--baseline-episodes ep-d3e1,ep-9a4f,ep-b21c \
--baseline-version pap-v3.2.1 \
--name "nightly v4 sweep"
Replay run: candidate=pap-v4.0.0-rc1 baseline=pap-v3.2.1 episodes=3
✓ Eval run created: a17c9d2e-…
Portal: https://app.robotrace.dev/portal/evals/a17c9d2e-…
✓ [1/3] ep-d3e1 → c1f2a4b8
✓ [2/3] ep-9a4f → 7e3d1f2c
✓ [3/3] ep-b21c → 4a9bc7d1
success_rate baseline 0.67 candidate 0.83 Δ +0.17
reward_mean baseline 11.1 candidate 16.4 Δ +5.30
collision_rate baseline 1.67 candidate 0.33 Δ −1.34
time_to_goal_s baseline 23.5 candidate 19.1 Δ −4.40
ood_action_share baseline - candidate 0.06
Recommend: ship (4/4 metrics better).

Flags
| Flag | Required | What |
|---|---|---|
| --policy MODULE:FN | yes | Importable path to your policy callable |
| --candidate-version V | yes | Stable identifier for the candidate (e.g. pap-v4.0.0-rc1) |
| --baseline-episodes | yes | Comma-list of ids, or @path/to/ids.txt (one id per line) |
| --baseline-version V | no | Stamps eval_runs.baseline_policy_version |
| --name STR | no | Human label, falls back to candidate version |
| --dry-run | no | Skip the upload - only fetches + computes locally |
| --profile NAME | no | Pick a non-default profile from ~/.robotrace/credentials |
--policy module:fn
Same convention as gunicorn - package.subpackage.module:attr,
where attr can be dotted (pkg.mod:cls.method). The module has to
be importable from the directory where you run robotrace, so make
sure PYTHONPATH (or your project's editable install) covers it.
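A minimal sketch of how a MODULE:FN spec of this form can be resolved with the standard library. The function name resolve_policy is hypothetical and the CLI's real loader may differ; the sketch only shows the convention.

```python
import importlib
from functools import reduce


def resolve_policy(spec):
    """Resolve 'package.module:attr' (attr may be dotted) to a callable."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected MODULE:FN, got {spec!r}")
    # Raises ImportError if the module is not on sys.path / PYTHONPATH.
    module = importlib.import_module(module_name)
    # Walk dotted attributes, e.g. "cls.method" -> module.cls.method.
    target = reduce(getattr, attr_path.split("."), module)
    if not callable(target):
        raise TypeError(f"{spec!r} resolved to a non-callable {type(target).__name__}")
    return target
```

Running resolve_policy("my_pkg.policies:v4_rc1") in a Python shell from the same directory is a quick way to check importability before launching a sweep.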
--baseline-episodes
Two forms:
# Inline, comma-separated
--baseline-episodes ep-d3e1,ep-9a4f,ep-b21c
# From a file (one id per line, blank lines ignored)
--baseline-episodes @nightly-2026-05-19.txt

The file form is what you want in CI - generate the list with a
short psql or portal-export script and feed it in.
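The two accepted forms can be parsed along these lines. This is a sketch with a hypothetical function name, not the CLI's own parser; it assumes a UTF-8 text file and treats whitespace-only entries as blank.

```python
from pathlib import Path


def parse_baseline_episodes(value):
    """Accept 'id1,id2,...' or '@path/to/ids.txt' (one id per line)."""
    if value.startswith("@"):
        # File form: everything after "@" is a path.
        lines = Path(value[1:]).read_text(encoding="utf-8").splitlines()
        ids = [line.strip() for line in lines]
    else:
        # Inline form: comma-separated ids.
        ids = [part.strip() for part in value.split(",")]
    ids = [episode_id for episode_id in ids if episode_id]  # drop blanks
    if not ids:
        raise ValueError("at least one baseline episode id is required")
    return ids
```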
Reading the rollup
complete_run(run) returns the same body the
POST /api/ingest/eval-run/<id>/finalize route writes. The
useful part is summary:
{
  "success_rate": {"baseline": 0.67, "candidate": 0.83, "delta": 0.17, "delta_is_better": true},
  "reward_mean": {"baseline": 11.1, "candidate": 16.4, "delta": 5.30, "delta_is_better": true},
  "collision_rate": {"baseline": 1.67, "candidate": 0.33, "delta": -1.34, "delta_is_better": true},
  "time_to_goal_s": {"baseline": 23.5, "candidate": 19.1, "delta": -4.40, "delta_is_better": true},
  "ood_action_share": {"baseline": null, "candidate": 0.06, "delta": null, "delta_is_better": null},
  "better_count": 4,
  "metric_total": 4,
  "recommend": "ship"
}

delta_is_better is sign-aware: success_rate going up is good,
collision_rate going down is good - the server-side rollup already
knows which direction is which, so your CI script doesn't have to.
recommend is "ship" when a clear majority of metrics moved in
the right direction and "hold" otherwise. Conservative on purpose - robotics teams don't ship on a coin flip.
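A CI gate over a summary dict shaped like the JSON above might look like the sketch below. The function name gate and the 5% success-drop threshold are assumptions (the threshold mirrors the earlier complete_run example); the policy it encodes is one team's choice, not something the server enforces.

```python
def gate(summary, max_success_drop=0.05):
    """Return (ok, regressed) for a rollup summary dict.

    ok is True only when recommend == "ship" and success_rate did not
    drop by more than max_success_drop; regressed lists metrics whose
    delta_is_better flag is explicitly False.
    """
    regressed = sorted(
        name
        for name, entry in summary.items()
        # Skip scalar fields such as better_count and recommend.
        if isinstance(entry, dict) and entry.get("delta_is_better") is False
    )
    delta = summary.get("success_rate", {}).get("delta")
    # A null delta (no baseline to compare against) does not block on its own.
    success_ok = delta is None or delta >= -max_success_drop
    ok = summary.get("recommend") == "ship" and success_ok
    return ok, regressed
```

Because delta_is_better is already sign-aware, the gate never needs to know which direction is good for each metric.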
Errors
robotrace.evals raises the same typed exceptions as the rest of
the SDK. See Errors for the full hierarchy.
| Exception | When |
|---|---|
| ConfigurationError | Missing candidate_policy_version, empty baseline list |
| AuthError | API key bad, revoked, or doesn't own the baseline episodes |
| NotFoundError | A baseline episode id doesn't exist (or isn't yours) |
| ValidationError | Server rejected the metric payload (out-of-shape, etc.) |
| TransportError | Network failure during artifact download or result POST |
Errors inside policy_callable are caught per-episode and recorded
as status="failed" rows on the result table - they don't propagate
out of run_against. The traceback is still printed to stderr for
local debugging.
What we explicitly don't ship (V0)
The harness is intentionally narrow in V0 - see the project canvas for what graduates in V1:
- Webhooks (eval_run.completed event) - V1, Team-tier bullet.
- Hosted runner - Phase 2+. The schema already has the runner_kind column to accommodate one without a migration.
- Cross-run trendlines (v13 vs v14 vs v15) - V1.
- CI-triggered regressions - V1.
Don'ts
- Don't rely on the _outcome sentinel for safety-critical shipping decisions without also computing the metric out-of-band - the runner trusts what your policy returns, by design.
- Don't call run_against from inside an asyncio loop. The runner is sync; wrap the whole call in asyncio.to_thread(...) if you must.
- Don't put weights or trade secrets in metadata={...}. Stick to hashes and CI URLs - the metadata blob is visible to anyone in your org who can see the portal.
- Don't delete a baseline episode while a client runner still has in-flight uploads for that eval run unless you accept losing the matching eval_results rows: deleting an episode cascade-deletes the rows that reference it through eval_results.baseline_episode_id, which keeps the portal consistent but can drop per-episode rows you expected to fill. Prefer letting the runner finish or cancel cleanly first.
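For the asyncio item in the list above, the wrapping pattern looks like this sketch. blocking_replay is a hypothetical stand-in for a synchronous run_against(...) call, so the example runs without the SDK; asyncio.to_thread requires Python 3.9 or newer.

```python
import asyncio
import time


def blocking_replay(n_episodes):
    """Hypothetical stand-in for a synchronous run_against(...) call."""
    time.sleep(0.05)  # simulate blocking download + inference work
    return [f"result-{i}" for i in range(n_episodes)]


async def main():
    # Off-load the whole sync call to a worker thread so the loop stays free.
    replay = asyncio.create_task(asyncio.to_thread(blocking_replay, 3))
    ticks = 0
    while not replay.done():
        ticks += 1  # other coroutines keep running in the meantime
        await asyncio.sleep(0.01)
    return await replay, ticks


results, ticks = asyncio.run(main())
```

Wrapping the whole call, rather than individual steps inside it, keeps the runner's own sequencing intact.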