Replay regression harness

Re-roll a candidate policy against historical episodes recorded by a baseline policy, then read the diff in the portal. It is the same diff the marketing landing page promises, now with actual numbers populated from your own training runs.

The runner is customer-side: your policy_callable runs on your machine, against weights we never see. The SDK only uploads the per-episode metric blob and a metadata-only "replay" episode so the portal can drill from the eval row back to its replay.

The harness pairs with log_episode's source="replay" flag: that is the episode source the harness stamps on every replay it mints. The two surfaces are designed to be used together.

When to reach for it

| You want to… | Reach for |
|---|---|
| Re-roll a new policy across yesterday's nightly batch | robotrace replay run (CLI) |
| Same thing from a CI pipeline / training script | robotrace.evals (Python) |
| One-off "does v13 beat v12?" check | robotrace replay run --dry-run |
| Promote a candidate to production after a sweep | Read complete_run(...)["summary"] |

The three verbs

The Python surface is intentionally small - three module-level functions, plus two dataclasses for the return values.

import robotrace as rt
 
run = rt.evals.create_run(
    candidate_policy_version="pap-v4.0.0-rc1",
    baseline_episode_ids=["ep-…", "ep-…", "ep-…"],
    baseline_policy_version="pap-v3.2.1",
    name="nightly v4 sweep",
)
 
rt.evals.run_against(run, policy_callable=my_policy)
 
out = rt.evals.complete_run(run)
print(out["summary"]["success_rate"]["delta"])

That's the whole loop. Each verb is documented in detail below.

create_run(...)

Opens a new campaign on the server (status="pending") and seeds one eval_results row per baseline episode. Returns an EvalRun handle you carry through the loop.

def create_run(
    *,
    candidate_policy_version: str,                    # required
    baseline_episode_ids: Sequence[str],              # at least one
    baseline_policy_version: str | None = None,
    name: str | None = None,
    metadata: Mapping[str, Any] | None = None,
    client: Client | None = None,
) -> EvalRun

baseline_episode_ids must all belong to your client - the ingest route runs a cross-tenant guard on every id and refuses the request with a 404 if any one of them isn't yours. This is the same RLS rule that protects the portal.

metadata is a free-form jsonb stamp on the run row. We recommend stuffing the candidate weights hash, the runner machine, and any CI build URL here - it shows up unmodified in the portal detail page.
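
A minimal sketch of assembling such a metadata dict, assuming the candidate weights live in a single local file and that your CI exposes its build URL in an environment variable. The helper name build_run_metadata, the dict keys, and the CI_BUILD_URL variable are illustrative choices, not part of the SDK.

```python
import hashlib
import os
import platform
from pathlib import Path


def build_run_metadata(weights_path: str, ci_url_env: str = "CI_BUILD_URL") -> dict:
    """Collect a weights hash, the runner host, and the CI build URL (if set)."""
    digest = hashlib.sha256()
    with Path(weights_path).open("rb") as fh:
        # Hash in 1 MiB chunks so large checkpoints never have to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    metadata = {
        "weights_sha256": digest.hexdigest(),  # hypothetical key name
        "runner_host": platform.node(),        # hypothetical key name
    }
    ci_url = os.environ.get(ci_url_env)
    if ci_url:
        metadata["ci_build_url"] = ci_url      # hypothetical key name
    return metadata
```

The result can be passed as metadata=build_run_metadata(...) to create_run. Only the hash leaves your machine, never the weights themselves.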

run_against(run, *, policy_callable, ...)

Walks every baseline episode in run.baseline_episode_ids. For each one:

  1. Downloads the baseline's actions.npz + sensors.npz via the artifact resolver (signed R2 GET, same auth guard as the portal).
  2. Iterates per-step observations through your policy_callable, collecting candidate actions.
  3. Computes the per-episode metrics (success / reward / collisions / time-to-goal / L2 / OOD share).
  4. Calls log_episode(source="replay", ...) so the portal has a real replay-episode row to drill into. The replay carries metadata.eval_run_id + metadata.baseline_episode_id so the episode detail page renders a "Part of eval run" pill.
  5. Uploads the metric blob via POST /api/ingest/eval-run/<id>/result.

def run_against(
    run: EvalRun,
    *,
    policy_callable: PolicyCallable,
    on_episode: Callable[[EvalResult], None] | None = None,
    dry_run: bool = False,
) -> list[EvalResult]

Failures inside policy_callable are caught per-episode - the traceback is printed to your stderr (so you can debug locally) and the result row lands as status="failed" with a truncated error string. The loop keeps going so one bad observation doesn't sink the whole campaign.
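
The isolation pattern described above can be sketched as follows. This is not the runner's actual code: run_isolated, the "ok" status placeholder, the result dict layout, and the 200-character truncation limit are all assumptions made for illustration.

```python
import sys
import traceback
from typing import Any, Callable, Iterable

Observation = dict[str, Any]
Action = dict[str, Any]


def run_isolated(
    episodes: dict[str, Iterable[Observation]],
    policy_callable: Callable[[Observation], Action],
    max_error_chars: int = 200,  # assumed truncation limit
) -> list[dict[str, Any]]:
    """Run the policy over each episode, recording failures instead of raising."""
    results = []
    for episode_id, observations in episodes.items():
        try:
            actions = [policy_callable(obs) for obs in observations]
            # "ok" is a placeholder status for this sketch.
            results.append({"episode_id": episode_id, "status": "ok", "steps": len(actions)})
        except Exception as exc:
            # The full traceback goes to stderr for local debugging...
            traceback.print_exc(file=sys.stderr)
            # ...while the stored row keeps only a truncated error string.
            error = f"{type(exc).__name__}: {exc}"[:max_error_chars]
            results.append({"episode_id": episode_id, "status": "failed", "error": error})
    return results
```

Because the try block wraps one episode at a time, an exception on any step ends that episode only; the next episode starts fresh.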

Pass on_episode= to get a callback after each baseline lands - the CLI uses it to print live progress. Pass dry_run=True to skip the log_episode + /result uploads entirely (still fetches baselines, still computes metrics - useful for iterating on policy_callable without polluting the portal).
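
A small sketch of an on_episode callback that prints a running counter. It assumes nothing about the fields on EvalResult (the result is shown via repr), and make_progress_printer is a hypothetical helper name, not an SDK function.

```python
from typing import Any, Callable


def make_progress_printer(total: int) -> Callable[[Any], None]:
    """Return a callback that prints "[i/total]" each time an episode lands."""
    seen = 0

    def on_episode(result: Any) -> None:
        nonlocal seen
        seen += 1
        # repr() is used because this sketch makes no assumption about
        # which attributes the result object exposes.
        print(f"[{seen}/{total}] {result!r}")

    return on_episode
```

It would be wired in as on_episode=make_progress_printer(len(run.baseline_episode_ids)) when calling run_against.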

complete_run(run)

Closes the campaign, triggers the server-side rollup, and returns the parsed response. CI scripts can read the rollup directly:

out = rt.evals.complete_run(run)
delta = out["summary"]["success_rate"]["delta"]
if delta is not None and delta < -0.05:
    raise SystemExit(f"v4 lost {-delta:.1%} success - not shipping.")

The shape of summary is documented under Reading the rollup below.

The policy_callable contract

Observation = dict[str, Any]   # one step's worth of sensors
Action      = dict[str, Any]   # one step's worth of actions
PolicyCallable = Callable[[Observation], Action]

Both Observation and Action are plain dicts keyed by the namespace the baseline was recorded with. For a ROS 2 episode logged via the adapter you get:

import numpy as np

def my_policy(obs):
    # obs == {
    #   "/joint_states/position": ndarray[7],
    #   "/joint_states/velocity": ndarray[7],
    #   "/camera/rgb":            ndarray[480, 640, 3],
    # }
    ...
    return {
        "/cmd_vel/linear":  np.array([0.2, 0.0, 0.0]),
        "/cmd_vel/angular": np.array([0.0, 0.0, 0.1]),
    }

You can return any subset of the baseline's action keys: the L2 distance metric pairs actions by key and skips any baseline key you did not return. Keys you invent that the baseline never recorded are also fine; they are simply ignored when computing the L2.
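
To make the key-pairing rule concrete, here is a sketch for a single step. It uses plain float sequences rather than ndarrays to stay dependency-free, and it assumes the distance is the Euclidean norm over all shared keys concatenated. How the real runner aggregates across keys and steps is not specified here, and l2_over_shared_keys is a hypothetical name.

```python
import math
from typing import Mapping, Optional, Sequence

Vector = Sequence[float]


def l2_over_shared_keys(
    baseline: Mapping[str, Vector], candidate: Mapping[str, Vector]
) -> Optional[float]:
    """Euclidean distance over keys present in both actions; None if no overlap."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        return None
    squared = 0.0
    for key in shared:
        b, c = baseline[key], candidate[key]
        if len(b) != len(c):
            raise ValueError(f"shape mismatch at {key!r}: {len(b)} vs {len(c)}")
        squared += sum((x - y) ** 2 for x, y in zip(b, c))
    return math.sqrt(squared)
```

Keys that appear on only one side never enter the sum, which is why both partial outputs and made-up keys are harmless.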

The shape of the per-step arrays must match the baseline's shape at the same key ((action_dim,), (H, W, 3), etc.) - see the ROS 2 and LeRobot adapter docs for the conventions each encoder follows.

The _outcome sentinel

Four of the five marketing metrics are task outcomes: success, reward_total, collision_count, and time_to_goal_s. By default the runner reads these from the baseline episode's metadata and assumes the candidate matches. That is fine for "did the candidate produce sensible actions?" sweeps and useless for "did the candidate actually solve the task better?" sweeps.

If your policy_callable can compute the outcome at the policy level, return a _outcome key in the last action of the episode:

import numpy as np

def my_policy(obs):
    # is_last_step and run_inference are your own helpers, not SDK functions.
    if is_last_step(obs):
        return {
            "/cmd_vel/linear":  np.array([0.0, 0.0, 0.0]),
            "_outcome": {
                "success":         True,
                "reward_total":    14.1,
                "collision_count": 0,
                "time_to_goal_s":  16.3,
            },
        }
    return run_inference(obs)

The runner pulls those values into the candidate columns of the metric blob - success_candidate, reward_total_candidate, etc. - and the portal DiffCard renders the delta against the baseline.

Keys that aren't in _outcome fall through to the baseline's value (so the delta reads 0). The sentinel is opt-in: omit it and the run still completes; you just see no movement on those metrics.
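
The fall-through rule can be sketched as a per-key merge. The column names collision_count_candidate and time_to_goal_s_candidate are extrapolated from the two names given above, and candidate_columns is a hypothetical helper rather than the runner's implementation.

```python
from typing import Any, Mapping

OUTCOME_KEYS = ("success", "reward_total", "collision_count", "time_to_goal_s")


def candidate_columns(
    baseline: Mapping[str, Any], last_action: Mapping[str, Any]
) -> dict[str, Any]:
    """Build the *_candidate columns, falling back to the baseline per key."""
    outcome = last_action.get("_outcome") or {}
    return {
        # A key missing from _outcome inherits the baseline value, so its delta is 0.
        f"{key}_candidate": outcome.get(key, baseline.get(key))
        for key in OUTCOME_KEYS
    }
```

With a partial _outcome such as {"success": True}, only success_candidate differs from the baseline; the other three columns mirror it.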

CLI: robotrace replay run

The CLI verb is robotrace.evals plus argparse plus pretty progress output. Same semantics, one fewer Python file to babysit.

$ robotrace replay run \
    --policy my_pkg.policies:v4_rc1 \
    --candidate-version pap-v4.0.0-rc1 \
    --baseline-episodes ep-d3e1,ep-9a4f,ep-b21c \
    --baseline-version pap-v3.2.1 \
    --name "nightly v4 sweep"
Replay run: candidate=pap-v4.0.0-rc1 baseline=pap-v3.2.1 episodes=3
 Eval run created: a17c9d2e-…
  Portal: https://app.robotrace.dev/portal/evals/a17c9d2e-…
 
 [1/3] ep-d3e1 → c1f2a4b8
 [2/3] ep-9a4f → 7e3d1f2c
 [3/3] ep-b21c → 4a9bc7d1
 
success_rate     baseline 0.67  candidate 0.83  Δ +0.17
reward_mean      baseline 11.1  candidate 16.4  Δ +5.30
collision_rate   baseline 1.67  candidate 0.33  Δ −1.34
time_to_goal_s   baseline 23.5  candidate 19.1  Δ −4.40
ood_action_share baseline -     candidate 0.06
 
Recommend: ship (4/4 metrics better).

Flags

| Flag | Required | What |
|---|---|---|
| --policy MODULE:FN | yes | Importable path to your policy callable |
| --candidate-version V | yes | Stable identifier for the candidate (e.g. pap-v4.0.0-rc1) |
| --baseline-episodes | yes | Comma-list of ids, or @path/to/ids.txt (one id per line) |
| --baseline-version V | no | Stamps eval_runs.baseline_policy_version |
| --name STR | no | Human label, falls back to candidate version |
| --dry-run | no | Skip the upload - only fetches + computes locally |
| --profile NAME | no | Pick a non-default profile from ~/.robotrace/credentials |

--policy module:fn

Same convention as gunicorn - package.subpackage.module:attr, where attr can be dotted (pkg.mod:cls.method). The module has to be importable from the directory where you run robotrace, so make sure PYTHONPATH (or your project's editable install) covers it.
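
For readers unfamiliar with the convention, here is a sketch of how a MODULE:FN string with a dotted attribute can be resolved using only the standard library. resolve_policy is a hypothetical name; the CLI's own resolver may differ in its error handling.

```python
import importlib
from functools import reduce
from typing import Any


def resolve_policy(spec: str) -> Any:
    """Resolve "package.module:attr.subattr" to the callable it names."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected MODULE:FN, got {spec!r}")
    module = importlib.import_module(module_name)
    # Walk dotted attributes left to right, e.g. "cls.method".
    target = reduce(getattr, attr_path.split("."), module)
    if not callable(target):
        raise TypeError(f"{spec!r} resolved to a non-callable {type(target).__name__}")
    return target
```

An ImportError from import_module here usually means the working directory or PYTHONPATH does not cover the package.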

--baseline-episodes

Two forms:

# Inline, comma-separated
--baseline-episodes ep-d3e1,ep-9a4f,ep-b21c
 
# From a file (one id per line, blank lines ignored)
--baseline-episodes @nightly-2026-05-19.txt

The file form is what you want in CI - generate the list with a short psql or portal-export script and feed it in.
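
The two forms can be sketched as one small parser. parse_baseline_episodes is a hypothetical name; trimming whitespace and rejecting an empty list are assumptions in keeping with the "at least one" rule on create_run.

```python
from pathlib import Path


def parse_baseline_episodes(value: str) -> list[str]:
    """Expand "a,b,c" or "@path/to/ids.txt" into a list of episode ids."""
    if value.startswith("@"):
        # File form: one id per line.
        lines = Path(value[1:]).read_text(encoding="utf-8").splitlines()
        ids = [line.strip() for line in lines]
    else:
        # Inline form: comma-separated.
        ids = [part.strip() for part in value.split(",")]
    ids = [episode_id for episode_id in ids if episode_id]  # drop blank entries
    if not ids:
        raise ValueError("at least one baseline episode id is required")
    return ids
```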

Reading the rollup

complete_run(run) returns the same body the POST /api/ingest/eval-run/<id>/finalize route writes. The useful part is summary:

{
  "success_rate":     {"baseline": 0.67, "candidate": 0.83, "delta": 0.17,  "delta_is_better": true},
  "reward_mean":      {"baseline": 11.1, "candidate": 16.4, "delta": 5.30,  "delta_is_better": true},
  "collision_rate":   {"baseline": 1.67, "candidate": 0.33, "delta": -1.34, "delta_is_better": true},
  "time_to_goal_s":   {"baseline": 23.5, "candidate": 19.1, "delta": -4.40, "delta_is_better": true},
  "ood_action_share": {"baseline": null, "candidate": 0.06, "delta": null,  "delta_is_better": null},
  "better_count": 4,
  "metric_total": 4,
  "recommend": "ship"
}

delta_is_better is sign-aware: success_rate going up is good, collision_rate going down is good - the server-side rollup already knows which direction is which, so your CI script doesn't have to.
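
Because the direction is already encoded, a CI gate can stay metric-agnostic. The sketch below lists every metric whose delta_is_better is false, skipping null entries and the scalar fields; regressions is a hypothetical helper, not an SDK function.

```python
from collections.abc import Mapping
from typing import Any


def regressions(summary: Mapping[str, Any]) -> list[str]:
    """Names of metrics whose delta moved in the wrong direction."""
    return [
        name
        for name, entry in summary.items()
        # Scalar fields (better_count, recommend, ...) are not metric entries,
        # and a null delta_is_better means there is nothing to compare.
        if isinstance(entry, Mapping) and entry.get("delta_is_better") is False
    ]
```

A script could then exit non-zero whenever regressions(out["summary"]) is non-empty, which is stricter than following recommend alone.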

recommend is "ship" when a clear majority of metrics moved in the right direction and "hold" otherwise. Conservative on purpose: robotics teams don't ship on a coin flip.

Errors

robotrace.evals raises the same typed exceptions as the rest of the SDK. See Errors for the full hierarchy.

| Exception | When |
|---|---|
| ConfigurationError | Missing candidate_policy_version, empty baseline list |
| AuthError | API key bad or revoked |
| NotFoundError | A baseline episode id doesn't exist (or isn't yours) |
| ValidationError | Server rejected the metric payload (out-of-shape, etc.) |
| TransportError | Network failure during artifact download or result POST |

Errors inside policy_callable are caught per-episode and recorded as status="failed" rows on the result table - they don't propagate out of run_against. The traceback is still printed to stderr for local debugging.

What we explicitly don't ship (V0)

The harness is intentionally narrow in V0 - see the project canvas for what graduates in V1:

  • Webhooks (eval_run.completed event) - V1, Team tier.
  • Hosted runner - Phase 2+. The schema already has the runner_kind column to accommodate one without a migration.
  • Cross-run trendlines (v13 vs v14 vs v15) - V1.
  • CI-triggered regressions - V1.

Don'ts

  • Don't rely on the _outcome sentinel for safety-critical shipping decisions without also computing the metric out-of-band: the runner trusts what your policy returns, by design.
  • Don't call run_against from inside an asyncio loop. The runner is sync; wrap the whole call in asyncio.to_thread(...) if you must.
  • Don't put weights or trade secrets in metadata={...}. Stick to hashes and CI URLs - the metadata blob is visible to anyone in your org who can see the portal.
  • Don't delete a baseline episode while a client runner still has in-flight uploads for that eval run, unless you accept losing the matching eval_results rows. Deleting an episode cascades to every eval_results row that references it through baseline_episode_id; this keeps the portal consistent but can drop per-episode rows you expected to fill. Let the runner finish or cancel cleanly first.
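
For the asyncio point in the list above, a minimal sketch of the asyncio.to_thread wrapping (Python 3.9+). blocking_replay is a stand-in for the synchronous run_against call, so the example runs without the SDK.

```python
import asyncio
import time


def blocking_replay(n_episodes: int) -> list[str]:
    """Stand-in for a synchronous rt.evals.run_against(...) call."""
    time.sleep(0.05)  # simulate blocking download + inference work
    return [f"result-{i}" for i in range(n_episodes)]


async def main() -> list[str]:
    # The blocking call runs in a worker thread, so the event loop stays
    # free to service other coroutines while the replay proceeds.
    return await asyncio.to_thread(blocking_replay, 3)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Wrapping the whole call, rather than individual steps, keeps the runner's internal ordering untouched.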