Errors

Every error the SDK raises inherits from robotrace.RobotraceError. Catch by type, not by parsing message strings - the messages are human-readable and may change between minor versions. The types are stable and follow the same "sacred contract" rule as log_episode.

The hierarchy

RobotraceError
├── ConfigurationError       # missing api_key / base_url, bad path, etc.
├── TransportError           # network / timeout / DNS / TLS
└── APIError                 # the server responded with an error
    ├── AuthError            # 401 - bad / missing / revoked key
    ├── NotFoundError        # 404 - episode id doesn't exist (or cross-tenant)
    ├── ConflictError        # 409 - episode is archived, etc.
    ├── ValidationError      # 400 - payload didn't match the schema
    ├── RateLimitError       # 429 - quota tripped (carries `retry_after`)
    └── ServerError          # 5xx - flag for retries

APIError and its subclasses carry two extra attributes for debugging:

exc.status_code   # int - the HTTP status the server returned
exc.response_body # parsed JSON body (or raw text on non-JSON 5xx)
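A minimal sketch of how these two attributes might be folded into one log line. `describe_api_error` is a hypothetical helper, not part of the SDK, and `FakeAPIError` is a stand-in that only mimics the two attributes above so the snippet runs without the SDK installed; in real code you would pass the caught APIError instead.

```python
def describe_api_error(exc):
    """Build a one-line summary from an APIError-like exception.

    Hypothetical helper: relies only on the `status_code` and
    `response_body` attributes described above.
    """
    status = getattr(exc, "status_code", None)
    body = getattr(exc, "response_body", None)
    if isinstance(body, dict) and "error" in body:
        # Parsed JSON body: surface the server's `error` field.
        detail = body["error"]
    else:
        # Raw text (e.g. a non-JSON 5xx page): truncate to keep logs short.
        detail = str(body)[:200]
    return f"HTTP {status}: {detail}"


class FakeAPIError(Exception):
    """Stand-in for robotrace.APIError, for illustration only."""

    def __init__(self, message, status_code, response_body):
        super().__init__(message)
        self.status_code = status_code
        self.response_body = response_body


exc = FakeAPIError("bad name", 400, {"error": "name must be ≤ 200 chars"})
print(describe_api_error(exc))
```

Reading the attributes with `getattr` keeps the helper usable from a broad `except RobotraceError` handler, where the exception may not be an APIError at all.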

When you'll see each one

`ConfigurationError`

The SDK is missing a required setting or is misconfigured. The error is raised at the call site, before any request reaches the network. Common cases:

api_key not passed and ROBOTRACE_API_KEY not set
base_url not passed and ROBOTRACE_BASE_URL not set
A path passed to ep.upload(kind, path) doesn't exist
You called ep.upload(kind, ...) on a metadata-only run (one opened with artifacts=[]) - the SDK fails loud rather than silently dropping bytes

from robotrace import ConfigurationError
 
try:
    rt.log_episode(name="oops", video="/missing/file.mp4")
except ConfigurationError as exc:
    print(f"fix your inputs: {exc}")

Don't retry - the inputs need to change first.

`TransportError`

The HTTP request failed before the server could respond: a DNS failure, a TCP reset, a failed TLS handshake, or a timeout. Because the request is not known to have landed, retrying with backoff is generally safe:

from robotrace import TransportError
import time
 
for attempt in range(3):
    try:
        rt.log_episode(...)
        break
    except TransportError:
        if attempt == 2:
            raise
        time.sleep(2 ** attempt)  # sleeps 1 s, then 2 s; the third failure re-raises

The SDK doesn't auto-retry transport errors because what's safe depends on the call: retrying a start_episode after a transport error is fine (the server might create the row twice, but each gets a unique id), whereas retrying an upload PUT against an expired signed URL just wastes bytes.
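If the same loop shows up in several places, it can be factored into a small helper. The sketch below is one possible shape, not something the SDK ships: `retry_with_backoff` and its parameters are hypothetical names, and the demo uses the built-in ConnectionError as a stand-in for TransportError so it runs without the SDK installed. Only wrap calls that are safe to repeat, per the caveat above.

```python
import time


def retry_with_backoff(call, retry_on, attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Run `call()`, retrying on `retry_on` with exponential backoff.

    Hypothetical helper, not part of the SDK. Pass the exception type
    to retry on (e.g. TransportError) as `retry_on`.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))


# Demo: a call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, ConnectionError, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is a design choice that lets tests record the delays without waiting; production code can leave the default.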

`AuthError` (401)

The API key is missing, malformed, or revoked. Don't retry - mint a fresh key from Portal → API keys.

from robotrace import AuthError
 
try:
    rt.log_episode(...)
except AuthError as exc:
    alerts.notify(
        "RoboTrace key needs rotation",
        details=str(exc),
    )
    raise

`NotFoundError` (404)

The episode id doesn't exist, or belongs to a different client. We deliberately make these two cases indistinguishable server-side to avoid a UUID-enumeration oracle.

This won't happen during normal log_episode(...) flow - you only see it if you constructed an Episode from a stale id and tried to finalize it.

`ConflictError` (409)

The request is well-formed but conflicts with current server state. The most common cause: trying to finalize(...) an episode that's already been archived.

Restore the episode from /portal/episodes/<id> (or just start a fresh one) before retrying.

`ValidationError` (400)

The payload didn't pass server-side validation. The server's error field tells you which constraint tripped:

from robotrace import ValidationError
 
try:
    rt.log_episode(name="x" * 500, ...)  # name is capped at 200 chars
except ValidationError as exc:
    print(exc)                # human message
    print(exc.response_body)  # {'error': 'name must be ≤ 200 chars'}

Don't retry without changing the inputs.

`RateLimitError` (429)

The server rejected the request because a quota was tripped - too many uploads from one client over the rate window, ingest-throttle on a specific endpoint, etc. The exception carries a parsed retry_after (integer seconds) sourced from the response's Retry-After header. None means the server didn't send one.

from robotrace import RateLimitError
import time
 
try:
    rt.log_episode(...)
except RateLimitError as exc:
    # `exc.retry_after` is the server's recommended wait, or None.
    wait = exc.retry_after or 30
    time.sleep(wait)
    rt.log_episode(...)  # try again

The SDK already auto-retries for you on the call sites where re-issuing the same request can never cause a duplicate row or a double-billing event:

Call	Auto-retries on 429?
`Client.start_episode(...)` (create)	yes (up to 4 total attempts)
`Episode.upload(kind, path)`	yes (signed PUT is idempotent)
`rt.evals.create_run(...)`	yes
`rt.evals.run_against(...)` per-result push	yes (server upserts)
`Episode.finalize(...)`	no - see below
`rt.evals.complete_run(...)`	no - same reason

Each retried call honors Retry-After when present (capped at 30 seconds so a misconfigured server can't pin a robot rig) and falls back to exponential backoff (1s, 2s, 4s) otherwise.
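The delay rule described above can be written out as a small function. This is an illustrative sketch of the documented behavior only, not the SDK's actual implementation; `next_delay` and its parameter names are hypothetical, and whether the SDK also caps the exponential fallback is not stated here, so the sketch leaves it uncapped.

```python
def next_delay(attempt, retry_after=None, cap=30):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical illustration of the rule above: honor Retry-After,
    capped at `cap` seconds; otherwise back off 1s, 2s, 4s.
    """
    if retry_after is not None:
        # The server asked for a specific wait: honor it, up to the cap.
        return min(retry_after, cap)
    # No Retry-After header: exponential backoff.
    return 2 ** attempt


# Three retries without a header, then two with one.
print([next_delay(i) for i in range(3)])
print(next_delay(0, retry_after=120), next_delay(0, retry_after=5))
```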

finalize and complete_run deliberately do not auto-retry - the server may have processed the mutation before the 429 was sent back, and silently re-finalizing in a future paid tier could double-bill artifact storage. Catch RateLimitError at the call site, sleep for exc.retry_after or 30 seconds, then retry yourself.
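A hedged sketch of that manual retry, bounded so a persistent 429 eventually surfaces. `call_after_rate_limit` is a hypothetical helper, and `FakeRateLimitError` and `fake_finalize` are stand-ins for RateLimitError and a real `ep.finalize(...)` call so the snippet runs without the SDK installed.

```python
import time


def call_after_rate_limit(call, rate_limit_error, max_attempts=3,
                          default_wait=30, sleep=time.sleep):
    """Run `call()`, waiting and retrying when it is rate limited.

    Hypothetical helper, not part of the SDK. `rate_limit_error` is
    the exception type to catch (RateLimitError in real code).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except rate_limit_error as exc:
            if attempt == max_attempts - 1:
                raise  # still limited after the last attempt
            # Server's recommended wait, or the default when absent.
            sleep(getattr(exc, "retry_after", None) or default_wait)


class FakeRateLimitError(Exception):
    """Stand-in for robotrace.RateLimitError, for illustration only."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


attempts = {"n": 0}


def fake_finalize():
    # Simulates a finalize call that is rate limited once.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise FakeRateLimitError(retry_after=2)
    return "finalized"


waits = []  # record the waits instead of actually sleeping
result = call_after_rate_limit(fake_finalize, FakeRateLimitError,
                               sleep=waits.append)
print(result, waits)
```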

`ServerError` (5xx)

Something blew up on the server side - database hiccup, R2 signing failed, etc. Worth retrying with exponential backoff. The SDK deliberately does not auto-retry 5xx responses, because retrying a finalize twice could double-bill artifact storage in future paid tiers.

from robotrace import ServerError
import time
 
for attempt in range(5):
    try:
        rt.log_episode(...)
        break
    except ServerError:
        if attempt == 4:
            raise
        time.sleep(2 ** attempt)  # sleeps 1, 2, 4, 8 s; the fifth failure re-raises

If ServerError persists past a few retries, check status.robotrace.dev (Phase 2) or ping us - there's likely an incident.

Catch-all pattern

For training scripts where you want one alert path for any RoboTrace problem without distinguishing types:

from robotrace import RobotraceError
 
try:
    rt.log_episode(...)
except RobotraceError as exc:
    # Anything from the SDK - auth, config, network, server.
    # User code bugs (TypeError, ValueError) still propagate.
    sentry_sdk.capture_exception(exc)
    raise

RobotraceError deliberately does not inherit from OSError / IOError - we don't want a blanket except OSError: in your training loop to silently eat our errors and leave you wondering why nothing's showing up in the portal.
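When a single handler has to choose between retrying and failing fast, the per-type guidance on this page can be collapsed into one check. The sketch below is an assumption-laden illustration: the classes are local stand-ins that mirror part of the hierarchy so the snippet runs without the SDK installed, and `is_retryable` is a hypothetical helper, not an SDK function.

```python
# Stand-ins mirroring part of the hierarchy above, for illustration
# only. In real code, import these names from robotrace instead.
class RobotraceError(Exception):
    pass


class ConfigurationError(RobotraceError):
    pass


class TransportError(RobotraceError):
    pass


class APIError(RobotraceError):
    pass


class AuthError(APIError):
    pass


class RateLimitError(APIError):
    pass


class ServerError(APIError):
    pass


# Types this page describes as worth retrying after a wait.
RETRYABLE = (TransportError, RateLimitError, ServerError)


def is_retryable(exc):
    """True if waiting and retrying can help; False if inputs must change."""
    return isinstance(exc, RETRYABLE)


for exc in (ServerError(), AuthError(), ConfigurationError()):
    print(type(exc).__name__, is_retryable(exc))
```

Because `isinstance` follows inheritance, any future subclass of a retryable type would be covered without changing the tuple.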

Server vs SDK redaction

The SDK never logs:

The value of your API key
The body of an ingest request (which can carry trade secrets)
Signed PUT URLs (they expire fast but still)

The server side follows the same rule - ingest payloads and key material are never written to logs. If you find an exception message that leaks any of the above, it's a bug - please report it.
