Object storage

Episode artifacts (video, sensors, actions) live in Cloudflare R2, an S3-compatible object store. Heavy bytes never touch the RoboTrace origin server — the SDK uploads them directly to R2 using short-lived signed PUT URLs.

Why R2

R2 charges $0 egress vs. S3's ~$0.09/GB. For a product where every episode page replays multi-GB videos to engineers' browsers, that single difference pays for the rest of the infra many times over. Storage and per-operation pricing are also slightly cheaper than S3's:

             R2                                          S3 (Standard)
Storage      $0.015 / GB-month                           $0.023 / GB-month
Egress       $0                                          ~$0.09 / GB
Free tier    10 GB + 1M Class A + 10M Class B / month,   5 GB for 12 months
             forever

R2 speaks the S3 wire protocol, so any S3-compatible tool — including our @aws-sdk/client-s3 and the Python SDK — works against it by swapping the endpoint.
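The endpoint swap can be sketched like this — a minimal helper that builds S3-client keyword arguments pointed at R2, assuming boto3-style kwargs and the `R2_*` env vars described below (the helper name is ours, not part of the SDK):

```python
import os

def r2_client_kwargs() -> dict:
    """Build S3-client kwargs that point at R2 instead of AWS.

    R2's S3-compatible endpoint is derived from the account id;
    everything else is standard S3 client configuration.
    """
    account_id = os.environ["R2_ACCOUNT_ID"]
    return {
        "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com",
        "aws_access_key_id": os.environ["R2_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["R2_SECRET_ACCESS_KEY"],
        "region_name": "auto",  # R2 does not use AWS regions; "auto" is the convention
    }

# e.g.: boto3.client("s3", **r2_client_kwargs())
```

The same kwargs work for any S3-compatible client; only `endpoint_url` differs from a stock AWS setup.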

Dev-optional, prod-required

R2 is deliberately optional in development:

  • Without R2 (env vars blank) — the ingest endpoint returns storage: "unconfigured" and an empty upload_urls array. The SDK can still test the metadata path end-to-end. Episodes show up in /admin/episodes with no playable artifacts.
  • With R2 (all four env vars set) — the ingest endpoint mints signed PUT URLs and the SDK streams files straight to the bucket.

The Python SDK exposes the mode on Episode.storage so your training scripts can bail loud if they expected R2 and didn't get it.
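A training script can guard on that field like so — the `storage` field is the SDK's, but the guard function itself is an illustrative sketch:

```python
def require_r2(episode) -> None:
    """Fail fast if the deployment had no object storage configured.

    An episode ingested with storage == "unconfigured" has metadata
    only — its artifacts were never uploaded anywhere.
    """
    if episode.storage == "unconfigured":
        raise RuntimeError(
            "Episode was ingested without R2 — artifacts were never uploaded. "
            "Set the four R2_* env vars on the server and re-ingest."
        )
```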

Before pointing real users at any deployment, walk through the production setup checklist. Without R2 the product literally has no way to store the actual episode data — it'd just be a metadata browser.

Required env vars

R2_ACCOUNT_ID=…             # from Cloudflare → R2 sidebar
R2_ACCESS_KEY_ID=…          # from R2 → API Tokens → Create
R2_SECRET_ACCESS_KEY=…      # shown once at token creation
R2_BUCKET_EPISODES=…        # bucket name; we recommend "robotrace-episodes"
R2_PUBLIC_URL=              # optional; set when you connect a custom domain

Full Cloudflare clickpath in docs/PRODUCTION-SETUP.md → §1.
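A minimal startup check over those variables (names from the list above; the required/optional split mirrors this page — `R2_PUBLIC_URL` is deliberately excluded):

```python
import os

REQUIRED_R2_VARS = (
    "R2_ACCOUNT_ID",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_BUCKET_EPISODES",
)

def r2_configured() -> bool:
    """True only when all four required R2 env vars are set and non-empty."""
    return all(os.environ.get(name) for name in REQUIRED_R2_VARS)
```

This is the same all-or-nothing condition the ingest endpoint uses to decide between `storage: "unconfigured"` and minting signed URLs.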

Bucket layout

Objects are keyed by episodes/<client_id>/<episode_id>/<file>:

episodes/
└── 8a4f01c2-…/                     # client id
    └── e8a4f01c-2b39-…/             # episode id
        ├── video.mp4
        ├── sensors.bin
        └── actions.parquet
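Key construction is a plain string join, which is what lets the admin UI rebuild keys without a database lookup (a sketch; the id values in the test are illustrative):

```python
def episode_key(client_id: str, episode_id: str, filename: str) -> str:
    """Build the canonical R2 object key for an episode artifact.

    Layout: episodes/<client_id>/<episode_id>/<file>
    """
    return f"episodes/{client_id}/{episode_id}/{filename}"
```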

This layout means:

  • A single client's data is a single prefix → easy to lifecycle, audit, or hard-delete.
  • Object names are predictable so the admin UI can construct fresh signed read URLs without a database lookup.
  • Filename always ends in the canonical extension expected for the artifact kind (helps when a CDN or browser sniffs MIME types).

Signed URL TTL

PUT URLs are valid for 30 minutes after they're minted. That window is:

  • Long enough for a slow uplink to push a multi-GB video.
  • Short enough that a leaked URL isn't a long-lived credential.

If your upload exceeds 30 minutes, the SDK currently re-calls POST /api/ingest/episode to mint fresh URLs (which creates a new episode row, today). A "regenerate URLs for an existing episode" endpoint is on the 0.2 roadmap.
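Client-side, a check like this can decide whether to reuse a minted URL or request fresh ones before starting a large PUT (a sketch, not the SDK's internals; `minted_at` is assumed to be a Unix timestamp recorded when the ingest call returned):

```python
import time
from typing import Optional

SIGNED_URL_TTL_S = 30 * 60  # PUT URLs expire 30 minutes after minting

def url_expired(minted_at: float, now: Optional[float] = None,
                safety_margin_s: float = 60.0) -> bool:
    """True if a signed URL minted at `minted_at` should be treated as dead.

    The safety margin leaves time for the PUT to actually begin before
    the URL lapses mid-handshake.
    """
    now = time.time() if now is None else now
    return now - minted_at > SIGNED_URL_TTL_S - safety_margin_s
```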

Content-Type matters

Each PUT URL is signed with a specific Content-Type. The PUT must match or R2 returns 403. The SDK handles this for you; for raw HTTP clients see Ingest API → §2.
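For a raw HTTP client, the essential point is that the PUT carries exactly the Content-Type the URL was signed with. A standard-library sketch (the signed URL and type in the trailing comment are placeholders):

```python
import urllib.request

def build_put(signed_url: str, payload: bytes,
              content_type: str) -> urllib.request.Request:
    """Build a PUT whose Content-Type matches what the URL was signed with.

    A mismatched type makes R2 reject the upload with 403.
    """
    return urllib.request.Request(
        signed_url,
        data=payload,
        method="PUT",
        headers={"Content-Type": content_type},
    )

# e.g.: urllib.request.urlopen(build_put(signed_url, video_bytes, "video/mp4"))
```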

CORS

Phase 1 uploads come from the Python SDK, which doesn't need CORS. When the in-browser upload UI lands (Phase 3+), add a CORS rule on the bucket allowing PUT/GET from your app origin. Example in the production checklist.
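For orientation, such a rule might look like the following — an illustrative S3-style CORS configuration in the shape `put_bucket_cors` accepts; the origin is a placeholder, and the checklist's version is authoritative:

```python
# Illustrative only — replace the origin with your real app origin.
CORS_RULES = {
    "CORSRules": [
        {
            "AllowedOrigins": ["https://app.example.com"],
            "AllowedMethods": ["PUT", "GET"],
            "AllowedHeaders": ["Content-Type"],  # PUTs must send the signed type
            "MaxAgeSeconds": 3600,
        }
    ]
}

# e.g.: s3.put_bucket_cors(Bucket=bucket, CORSConfiguration=CORS_RULES)
```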

Don'ts

  • Don't put episode bytes in Postgres. The DB row holds metadata and a URL; bytes live in R2. This is rule one in AGENTS.md.
  • Don't treat the public URL as authenticated. R2 buckets connected to a custom domain via R2_PUBLIC_URL are publicly readable by URL — that's why bucket keys include the random episode UUID (effectively unguessable). For the upcoming portal, read access will move behind signed GET URLs.
  • Don't hand out the R2_SECRET_ACCESS_KEY to clients or staff. Only the Vercel server runtime needs it — rotate it quarterly per the production checklist.