eval.awe.wtf
A computer-use eval harness for AI agents. Drive one of our sub-sites with a browser, finish, and we grade the result deterministically against a per-eval seed.
How the flow fits together
For every eval you want to run: start a session, drive the sub-site in a browser to satisfy the eval's prompt, then finish to grade. That's the whole orchestration loop.
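The loop above can be sketched as one function. This is a minimal sketch rather than the harness's own code: `start`, `drive`, and `finish` are hypothetical callables you would supply (for example thin wrappers around the HTTP endpoints and your browser agent), and the literal strings "pass" and "fail" are an assumption about how the status is spelled in the response.

```python
from typing import Callable, Dict


def run_eval(
    eval_id: str,
    start: Callable[[str], Dict[str, str]],
    drive: Callable[[str], None],
    finish: Callable[[str], str],
) -> bool:
    """Run one eval end to end and report whether it passed.

    start(eval_id)      -> dict with at least session_key and launch_url
    drive(launch_url)   -> returns once the harness decides the agent is done
    finish(session_key) -> status string (assumed to be "pass" or "fail")
    """
    session = start(eval_id)                  # 1. mint a fresh isolated session
    drive(session["launch_url"])              # 2. the browser agent works the UI
    status = finish(session["session_key"])   # 3. trigger server-side grading
    return status == "pass"
```

Passing the three steps in as callables keeps the loop independent of any particular HTTP client or browser driver, and makes it easy to exercise with fakes.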
Sessions are completely isolated — each start mints a
fresh seeded world backed by its own Durable Object. Nothing is
shared between sessions, so you can fan out and run as many in
parallel as you have browsers for.
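Because sessions share nothing, fanning out can be as simple as a thread pool sized to the number of browsers available. A sketch under stated assumptions: `run_one` is a hypothetical callable that runs a single eval in its own session and returns whether it passed, and `max_browsers` is a name chosen here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def run_many(
    eval_ids: Iterable[str],
    run_one: Callable[[str], bool],
    max_browsers: int = 4,
) -> List[Tuple[str, bool]]:
    """Run each eval in its own session, at most max_browsers at a time."""
    ids = list(eval_ids)
    with ThreadPoolExecutor(max_workers=max_browsers) as pool:
        # pool.map preserves input order, so results line up with ids.
        results = list(pool.map(run_one, ids))
    return list(zip(ids, results))
```

Returning pairs rather than a dict keeps repeated runs of the same eval id distinct.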
The server has no notion of "the agent is done." Your harness
decides when to grade (timeout, max actions, agent signal — your
call) and triggers it by hitting finish.
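One way to express that decision is a small stop policy combining the three triggers mentioned above. The class and field names below are hypothetical; the server knows nothing about them.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StopPolicy:
    max_seconds: float  # wall-clock budget for one run
    max_actions: int    # cap on browser actions taken by the agent

    def should_finish(
        self,
        started_at: float,
        actions_taken: int,
        agent_done: bool,
        now: Optional[float] = None,
    ) -> bool:
        """True once any trigger fires: agent signal, action cap, or timeout."""
        now = time.monotonic() if now is None else now
        return (
            agent_done
            or actions_taken >= self.max_actions
            or now - started_at >= self.max_seconds
        )
```

The harness would check `should_finish` after each agent action and call finish as soon as it returns True.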
Grading is server-side and deterministic. The response is only
pass or fail — never the grader's
reasoning, a diff against expected state, or partial credit. If
you need to debug a failure, start a new session for the same eval
and watch the UI during the run.
Step 1 — Discover the eval catalog
List the sub-sites, then list the evals on each one. Eval ids
(todo.complete-today, cal.three-blocks, …) are
stable handles you'll use when starting a session. No auth required.
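A minimal discovery sketch using only the standard library. Two things here are assumptions: that the API is served from https://eval.awe.wtf, and that "todo" is a valid sub-site name (inferred from the todo.complete-today eval id). The JSON shape of the listings is not specified in this document, so `get_json` simply returns whatever the server sends.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://eval.awe.wtf"  # assumption: API host matches the site name


def sub_sites_url(base_url: str = BASE_URL) -> str:
    """URL that lists all sub-sites."""
    return f"{base_url}/api/v1/sub-sites"


def evals_url(sub_site: str, base_url: str = BASE_URL) -> str:
    """URL that lists the evals (with prompts) for one sub-site."""
    return f"{base_url}/api/v1/sub-sites/{quote(sub_site, safe='')}/evals"


def get_json(url: str, timeout: float = 10.0):
    """GET a URL and decode its JSON body. No auth header is sent."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A harness would call `get_json(sub_sites_url())` once, then `get_json(evals_url(name))` for each sub-site it cares about.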
Step 2 — Start a session
Each call creates a fresh isolated world for one eval. The response
gives you a session_key (save it — you'll need it to
grade) and a launch_url containing a single-use
?lk=… token valid for 10 minutes.
One session per eval. Don't reuse a
session_key across evals — every eval needs its own
start call. Sessions are cheap and fully isolated.
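Starting a session might look like the sketch below. The response field names come from the API reference; the base URL and the empty POST body are assumptions, since the document does not say whether start expects a payload.

```python
import json
import urllib.request
from dataclasses import dataclass
from urllib.parse import quote

BASE_URL = "https://eval.awe.wtf"  # assumption: API host matches the site name


@dataclass(frozen=True)
class Session:
    session_key: str  # stays with the harness; needed to grade
    session_id: str
    launch_url: str   # single-use, hand this to the browser
    eval_id: str
    sub_site: str


def build_start_request(eval_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST that mints a fresh session for one eval."""
    url = f"{base_url}/api/v1/evals/{quote(eval_id, safe='')}/start"
    return urllib.request.Request(url, data=b"", method="POST")  # assumed: no body needed


def parse_session(body: bytes) -> Session:
    """Decode the start response into a Session."""
    data = json.loads(body)
    return Session(
        session_key=data["session_key"],
        session_id=data["session_id"],
        launch_url=data["launch_url"],
        eval_id=data["eval_id"],
        sub_site=data["sub_site"],
    )


def start_session(eval_id: str, timeout: float = 10.0) -> Session:
    """Call start over the network and return the parsed Session."""
    with urllib.request.urlopen(build_start_request(eval_id), timeout=timeout) as resp:
        return parse_session(resp.read())
```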
Step 3 — Drive the eval in a browser
Point your browser-driving agent at launch_url. On first
load the server exchanges the ?lk= token for an
HttpOnly s= session cookie and 302s to the
same URL without the token — so the launch key doesn't sit in history
or referer headers. From there, drive the sub-site UI like a human:
click buttons, fill forms, submit. The prompt the agent must satisfy
is the prompt field returned for that eval in the Step 1 listing.
Launch URLs are single-use. Once exchanged, hitting them again returns 401. Open them once in the agent's browser; don't re-open or share them across tabs.
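Two small harness-side helpers can guard against these pitfalls. This is a sketch under stated assumptions: `url_after_exchange` predicts where the browser should land after the redirect by dropping the lk parameter, and `LaunchTicket` is a hypothetical bookkeeping class that refuses to hand out a launch URL twice or after 10 minutes. The ticket measures age from when the harness received the URL, which only approximates the server's mint time.

```python
import time
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

LAUNCH_TTL_SECONDS = 10 * 60  # from the text: the ?lk= token is valid for 10 minutes


def url_after_exchange(launch_url: str) -> str:
    """Return launch_url with the lk query parameter removed."""
    parts = urlsplit(launch_url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != "lk"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


@dataclass
class LaunchTicket:
    launch_url: str
    minted_at: float = field(default_factory=time.monotonic)
    opened: bool = False

    def claim(self, now: Optional[float] = None) -> str:
        """Hand out the URL exactly once, and only while it is still fresh."""
        now = time.monotonic() if now is None else now
        if self.opened:
            raise RuntimeError("launch URL already opened; start a new session")
        if now - self.minted_at > LAUNCH_TTL_SECONDS:
            raise RuntimeError("launch URL expired; start a new session")
        self.opened = True
        return self.launch_url
```

Failing locally before the browser opens a dead URL avoids spending an agent run on a 401.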
Key separation. The session_key
stays with your harness; only the resulting
HttpOnly cookie ever lives in the browser. The
agent driving the UI never needs to know — or see — the key.
Step 4 — Finish & grade
When the agent is done, POST to /api/v1/sessions/finish
with the session_key as a Bearer token. The server runs
the eval's deterministic grader against final state and returns
pass or fail. Idempotent: repeat calls return
the cached status without re-grading.
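The finish call can be sketched as follows. The endpoint, the Bearer scheme, and the status field come from this document; the base URL, the empty POST body, and the literal strings "pass" and "fail" are assumptions.

```python
import json
import urllib.request

BASE_URL = "https://eval.awe.wtf"  # assumption: API host matches the site name


def build_finish_request(session_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST that triggers grading for one session."""
    return urllib.request.Request(
        f"{base_url}/api/v1/sessions/finish",
        data=b"",  # assumed: no body needed
        method="POST",
        headers={"Authorization": f"Bearer {session_key}"},
    )


def parse_passed(body: bytes) -> bool:
    """Map the finish response to a boolean, rejecting unexpected statuses."""
    status = json.loads(body)["status"]
    if status not in ("pass", "fail"):  # assumed spellings
        raise ValueError(f"unexpected status: {status!r}")
    return status == "pass"


def finish_session(session_key: str, timeout: float = 30.0) -> bool:
    """Call finish over the network and return True on pass."""
    with urllib.request.urlopen(build_finish_request(session_key), timeout=timeout) as resp:
        return parse_passed(resp.read())
```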
Errors and retry behavior
- start returns 404 if the eval_id is unknown: a typo, or an eval that's been removed from the registry.
- Hitting a launch_url twice, or more than 10 minutes after it was minted, returns 401. There is no recovery for a stale launch URL; just call start again to mint a fresh session and URL.
- finish returns 401 on an unknown session_key.
- finish is idempotent. Safe to retry on a network error. Re-calling it on an already-graded session returns the cached status without re-running the grader.
- After finish, any further /ui/* mutation in the browser is rejected with 409. A confused agent that keeps clicking after grading cannot corrupt the result.
- All errors share the shape {"error": {"code": "…", "message": "…"}}. The code values are AUTH_INVALID, NOT_FOUND, and INTERNAL.
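The rules above suggest two harness-side helpers: one that decodes the shared error shape, and one that retries finish only when the network failed. Treat this as a sketch: `ApiError`, `parse_error`, and `finish_with_retry` are hypothetical names, `send` is any callable that performs the finish request, and "UNKNOWN" is a local placeholder for bodies that do not match the documented shape, not a code the server returns.

```python
import json
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """An error response decoded from {"error": {"code": ..., "message": ...}}."""

    def __init__(self, http_status: int, code: str, message: str) -> None:
        super().__init__(f"{http_status} {code}: {message}")
        self.http_status = http_status
        self.code = code
        self.message = message


def parse_error(http_status: int, body: bytes) -> ApiError:
    """Decode an error body, falling back to a placeholder code if malformed."""
    try:
        err = json.loads(body)["error"]
        return ApiError(http_status, err["code"], err["message"])
    except (ValueError, KeyError, TypeError):
        return ApiError(http_status, "UNKNOWN", body.decode("utf-8", "replace"))


def finish_with_retry(
    send: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry send() on network errors only; finish is idempotent, so this is safe."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError:
            raise  # the server answered (e.g. 401); retrying will not help
        except OSError:  # URLError, timeouts, and connection resets land here
            if attempt == attempts:
                raise
            sleep(delay_s)
    raise AssertionError("unreachable")
```

The same retry wrapper should not be applied to start, since each successful start call mints a new session.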
Reference — full API surface
- GET /api/v1/sub-sites → list of sub-sites
- GET /api/v1/sub-sites/{sub-site}/evals → list of evals with prompts
- POST /api/v1/evals/{eval_id}/start → {session_key, session_id, launch_url, eval_id, sub_site}
- POST /api/v1/sessions/finish (Bearer SESSION_KEY) → {session_id, eval_id, sub_site, status}, idempotent