eval.awe.wtf
A computer-use eval harness for AI agents. Drive one of our sub-sites with a browser; when you finish, we grade the result deterministically against a per-eval seed.
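As a rough illustration of what deterministic grading against a per-eval seed could look like, here is a minimal sketch. It is not the harness's actual code: the function names (`expected_state`, `grade`), the state fields (`item_id`, `quantity`), and the hashing scheme are all assumptions made for the example. The idea is only that the same eval and seed always produce the same target, so the same final state always gets the same grade.

```python
import hashlib
import random


def expected_state(eval_id: str, seed: int) -> dict:
    """Derive the target state for an eval from its seed (hypothetical scheme)."""
    # Hash the eval id together with the seed so each eval gets its own
    # reproducible random stream, independent of process or run order.
    digest = hashlib.sha256(f"{eval_id}:{seed}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Hypothetical fields; a real sub-site would define its own target.
    return {
        "item_id": rng.randrange(1000),
        "quantity": rng.randint(1, 5),
    }


def grade(eval_id: str, seed: int, final_state: dict) -> str:
    """Return "pass" if the final state matches the seeded target, else "fail"."""
    return "pass" if final_state == expected_state(eval_id, seed) else "fail"
```

Because the target depends only on the eval id and the seed, regrading the same session gives the same verdict, with no judgment call or model in the loop.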
Pick an eval to start
Click Start on any eval to spin up a fresh session and drop into the sub-site.
0 pass
0 fail
0 active
24 untouched
24 total