The agent evaluation crisis is becoming impossible to ignore: three separate research teams recently published frameworks arguing that current agent benchmarks systematically mispredict real-world per