ByteDance Seed releases EdgeBench to evaluate AI learning over 12-hour tasks


Most AI benchmarks test what a model already knows. ByteDance Seed just released one that tests whether a model can get smarter while it works.

EdgeBench, released on July 2 by ByteDance’s AI research division, is a new evaluation framework built around 134 long-duration tasks that require AI agents to operate continuously for anywhere from 12 to more than 72 hours. The point isn’t to measure how much an AI knows on day one. It’s to measure how much better it gets by hour 50.

What EdgeBench actually measures

The benchmark spans six domains: scientific and machine learning problems, systems engineering, combinatorial optimization, professional knowledge work, formal mathematics, and interactive games. Human experts needed an average of 57.2 hours to complete each task, and the most demanding tasks required up to 320 hours of effort.

The research team analyzed nearly 38,000 hours of agent-environment interaction across multiple frontier models, including Claude Opus 4.8 and GPT-5.5. It found that agent performance over these extended sessions follows a log-sigmoid scaling relationship with a coefficient of determination (R²) of 0.998. In other words, AI learning curves on long tasks are remarkably predictable, not chaotic.
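
For readers unfamiliar with the statistic, R² compares a fit's squared errors against the overall spread of the data, so a value near 1 means the curve explains almost all of the variation. A minimal Python sketch of that calculation follows. The scores in it are invented for illustration and are not EdgeBench data, and the function name is our own.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: how far the fit is from each observation.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: how far each observation is from the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical scores, invented for illustration only (not EdgeBench data).
observed = [0.10, 0.25, 0.45, 0.60, 0.70]
predicted = [0.11, 0.24, 0.46, 0.59, 0.70]
print(round(r_squared(observed, predicted), 4))
```

A fit that only ever predicts the mean of the data scores 0 on this measure, which is why a reported 0.998 is notable.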

The learning speed of frontier agents has been doubling roughly every three months, based on model releases between September 2025 and May 2026.
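
To put the doubling claim in concrete terms, the short Python sketch below compounds a three-month doubling time over the September 2025 to May 2026 window the article cites. It assumes the doubling is smooth and exponential across the window, which the article does not state, and the function names are ours.

```python
from datetime import date

DOUBLING_MONTHS = 3  # "roughly every three months", per the article

def months_between(start, end):
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def speedup(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative gain after `months`, assuming smooth exponential doubling."""
    return 2 ** (months / doubling_months)

# The article gives only months, so the day-of-month values are placeholders.
window = months_between(date(2025, 9, 1), date(2026, 5, 1))
print(f"{window} months -> about {speedup(window):.1f}x")  # 8 months -> about 6.3x
```

Under that assumption, learning speed at the end of the window would be a little more than six times what it was at the start. Extending the same rate beyond May 2026 would be an extrapolation, not a finding.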

Why this matters beyond the lab

EdgeBench is ByteDance Seed’s answer to the gap between testing what a model already knows and testing whether it improves as it works, and it’s part of the group’s broader “Seed Edge” initiative focused on general intelligence research. The benchmark doesn’t just grade final answers. It tracks the trajectory of improvement, essentially measuring whether an agent develops something resembling on-the-job competence.

The log-sigmoid scaling law suggests there’s a predictable curve to this learning, which could help developers estimate when an agent will hit diminishing returns on a given task.
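
The article does not give the exact equation, so the Python sketch below assumes one plausible reading of "log-sigmoid": a sigmoid applied to the logarithm of elapsed hours, scaled by a performance ceiling. Every parameter value and function name here is hypothetical. The sketch only illustrates how a developer might use such a curve to spot diminishing returns.

```python
import math

def log_sigmoid_score(hours, ceiling, midpoint_hours, steepness):
    """Assumed form: a sigmoid in log-time, scaled by a performance ceiling."""
    z = steepness * (math.log(hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))

def hours_until_diminishing(ceiling, midpoint_hours, steepness,
                            min_gain_per_doubling=0.01, max_hours=1024):
    """First power-of-two hour mark, past the midpoint, where doubling
    the runtime adds less than `min_gain_per_doubling` to the score."""
    hours = 1
    while hours <= max_hours:
        gain = (log_sigmoid_score(2 * hours, ceiling, midpoint_hours, steepness)
                - log_sigmoid_score(hours, ceiling, midpoint_hours, steepness))
        # Gains are also small early on, so only check past the midpoint.
        if hours >= midpoint_hours and gain < min_gain_per_doubling:
            return hours
        hours *= 2
    return None  # threshold not reached within max_hours

# Hypothetical parameters, not taken from the EdgeBench paper.
params = dict(ceiling=1.0, midpoint_hours=12, steepness=1.5)
for h in (1, 12, 72):
    print(h, round(log_sigmoid_score(h, **params), 3))
print(hours_until_diminishing(**params))
```

With a curve of this shape, progress is slow at first, fastest around the midpoint, and flattens as the score approaches the ceiling. The second function returns the point where another doubling of runtime stops paying for itself.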

ByteDance Seed has publicly released 51 of the 134 tasks along with the complete evaluation framework. Holding back the remaining 83 tasks likely serves as a hedge against benchmark contamination, the ever-present risk that models get trained specifically to ace a public test rather than genuinely improving their capabilities.

The crypto and AI investment angle

The finding that frontier agents double their learning speed every three months is directly relevant to anyone betting on AI agent infrastructure, whether that’s centralized platforms or decentralized alternatives built on blockchain rails.

If the fit holds beyond the study’s own models and tasks, an R² of 0.998 means developers and investors can model expected agent performance with unusual precision. For AI-crypto projects promising autonomous agent capabilities, EdgeBench offers an objective yardstick that didn’t previously exist.

The benchmark reveals that even frontier models from leading labs still require enormous compute budgets for extended tasks. Nearly 38,000 hours of runtime across the study is a staggering resource commitment. Decentralized compute networks promising cheap, distributed inference will need to reconcile their economics with the reality that serious AI agent work isn’t a quick API call. It’s a multi-day, resource-intensive process that demands reliability and uptime guarantees most decentralized networks haven’t yet proven they can deliver.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
