Getting a large language model to think harder at inference time, a technique called test-time scaling, has become one of the more reliable ways to squeeze better answers out of AI systems. The problem is that designing those “think harder” strategies has traditionally been a manual, intuition-driven slog. Researchers tinker with heuristics, run expensive experiments, and hope they’ve found something close to optimal.
A new framework called AutoTTS, developed by researchers from Meta, Google, the University of Maryland, the University of Virginia, Washington University in St. Louis, and the University of North Carolina, takes humans largely out of that loop. The result: a roughly 69.5% reduction in token usage compared to strong handcrafted baselines, with essentially no loss in accuracy.
How AutoTTS works, and why the numbers matter
AutoTTS replaces that manual process with an agentic loop. The system uses Anthropic’s Claude Code as an explorer agent to autonomously develop, test, and refine inference strategies. Instead of requiring repeated calls to the target LLM during the discovery phase, AutoTTS works from pre-collected reasoning trajectories and probe signals, so candidate strategies can be evaluated offline.
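To make the offline idea concrete, here is a minimal sketch of replaying cached trajectories against a candidate stopping rule. It is not AutoTTS's code: the `Trajectory` type, the `replay` function, and the `stop_when_leader_has` rule are all hypothetical names chosen for illustration, and the cached data is made up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """One pre-collected sample (hypothetical schema, not from the paper)."""
    answer: str  # final answer extracted from the reasoning trace
    tokens: int  # tokens that trace consumed


# A stopping rule inspects the answers seen so far and decides whether to stop.
StopRule = Callable[[List[str]], bool]


def replay(trajs: List[Trajectory], should_stop: StopRule) -> Tuple[str, int]:
    """Replay cached trajectories in order; the target LLM is never called."""
    if not trajs:
        raise ValueError("need at least one cached trajectory")
    seen: List[str] = []
    used = 0
    for t in trajs:
        seen.append(t.answer)
        used += t.tokens
        if should_stop(seen):
            break
    # Majority vote over whatever was consumed before stopping.
    winner = Counter(seen).most_common(1)[0][0]
    return winner, used


def stop_when_leader_has(k: int) -> StopRule:
    """Illustrative rule: stop once any single answer has appeared k times."""
    return lambda seen: Counter(seen).most_common(1)[0][1] >= k


# Made-up cache for one question.
cache = [
    Trajectory("42", 100),
    Trajectory("41", 120),
    Trajectory("42", 90),
    Trajectory("42", 110),
]
answer, tokens = replay(cache, stop_when_leader_has(2))
full_answer, full_tokens = replay(cache, lambda seen: False)  # consume everything
```

Because every candidate rule is scored against the same cache, an explorer agent could try many rules at the cost of a loop over stored data rather than fresh model calls.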
The benchmark comparison tells the story. Against SC@64, a well-known handcrafted baseline, AutoTTS achieved its 69.5% token reduction at a specific operating point (beta approximately 0.5) while matching the baseline’s mean held-out accuracy. The discovered strategies scored an average of 45.3 on held-out accuracy versus 45.2 for the baseline.
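For readers unfamiliar with the baseline, SC@64 refers to self-consistency: sample 64 answers and take a majority vote. The sketch below shows that vote and how a relative token reduction is computed. The helper names and the token counts are assumptions for illustration; only the 45.3 and 45.2 accuracy figures come from the article.

```python
from collections import Counter
from typing import Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote over sampled final answers.

    Ties go to the answer that was seen first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(answers).most_common(1)[0][0]


def token_reduction(baseline_tokens: float, strategy_tokens: float) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1.0 - strategy_tokens / baseline_tokens


# Toy vote over four samples (SC@64 would use 64).
voted = self_consistency(["A", "B", "A", "C"])

# Illustrative token counts only: a strategy using 305 tokens for every
# 1,000 the baseline uses corresponds to a 69.5% reduction.
saved = token_reduction(baseline_tokens=1000, strategy_tokens=305)

# Accuracy figures as reported in the article.
accuracy_delta = 45.3 - 45.2
```

The point of the comparison is that both quantities have to be read together: a token saving only counts if the accuracy delta stays at or above zero.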
Perhaps the most striking detail is the cost. The entire strategy discovery process ran for 160 minutes and cost $39.90.
Generalization and practical applications
The AutoTTS researchers demonstrated that their discovered strategies generalize across different model scales and benchmarks.
The paper, titled “LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling,” was submitted to arXiv on May 8, 2026, and revised on May 12. The authors have also made their code and data available on GitHub.
What this means for crypto and AI-adjacent markets
Token usage is not an abstract concern for AI-powered crypto infrastructure. Every token processed costs money, adds latency, and creates a scaling bottleneck. If these efficiency gains translate to production environments, a 69.5% reduction in token consumption could meaningfully change the economics of running such systems.
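A back-of-the-envelope calculation shows what a reduction of that size would do to a token bill. Every input here (tokens per request, request volume, and price per million tokens) is a hypothetical placeholder, not a figure from the paper or any provider; only the 69.5% reduction comes from the article.

```python
def monthly_cost(tokens_per_request: float,
                 requests_per_month: float,
                 usd_per_million_tokens: float) -> float:
    """Token spend per month under a flat per-token price."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens * usd_per_million_tokens / 1_000_000


REDUCTION = 0.695  # reported token reduction at the cited operating point

# Hypothetical workload and price, chosen only to make the arithmetic visible.
before = monthly_cost(tokens_per_request=20_000,
                      requests_per_month=100_000,
                      usd_per_million_tokens=2.00)
after = monthly_cost(tokens_per_request=20_000 * (1 - REDUCTION),
                     requests_per_month=100_000,
                     usd_per_million_tokens=2.00)

print(f"before: ${before:,.2f}  after: ${after:,.2f}")
```

Under these placeholder numbers the bill drops from $4,000 to about $1,220 a month. The saving scales linearly with volume, so it matters most for high-throughput workloads.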
The risk, as always, is that benchmark performance may not survive contact with messy real-world data. Generalization claims need to be validated across the specific workloads that matter in crypto contexts, where adversarial inputs and rapidly shifting market conditions create challenges that academic benchmarks rarely capture.
Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
