Onepagecode

From Agents to Systems: The Real Future of AI

Scaling, memory, self-improvement, and evaluation—what actually makes AI work

Onepagecode
Mar 26, 2026
∙ Paid

Over the past year, most discussion of AI has focused on bigger models and more agents, on the assumption that scale naturally leads to better performance. A closer look at recent research tells a very different story. Across four important papers, a pattern emerges: AI systems don't fail because they lack intelligence; they fail because of how they are structured. Adding more agents often makes things worse due to coordination overhead; memory remains fragmented and poorly governed; systems struggle to improve themselves because their learning mechanisms are fixed; and even when they generate good outputs, they fail to evaluate them properly. This piece breaks down these insights and shows what actually matters when building real AI systems: how to design coordination, structure memory, enable self-improvement, and separate thinking from evaluation, so that AI doesn't just scale, but actually works.


More Agents, Worse Results? The Hidden Truth About Scaling AI Systems

> Agents are LLM-based systems that reason, plan, and act through repeated interaction with an environment.
> This work measures when multiple agents help and when they hurt, using controlled experiments across 180 settings.

What counts as an agentic task
> Agentic tasks need multi-step interaction, partial observability (you don’t see everything at once), and strategy updates based on feedback.
> Static benchmarks (one-shot questions) are not agentic and give misleading guidance about multi-agent value.
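The three properties above (multi-step interaction, partial observability, feedback-driven strategy updates) can be made concrete as a minimal agent loop. This is my own sketch, not code from the paper; `env` and `policy` are hypothetical interfaces standing in for whatever environment and LLM-backed policy a real system would use.

```python
def run_agent(env, policy, max_steps=10):
    """Minimal agentic loop: observe, act, update strategy from feedback.

    `env` exposes observe() / step(action) / done, and `policy` maps a
    partial observation plus the feedback history to the next action.
    Both interfaces are illustrative assumptions, not a real API.
    """
    history = []
    for _ in range(max_steps):
        obs = env.observe()            # partial observability: one slice of state
        action = policy(obs, history)  # strategy can change as history grows
        feedback = env.step(action)    # environment responds to the action
        history.append((obs, action, feedback))
        if env.done:                   # multi-step: loop until the task resolves
            break
    return history
```

A one-shot benchmark question never enters this loop more than once, which is exactly why static benchmarks say little about multi-agent value.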

Experimental setup (short)
> Five architectures compared: single-agent and four multi-agent topologies (independent, centralized, decentralized, hybrid).
> Four agentic benchmarks: web browsing, finance analysis, game planning, realistic workplace tasks.
> Models from three families and matched token budgets so differences reflect architecture, not implementation.
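The matched-budget constraint is what makes the comparison fair, and it also explains why team size is never free: every coordination message comes out of the same token pool. A toy calculation (my own, with made-up numbers, not figures from the paper) shows how quickly per-agent budget shrinks:

```python
def per_agent_budget(total_tokens, n_agents, coord_tokens_per_agent):
    """Split a fixed token budget across n agents, paying coordination
    costs first. All numbers are illustrative assumptions."""
    coord_cost = coord_tokens_per_agent * n_agents
    usable = max(total_tokens - coord_cost, 0)
    return usable // n_agents

# With a 100k-token budget and 5k tokens of coordination per agent:
# 1 agent  -> 95_000 usable tokens
# 4 agents -> 20_000 usable tokens each
```

Under a matched budget, a four-agent team here works with roughly a fifth of the per-agent context a single agent gets, before any task work happens.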

Core, actionable findings
> Multi-agent benefits are highly task-dependent; team size alone does not guarantee improvement.
> Tool-heavy tasks suffer more from coordination overhead under the same compute budget.
> If a single-agent baseline is above about 45% accuracy, adding coordination usually hurts.
> Architecture matters: independent agents amplify errors massively (about 17×), centralized systems contain amplification (~4.4×).
> Centralized coordination helped parallelizable finance tasks (~+80%), decentralized helped dynamic web navigation (~+9%), but every multi-agent variant hurt strictly sequential planning (−39% to −70%).
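The amplification gap can be illustrated with a toy Monte Carlo simulation (my own sketch of the mechanism, not the paper's methodology): when independent agents' raw outputs all feed the final answer, one wrong agent corrupts it, while a centralized aggregator that majority-votes candidates contains isolated mistakes.

```python
import random

def simulate(n_agents, p_error, aggregate, trials=20_000, seed=0):
    """Fraction of trials where the system output is wrong.

    Each agent errs independently with probability p_error.
    aggregate="any"  -> independent topology: any wrong agent corrupts
                        the final output (errors cascade unchecked).
    aggregate="vote" -> centralized topology: an aggregator majority-votes,
                        so isolated mistakes are outvoted.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        wrong = sum(rng.random() < p_error for _ in range(n_agents))
        if aggregate == "any":
            failures += wrong > 0
        else:  # "vote"
            failures += wrong > n_agents // 2
    return failures / trials
```

With `p_error = 0.05` and five agents, the "any" pipeline fails on roughly a fifth of trials while majority voting fails on well under 1% of them, the same flavor of gap as the ~17× vs ~4.4× amplification reported above.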

Three practical scaling principles
> Tool–coordination trade-off: when many tools are needed, per-agent token budgets get squeezed and coordination costs dominate.
> Capability saturation: once a single agent reaches ~45% correct, coordination returns diminish or become negative.
> Topology-dependent error amplification: without an aggregator or verifier, individual mistakes cascade into the final output.
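The three principles fold naturally into a pre-flight check before reaching for a multi-agent design. The thresholds below mirror the findings above, but the function itself is my own heuristic sketch, not something the paper provides, and the tool cutoff is an arbitrary stand-in.

```python
def should_go_multi_agent(single_agent_accuracy, n_tools,
                          has_aggregator, task_is_sequential):
    """Heuristic pre-flight check distilled from the three principles.

    Returns (decision, reason). Cutoffs are illustrative, not prescriptive.
    """
    if single_agent_accuracy >= 0.45:
        return False, "capability saturation: returns diminish past ~45%"
    if task_is_sequential:
        return False, "strictly sequential planning: every multi-agent variant hurt"
    if n_tools > 5 and not has_aggregator:  # hypothetical tool-count cutoff
        return False, "tool-heavy without a verifier: overhead and cascades dominate"
    if not has_aggregator:
        return False, "no aggregator/verifier: errors amplify (~17x vs ~4.4x)"
    return True, "parallelizable, unsaturated task with centralized aggregation"
```

The point is not the exact numbers but the shape of the decision: check saturation first, then task structure, then whether anything in the topology catches mistakes.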
