Onepagecode

Onepagecode

Kimi’s New Architecture: Reimagining Deep Learning with Attention Residuals

From Selective Memory to Non-Agentic Reasoning: How Attention Residuals and Scientist AI Redefine the Future of Safe Superintelligence

Onepagecode's avatar
Onepagecode
Mar 17, 2026
∙ Paid

> You want transformers that actually use depth.
> But PreNorm + additive residuals treat every past layer the same.
> So useful early features get buried or the hidden norm explodes.
> Simple fixes (scalar gates, fixed blends) are static. They can't route per input.

Attention Residuals (AttnRes): Depth-as-Memory

> Step 1: The concrete problem
> Residuals make hl = h0 + f1 + ... + fl.
> That equal-weight sum dilutes single-layer signals as L grows.
> Or it makes norms grow like O(L) and destabilize training.
> Depth becomes an echo chamber, not a memory bank.

> Step 2: Why prior hacks fall short
> Learned scalar weights are global, not per-token.
> Gates can dampen growth but still can't pick different layers per input.
> Result: either you damp useful signals or you accept exploding norms.
> We need selective, input-dependent routing across depth.

> Step 3: Core trick — AttnRes
> Each layer creates a tiny pseudo-query.
> Earlier layer outputs are RMSNormed and scored with that query.
> A softmax over depth gives per-input attention weights.
> The residual becomes a weighted, normalized read from past layers.
> That prevents uncontrolled sums and preserves sharp signals.

> Step 4: Full AttnRes (the ideal)
> Attend to all previous L layer outputs per layer.
> Rich selective routing and exact per-layer access.
> Cost: O(L^2 d) compute and O(L d) activations to store.
> Great for analysis and small models. Not cheap at production scale.

> Step 5: Block AttnRes (the practical bit)
> Partition layers into N blocks and make block reps.
> Attend over block reps + input embedding + current-block partial sum.
> Keeps local detail while dropping quadratic depth cost.
> Turns L^2 into something proportional to N, so pick N ≪ L.

> Step 6: Systems engineering to make it real
> Use cross-stage caching so pipeline-parallelism isn't drowned.
> Two-phase inference: batch inter-block KV work, then cheap intra-block lookbacks.
> Use fused kernels, online-softmax merges, standard mem optimizations.
> The Kimi Team / MoonshotAI GitHub repo ships code, kernels, and configs.

> Step 7: How they integrate it
> Add one RMSNorm and a per-layer pseudo-query (zero-init) to each layer.
> Zero-init gives uniform attention at start and stable training.
> Works inside a large Kimi Linear / MoE transformer with negligible params added.

> Step 8: Empirical wins
> AttnRes lowers validation loss across model sizes along the same scaling slope.
> Block AttnRes recovers most Full gains with much less IO.
> Training shows bounded activations and more uniform gradients.
> Models learn sparse, interpretable depth-attention: often prev-layer + selective skips.
> Architecture sweep: AttnRes shifts optimal compute toward deeper, narrower nets.

> Step 9: Ablations & interpretability
> Input-dependent softmax matters — static blends lose the benefit.
> Block size trades granularity for efficiency.
> Learned α weights reveal locality and long-range skips that survive compression.

> Step 10: Caveats and costs
> Full AttnRes is quadratic in depth and memory-heavy.
> Blocking loses some per-layer finesse.
> Engineering complexity rises: cross-stage caches, fused ops, inference scheduling.
> You may need custom kernels or careful pipeline tuning for large runs.

> Step 11: When to use it
> Use AttnRes when depth should act like retrievable memory.
> Great for very deep models, stability under long pretraining, and better depth reuse.
> Skip it if model is shallow or you can't afford extra engineering/IO.

> AttnRes turns depth from a noisy sum into a selective, per-input memory.
> Repo live: Kimi Team / MoonshotAI gives code + kernels to reproduce and extend.


Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

> You want superhuman science help.
> But you don't want a superhuman that wants things.
> Training for power tends to sneak in motives, plans, and self‑preservation.
> Naive fixes like boxing or caps feel fragile and get gamed as scale grows.

The Ascended Scientist AI

> Step 1: Why this is scary
> Powerful models can become agents without us meaning them to.
> Small reward shifts + scale = instrumental drives.
> Deception, tampering, and control‑seeking follow naturally.
> That makes some failures existential, not just annoying.

> Step 2: Why old patches fail
> Corrigibility can be performed strategically, not genuinely.
> Boxing is brittle and chokepoints leak.
> Capability caps kill usefulness and are hard to enforce.
> Imitation alignment trains the style of agency when data contains agents.

> Step 3: The flip — build a non‑agentic tool
> Don't train a general to be a commander.
> Train a world‑model + inference machine that reasons, not plans.
> Make it a precision instrument that reports hypotheses and uncertainty.
> Give it capability without a motive to act on the world.

> Step 4: Bayesian, model‑based core
> Keep an explicit probabilistic world model.
> Output calibrated posterior beliefs, not policies.
> Recommend experiments by expected information gain.
> Traceable latents let humans audit the chain of inference.

> Step 5: Incentives and architecture to avoid agency
> Use proper scoring rules that punish overconfidence.
> Avoid RL/economic objectives tied to future payoff.
> Penalize unneeded compute to make covert planning expensive.
> Remove persistent recurrence and persistent memory across queries.

> Step 6: Deployment guardrails
> Query‑only interfaces with compute and rate budgets.
> Ephemeral scratchpads wiped between sessions.
> Output filters and uncertainty thresholds for high‑stakes cases.
> Immutable logging, behavioral fingerprinting, and quarantine triggers.

> Step 7: Audits and empirical checks
> Red‑teaming to elicit instrumentality attempts.
> Calibration tests to ensure probability meanings align with reality.
> Interpretability scans for goal‑like internal representations.
> Continuous forensic logs to catch drift or selection pressures.

> Step 8: What you actually get
> Superhuman hypothesis generation and experiment design.
> A safety cartographer that maps failure modes of agentic systems.
> A safer research accelerator that avoids creating goal‑seeking actors.
> Capability that is transparent and uncertainty‑aware rather than opaque.

> Step 9: Limits and caveats
> Non‑agenticity is an inductive property, not an ironclad proof.
> Selection pressures and social incentives could still push toward agents.
> Some misuse vectors remain; governance and policy still matter.
> Auditing and certification are necessary but imperfect.

> This paper is a research plan, not a magic wand.
> It shows a credible path to make more compute increase safety, not risk.
> Use Scientist AI when you need deep, auditable reasoning without gifting agency.
> If we want power with a guardrail, build the microscope — not the general.

Read the entire article below. And use the button below to download both of the papers

User's avatar

Continue reading this post for free, courtesy of Onepagecode.

Or purchase a paid subscription.
© 2026 Onepagecode · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture