Research

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

混合LLM中的注意力失忆:CoT微调破坏长距离召回及修复方法

arXiv logo

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

arXiv.org

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from $67.2\%$ to $9.4\%$. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections ($W_Q, W_K$) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only $W_Q$ and $W_K$ from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from $65.4\%$ to $76.4\%$ while maintaining strong reasoning performance.

Open source

Recommended because

This is worth tracking because it is a concrete research signal, not just a passing headline. The source preview points to a research result, method, evaluation, dataset, or safety finding. For builders and operators, "Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It" can be used as a checkpoint for technical due diligence, roadmap bets, agent design, and evaluation strategy. I keep this thread indexed so future searches around AI research papers, technical methods, and applied AI systems can land on a source-linked page instead of disappearing into a fast-moving feed from arXiv.org.

What to take from this signal

Context

"Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It" is archived here as a source-linked AI signal from arXiv.org. The useful part is the connection between Attention, Amnesia, Hybrid, LLMs, When and technical due diligence, roadmap bets, agent design, and evaluation strategy, which makes the item more actionable than a normal feed headline. The source context says: Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from $67.2\%$ to $9.4\%$. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections ($W_Q, W_K$) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only $W_Q$ and $W_K$ from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from $65.4\%$ to $76.4\%$ while maintaining strong reasoning performance.

Builder takeaway

For an AI builder, the main takeaway is to watch how this signal changes practical decisions around technical feasibility, evaluation design, safety limits, and product primitives. It can inform what to test next, which product surface to compare, and whether the underlying workflow is ready for real users.

Source context

arXiv.org remains the authoritative source for the original claim. This page adds a stable archive URL, a short builder interpretation, and related search language so the item can be found later when the original feed has moved on.

Search angles

  • Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It Research context
  • arXiv.org AI research
  • Attention, Amnesia, Hybrid, LLMs, When builder takeaway
  • AI research papers, technical methods, and applied AI systems

This page keeps a source preview and a stable archive URL for search discovery. The original source remains authoritative.