Research

Kwai Keye-VL-2.0 Technical Report

快手开源 Kwai Keye-VL-2.0-30B-A3B:面向长视频理解与智能体智能的 MoE 多模态模型

arXiv logo

Kwai Keye-VL-2.0 Technical Report

arXiv.org

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

Open source

Recommended because

This is worth tracking because it is a concrete research signal, not just a passing headline. The source preview points to a research result, method, evaluation, dataset, or safety finding. For builders and operators, "Kwai Keye-VL-2.0 Technical Report" can be used as a checkpoint for technical due diligence, roadmap bets, agent design, and evaluation strategy. I keep this thread indexed so future searches around AI research papers, technical methods, and applied AI systems can land on a source-linked page instead of disappearing into a fast-moving feed from arXiv.org.

What to take from this signal

Context

"Kwai Keye-VL-2.0 Technical Report" is archived here as a source-linked AI signal from arXiv.org. The useful part is the connection between Kwai, Keye-VL-2, Technical, Report, introduce and technical due diligence, roadmap bets, agent design, and evaluation strategy, which makes the item more actionable than a normal feed headline. The source context says: We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

Builder takeaway

For an AI builder, the main takeaway is to watch how this signal changes practical decisions around technical feasibility, evaluation design, safety limits, and product primitives. It can inform what to test next, which product surface to compare, and whether the underlying workflow is ready for real users.

Source context

arXiv.org remains the authoritative source for the original claim. This page adds a stable archive URL, a short builder interpretation, and related search language so the item can be found later when the original feed has moved on.

Search angles

  • Kwai Keye-VL-2.0 Technical Report Research context
  • arXiv.org AI research
  • Kwai, Keye-VL-2, Technical, Report, introduce builder takeaway
  • AI research papers, technical methods, and applied AI systems

This page keeps a source preview and a stable archive URL for search discovery. The original source remains authoritative.