AI Models

Sumi: Open Uniform Diffusion Language Model from Scratch

Sumi: A 7B Open-Source Uniform Diffusion Language Model Trained from Scratch

arXiv.org

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.
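The abstract's remark that a uniform diffusion model lets any token be updated at any step can be made concrete with a small sketch of the forward (noising) side of such a process. This is an illustration only, not Sumi's released code: the function names `uniform_corrupt` and `linear_alpha`, the linear schedule, and the toy vocabulary size are all assumptions made for the example. It shows the standard uniform-state corruption, in which each position independently keeps its token with probability alpha and is otherwise resampled uniformly from the vocabulary.

```python
import random


def uniform_corrupt(tokens, alpha, vocab_size, rng):
    """Sample a noisy sequence from a clean one under uniform-state corruption.

    Each position independently keeps its token with probability `alpha`;
    otherwise it is resampled uniformly from the whole vocabulary, so it may
    land on the original token again by chance.
    """
    noisy = []
    for tok in tokens:
        if rng.random() < alpha:
            noisy.append(tok)  # token survives this noise level
        else:
            noisy.append(rng.randrange(vocab_size))  # replaced by a uniform draw
    return noisy


def linear_alpha(t):
    """Hypothetical linear schedule: alpha(0) = 1 (clean), alpha(1) = 0 (pure noise)."""
    return 1.0 - t


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [5, 17, 42, 8, 99, 3]  # toy token ids, not real tokenizer output
    for t in (0.0, 0.5, 1.0):
        print(t, uniform_corrupt(x0, linear_alpha(t), vocab_size=100, rng=rng))
```

Because corrupted positions hold ordinary vocabulary tokens rather than a dedicated mask symbol, a denoiser trained against this process has no marker telling it which positions are noise, so it has to be prepared to revise every position. That is the property the abstract contrasts with masked diffusion.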

Open source

Recommended because

This is worth tracking because it is a concrete model capability signal, not just a passing headline. The source preview describes a fully open 7B uniform diffusion language model, pretrained from scratch on 1.5T tokens, released with weights, checkpoints, and the full training recipe. For builders and operators, "Sumi: Open Uniform Diffusion Language Model from Scratch" can be used as a checkpoint for model selection, product roadmaps, eval planning, and timing decisions. I keep this thread indexed so future searches around AI model updates, capability shifts, and developer adoption can land on a source-linked page instead of disappearing into a fast-moving feed from arXiv.org.

What to take from this signal

Context

"Sumi: Open Uniform Diffusion Language Model from Scratch" is archived here as a source-linked AI signal from arXiv.org. The useful part is the connection between the paper's subject, an open uniform diffusion language model, and model selection, product roadmaps, eval planning, and timing decisions, which makes the item more actionable than a normal feed headline. According to the source abstract, quoted in full at the top of this page, Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where the authors name their education-heavy data mixture as a likely contributor.

Builder takeaway

For an AI builder, the main takeaway is to watch how this signal changes practical decisions around model quality, latency, cost, eval coverage, and release timing. It can inform what to test next, which product surface to compare, and whether the underlying workflow is ready for real users.

Source context

arXiv.org remains the authoritative source for the original claim. This page adds a stable archive URL, a short builder interpretation, and related search language so the item can be found later when the original feed has moved on.

Search angles

  • Sumi: Open Uniform Diffusion Language Model from Scratch AI Models context
  • arXiv.org AI model releases
  • Sumi, Open, Uniform, Diffusion, Language builder takeaway
  • AI model updates, capability shifts, and developer adoption

This page keeps a source preview and a stable archive URL for search discovery. The original source remains authoritative.