ByteDance's "iLLaDA" is a diffusion language model that keeps up with…

Researchers from Renmin University and ByteDance have released iLLaDA, an 8B diffusion language model that matches Qwen2.5 at the base level but falls behind after fine-tuning. Unlike autoregressive models such as GPT or Claude, which generate text one token at a time, diffusion language models start with masked tokens and refine them in parallel across multiple passes. The team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, gaining 21.6 points on the BBH reasoning benchmark. Its average score is 63.9 points, just ahead of the autoregressive Qwen2.5 7B at 63.3. iLLaDA also beats the competing diffusion model Dream 7B on average, 63.9 to 61.4, even without the head start of a strong autoregressive base. Separately, in June 2026, Google DeepMind released DiffusionGemma, a 25-billion-parameter mixture-of-experts model that generates text about four times faster via diffusion but scores worse than the similarly sized autoregressive Gemma 4 on benchmarks such as MMLU and on code.
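
To make the idea of starting from masked tokens and refining them in parallel more concrete, here is a minimal toy sketch in Python. It is not iLLaDA's actual decoding procedure: the `toy_model` function is a hypothetical stand-in for a trained network (it simply looks up a known target sentence and invents a confidence score), and the rule of keeping the most confident guesses at each pass is just one common scheme, assumed here for illustration.

```python
import math

MASK = "<mask>"


def toy_model(tokens, target):
    """Hypothetical stand-in for a trained network.

    Returns a (guess, confidence) pair for every masked position at once,
    which is the "parallel" part. A real model would predict from context;
    this toy reads the answer from `target` and fakes a confidence score.
    """
    guesses = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            confidence = ((i * 7) % 10) / 10  # arbitrary but deterministic
            guesses[i] = (target[i], confidence)
    return guesses


def diffusion_decode(target, num_steps):
    """Start fully masked and reveal a few tokens per pass."""
    tokens = [MASK] * len(target)
    per_step = math.ceil(len(target) / num_steps)
    history = [list(tokens)]  # snapshot after each pass
    for _ in range(num_steps):
        guesses = toy_model(tokens, target)
        if not guesses:
            break  # nothing left to fill in
        # Keep only the most confident guesses; the rest stay masked
        # and are predicted again in the next pass.
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in best[:per_step]:
            tokens[i] = guesses[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    sentence = "diffusion models fill in masked tokens in parallel".split()
    final, history = diffusion_decode(sentence, num_steps=4)
    for step, snapshot in enumerate(history):
        print(step, " ".join(snapshot))
```

An autoregressive decoder would need one pass per token for this eight-token sentence, while the sketch finishes in four passes by committing two tokens per pass. Fewer passes mean faster generation but give the model fewer chances to revise, which is the kind of trade-off between speed and quality the article describes.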