Google DiffusionGemma: 4x Faster Text Generation Open Model

Google DeepMind released DiffusionGemma on June 10, 2026, a 26B MoE open model under Apache 2.0 that generates text up to 4x faster than traditional autoregressive models by using discrete text diffusion instead of token-by-token prediction. It achieves 1,000+ tokens per second on an NVIDIA H100 GPU.

The Real Problem

Every major language model you have used (GPT, Claude, Gemini, Llama) generates text the same way: one token at a time, left to right. Each token depends on every token before it, so the next one cannot be computed until the previous one is final. This creates a fundamental bottleneck. Your GPU could be processing 100 tokens in parallel, but the autoregressive constraint forces it to wait for each previous token to finish.

[ STAT ] Most LLMs utilize less than 5% of GPU compute capacity during autoregressive decoding due to the memory-bandwidth bottleneck. (Source: NVIDIA Technical Blog, 2025)

The practical impact is that even the fastest autoregressive models top out at 100-200 tokens per second on consumer hardware. At that rate, a 2,000-token document takes 10-20 seconds to generate. For interactive use cases such as real-time editing, iterative writing, and code completion, that latency is noticeable. It breaks flow. The bottleneck is not the model's intelligence. It is the architecture's sequential constraint.
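The 10-20 second figure is simple division of output length by decode rate. A minimal sketch of that arithmetic follows; the function name `generation_seconds` is illustrative and not part of any library.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Decode rates quoted in the text for autoregressive models on consumer hardware.
for rate in (100, 200):
    seconds = generation_seconds(2000, rate)
    print(f"{rate} tok/s: {seconds:.0f} s for a 2,000-token document")
```

Plugging in the 1,000 tokens per second quoted for DiffusionGemma on an H100 gives 2 seconds for the same document.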

What This Actually Does

DiffusionGemma throws out the sequential approach entirely. Instead of predicting one token at a time, it starts with a canvas of 256 masked tokens and denoises the entire block in parallel over multiple refinement passes.

[TOOL: DiffusionGemma 26B-A4B] A Mixture-of-Experts model with 26B total parameters but only 3.8B active per token. Fits in 18GB VRAM at 4-bit.

[TOOL: Uniform State Diffusion] The core technique. The model initializes all 256 token positions as masked placeholders, then iteratively refines them. Each forward pass commits 15-20 tokens with high confidence and re-refines the rest. After 48 denoising steps, the entire block is coherent text.

[TOOL: NVFP4] Native 4-bit floating-point format on NVIDIA Blackwell GPUs that accelerates compute throughput with near-lossless accuracy.

The result is 1,000+ tokens per second on H100 and 700+ on RTX 5090. A 2,000-token document generates in 2 seconds instead of 20. The model is also multimodal: it processes interleaved text, images, and video inputs up to 256K context, supporting 140+ languages.
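A rough way to sanity-check the 18GB figure is to estimate the size of the weights alone. The sketch below assumes every one of the 26B parameters is stored at the stated bit width; the gap between its result and 18GB would then be overhead such as activations, caches, and quantization metadata, which is an assumption here and not something the release states.

```python
def weight_gigabytes(total_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring all runtime overhead."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count quoted in the text

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: about {weight_gigabytes(TOTAL_PARAMS, bits):.0f} GB")
```

Note that all 26B parameters must be resident in memory even though only 3.8B are active per token: the active count lowers compute per token, not the storage footprint.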

Who This Is Built For

Researchers and developers building speed-critical interactive applications who need sub-second text generation on local hardware. Think real-time code completion, in-line text editing with suggestions, and conversational AI where response latency directly impacts user retention.

Local AI enthusiasts running models on consumer GPUs (RTX 4090, 5090) who want GPT-4-class speed without cloud API costs. DiffusionGemma fits in 18GB VRAM at 4-bit, well within consumer GPU budgets.

Enterprise teams deploying on-premise AI agents that need high-throughput text generation for document processing pipelines, report generation, or internal chatbots where latency directly correlates with employee productivity and satisfaction.

How It Runs: Step by Step

1. Prompt Input. The user provides a text prompt (system + user message). DiffusionGemma tokenizes the prompt.
2. Canvas Initialization. The model allocates a 256-token canvas filled with masked placeholder tokens. The target output length is determined from the prompt.
3. Forward Pass 1. The model processes the entire canvas simultaneously using bi-directional attention. It computes confidence scores for every token position. Tokens with confidence above a threshold are committed (locked in place).
4. Iterative Denoising (Steps 2-48). Each pass refines the remaining masked positions. Committed tokens serve as context for the next pass. By pass 48, typically 90%+ of tokens are committed.
5. Output Assembly. The committed tokens form the final text block. If longer output is needed, a new 256-token canvas is initialized with the previous output as context, and the process repeats.
6. Post-Processing. The output is detokenized and returned. Thanks to the parallel generation, the entire 256-token block emerges in the time an autoregressive model would take to generate 10-15 tokens.
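The canvas-and-commit loop in steps 2 through 5 can be sketched as a toy simulation. This is not Google's implementation: `toy_model` is a hypothetical stand-in that returns random confidences instead of running a network, and the 0.9 threshold is an assumed value, since the text does not give one. Only the control flow (mask everything, score all open positions in one pass, lock in confident tokens, repeat for at most 48 passes) follows the description.

```python
import random

MASK = None          # placeholder for a position that is not yet committed
CANVAS_SIZE = 256    # canvas length quoted in the text
MAX_PASSES = 48      # denoising passes quoted in the text


def toy_model(canvas, rng):
    """Hypothetical stand-in for the network: one (token, confidence) guess per masked slot."""
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(canvas) if t is MASK}


def denoise(canvas_size=CANVAS_SIZE, max_passes=MAX_PASSES, threshold=0.9, seed=0):
    """Run the commit loop and return (canvas, number of passes used)."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_size
    passes = 0
    while passes < max_passes and any(t is MASK for t in canvas):
        passes += 1
        # One parallel pass scores every masked position at once.
        for i, (token, confidence) in toy_model(canvas, rng).items():
            if confidence >= threshold:
                canvas[i] = token  # committed tokens are never revisited
    return canvas, passes


canvas, passes = denoise()
committed = sum(t is not MASK for t in canvas)
print(f"{committed}/{len(canvas)} positions committed after {passes} passes")
```

In a real model the confidences would depend on the already committed tokens, which is why later passes tend to resolve positions that were ambiguous early on. The toy version has no such dependency and only illustrates the loop structure.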

Setup and Tools

Model available on Hugging Face at huggingface.co/collections/google/diffusiongemma. Download with the transformers library (4.80+ required). For vLLM integration, use the nightly build with diffusion support. The NVIDIA NeMo framework also provides optimized inference pipelines.

Hardware requirement: minimum 18GB VRAM at 4-bit quantization. An RTX 4090 (24GB) or RTX 5090 (32GB) works. H100 or Blackwell B200 for production throughput.

The gotcha: DiffusionGemma is experimental. Tooling is not plug-and-play yet. Standard autoregressive inference frameworks (llama.cpp, Ollama) do not support diffusion decoding as of release date. You will likely need to run the model via Hugging Face Transformers with custom diffusion loop code, or wait for community-optimized runners. Google explicitly states that output quality trails the autoregressive Gemma 4 counterpart: MMLU Pro 77.6 vs 82.6, GPQA 73.2 vs 82.3.
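To put the quoted benchmark scores in perspective, the gap can be expressed both in points and relative to the autoregressive score. The sketch below uses only the four numbers given above; `quality_gap` is an illustrative helper name.

```python
# Scores quoted in the text: (DiffusionGemma, autoregressive Gemma 4).
SCORES = {"MMLU Pro": (77.6, 82.6), "GPQA": (73.2, 82.3)}


def quality_gap(diffusion: float, autoregressive: float) -> tuple[float, float]:
    """Return (gap in points, gap as a percentage of the autoregressive score)."""
    points = autoregressive - diffusion
    return round(points, 1), round(100 * points / autoregressive, 1)


for name, (diffusion_score, ar_score) in SCORES.items():
    points, percent = quality_gap(diffusion_score, ar_score)
    print(f"{name}: {points} points lower ({percent}% relative)")
```

The GPQA gap is nearly twice the MMLU Pro gap, which suggests the quality cost is larger on harder reasoning tasks.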

The Numbers

[ STAT ] 1,000+ tokens/sec on NVIDIA H100, roughly 4x the throughput of autoregressive Gemma 4.
[ STAT ] 700+ tokens/sec on consumer RTX 5090.
[ STAT ] 26B total parameters, 3.8B active: MoE architecture keeps compute costs low.
[ STAT ] Apache 2.0 license: permissive, allows commercial use and modification.
[ STAT ] 256K context window with multimodal input support.
(Source: Google DeepMind, June 2026)

What It Cannot Do

1. DiffusionGemma's output quality trails the autoregressive Gemma 4 on reasoning benchmarks (MMLU Pro, GPQA, MMMU Pro). For tasks requiring deep reasoning, use the AR version.
2. The model is experimental. Common runners such as llama.cpp and Ollama do not support diffusion decoding at launch, and vLLM support is limited to a nightly build, so expect to work with raw Hugging Face Transformers.
3. Maximum output per diffusion pass is 2,048 tokens. Longer outputs require chaining passes, which adds complexity and reduces the speed advantage.
4. The speed gains are most pronounced on single-user local inference. The architecture does not automatically improve throughput on batched cloud serving.
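The chaining cost in the third limitation is easy to estimate. The sketch below splits a requested output length into passes of at most 2,048 tokens, the limit stated above; `plan_passes` is a hypothetical helper, not part of any released tooling.

```python
def plan_passes(total_tokens: int, max_per_pass: int = 2048) -> list[int]:
    """Split a requested output length into chained passes of at most max_per_pass tokens."""
    if total_tokens < 0 or max_per_pass <= 0:
        raise ValueError("total_tokens must be >= 0 and max_per_pass must be > 0")
    full_passes, remainder = divmod(total_tokens, max_per_pass)
    return [max_per_pass] * full_passes + ([remainder] if remainder else [])


plan = plan_passes(5000)
print(f"{len(plan)} chained passes: {plan}")
```

Each additional pass has to take the previous output as context, so the number of passes is a reasonable first proxy for how much of the speed advantage is lost on long outputs.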

Start in About 25 Minutes

1. (5 min) Set up a Hugging Face account and accept the DiffusionGemma terms at huggingface.co/google/diffusiongemma-26B-A4B-it.
2. (5 min) Install the latest transformers library: pip install "transformers>=4.80.0". The quotes matter: without them, most shells treat > as output redirection.
3. (10 min) Run the example inference script from Google's developer guide at developers.googleblog.com/en/diffusiongemma-the-developer-guide. Expect the first inference to take 30-60 seconds while the model loads into VRAM.
4. (5 min) Monitor VRAM usage with nvidia-smi. Ensure your GPU has at least 18GB free VRAM.
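The VRAM check in the last step can be scripted. The sketch below relies on nvidia-smi's query mode (`--query-gpu=memory.free --format=csv,noheader,nounits`), which prints one free-memory value in MiB per GPU. Treating the 18GB requirement as 18 GiB (18,432 MiB) is a conservative assumption, and the helper names are illustrative.

```python
import subprocess

REQUIRED_MIB = 18 * 1024  # 18GB from the text, read conservatively as GiB (assumption)


def parse_free_mib(smi_output: str) -> list[int]:
    """Parse one free-memory value in MiB per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def gpus_with_enough_vram(smi_output: str, required_mib: int = REQUIRED_MIB) -> list[int]:
    """Return the indices of GPUs whose free memory meets the requirement."""
    free = parse_free_mib(smi_output)
    return [index for index, mib in enumerate(free) if mib >= required_mib]


def query_nvidia_smi() -> str:
    """Run nvidia-smi in query mode; requires an NVIDIA driver on this machine."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example with captured output from a hypothetical two-GPU machine.
print(gpus_with_enough_vram("24210\n8000\n"))
```

On a machine with an NVIDIA GPU, call `gpus_with_enough_vram(query_nvidia_smi())` to run the check against live values.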

Frequently Asked Questions

Q: How does DiffusionGemma achieve 4x faster generation? A: It uses discrete text diffusion instead of autoregressive decoding. Instead of generating one token at a time, it denoises a 256-token canvas in parallel over 48 refinement passes.

Q: Can I run DiffusionGemma on my RTX 4090? A: Yes. At 4-bit quantization, DiffusionGemma fits in 18GB VRAM. An RTX 4090 with 24GB VRAM can run it, achieving 500-700 tokens per second.

Q: Is DiffusionGemma better than Gemma 4 for quality? A: No. Google explicitly states DiffusionGemma's output quality trails the autoregressive Gemma 4 on reasoning benchmarks. The trade-off is speed for quality.

Q: What license is DiffusionGemma released under? A: Apache 2.0, a permissive open-source license. It allows use, modification, and commercial deployment, provided you retain the license text and attribution notices.

Q: What hardware do I need to run DiffusionGemma locally? A: Minimum 18GB VRAM at 4-bit quantization. Recommended: RTX 4090 (24GB), RTX 5090 (32GB), or any NVIDIA GPU with 24GB+ VRAM. An H100 delivers 1,000+ tokens/sec.