On June 10, 2026, Google DeepMind published DiffusionGemma, the first large-scale diffusion-based text model released with fully open weights under the Apache 2.0 license. The model breaks with the token-by-token generation paradigm used by every major language model since GPT-2. Instead of predicting one token at a time, DiffusionGemma generates entire 256-token blocks in parallel, achieving up to 4x the generation speed of an equivalent autoregressive model in single-user workloads. This makes it the fastest freely available model for local hardware deployment, including on consumer GPUs like the NVIDIA GeForce RTX 5090. NVIDIA simultaneously published its own technical blog post confirming optimizations for RTX AI Garage, DGX Spark, and RTX PRO, a formal collaboration that underscores the model's readiness for production deployment without cloud dependency.
What Did Google DeepMind Announce with DiffusionGemma?
DiffusionGemma is a 26-billion-parameter Mixture-of-Experts (MoE) model built on the Gemma 4 architecture, activating only 3.8B parameters per inference step. Its text diffusion mechanism replaces sequential token prediction with parallel generation of whole blocks, which sharply reduces latency. The published benchmarks are concrete: 700+ tokens/second on an NVIDIA GeForce RTX 5090, 1,000 tokens/second on a single NVIDIA H100 Tensor Core GPU, 150 tokens/second on an NVIDIA DGX Spark, and up to 2,000 tokens/second on an NVIDIA DGX Station. Overall, that is approximately 4x faster than a comparable autoregressive model. When quantized, the model requires only 18 GB of VRAM, putting it within reach of high-end consumer cards. Availability is immediate: the weights are on Hugging Face, with day-zero support for Transformers, vLLM, Unsloth, MLX, and NVIDIA NeMo. The Apache 2.0 license permits commercial use without royalties, subject to its attribution and notice requirements, so any company can integrate the model into products and services.
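To put the throughput figures above in practical terms, here is a minimal sketch that converts them into an approximate response time. The tokens-per-word ratio is an assumption (about 1.3 for English text; the real value depends on the tokenizer), and the function name `estimate_response` is purely illustrative. The estimate covers generation only and ignores prompt processing and network overhead.

```python
# Throughput figures (tokens/second) as quoted in the announcement.
PUBLISHED_TOKENS_PER_SECOND = {
    "NVIDIA GeForce RTX 5090": 700,
    "NVIDIA H100": 1000,
    "NVIDIA DGX Spark": 150,
    "NVIDIA DGX Station": 2000,
}

BLOCK_SIZE = 256  # tokens generated in parallel per block, per the announcement

# ASSUMPTION: roughly 1.3 tokens per English word; tokenizer-dependent.
TOKENS_PER_WORD = 1.3


def estimate_response(words: int, tokens_per_second: float) -> tuple[int, int, float]:
    """Return (tokens, blocks, seconds) for a response of `words` words."""
    tokens = round(words * TOKENS_PER_WORD)
    blocks = -(-tokens // BLOCK_SIZE)  # ceiling division: partial blocks count
    seconds = tokens / tokens_per_second
    return tokens, blocks, seconds


if __name__ == "__main__":
    for device, tps in PUBLISHED_TOKENS_PER_SECOND.items():
        tokens, blocks, seconds = estimate_response(500, tps)
        print(f"{device}: {tokens} tokens in {blocks} blocks, about {seconds:.2f} s")
```

Under these assumptions, a 500-word response is about 650 tokens, or three 256-token blocks, which works out to roughly 0.65 seconds at 1,000 tokens/second and a little under one second at 700 tokens/second.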
"An enterprise-grade AI model that runs on your own GPU — no per-token bill, no data leaving your premises — is exactly the scenario SMBs have been waiting for. DiffusionGemma makes it real today."
Davarion Group & Labs
Real Impact for SMBs
- 01 Zero per-token cost: running locally on an RTX 4090 or RTX 5090, a business processes millions of tokens without paying API fees. A DiffusionGemma automation server can replace $500–$3,000/month in OpenAI or Anthropic API spend for text-heavy workloads.
- 02 Up to 4x faster generation for agentic workflows: customer service chatbots, bulk product description generation, contract summarization, and email processing can all run in near real time. At 1,000 tokens/sec on an H100 and 700 on an RTX 5090, a 500-word response (roughly 650 tokens) takes about 0.65 seconds on the H100 and just under one second on the RTX 5090.
- 03 Complete data privacy: no external API calls means confidential business data (invoices, contracts, client emails) never leaves your server. This is critical for regulated industries such as legal, healthcare, and finance.
- 04 Immediate recommended action: download DiffusionGemma from Hugging Face, benchmark it with vLLM on an RTX GPU, and map which internal text-generation workflows can migrate from cloud API to local inference within the next 30 days.
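For the benchmarking step in the last item, a small timing harness is usually enough to compare local throughput against the published numbers. The sketch below is deliberately engine-agnostic: `generate` is a hypothetical adapter that you would write around your own inference call (vLLM, Transformers, or anything else) so that it returns the generated tokens as a list. Nothing here is part of DiffusionGemma or of any library's API.

```python
import time
from typing import Callable, Iterable


def measure_throughput(
    generate: Callable[[str], list[str]],
    prompts: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict[str, float]:
    """Time `generate` over `prompts` and report aggregate tokens/second.

    `generate` is a hypothetical adapter: it takes one prompt and returns
    the generated tokens as a list. `clock` is injectable so the harness
    can be checked deterministically.
    """
    total_tokens = 0
    total_seconds = 0.0
    requests = 0
    for prompt in prompts:
        start = clock()
        tokens = generate(prompt)
        total_seconds += clock() - start  # time spent generating only
        total_tokens += len(tokens)
        requests += 1
    # Guard against a zero elapsed time (e.g. an empty prompt list).
    rate = total_tokens / total_seconds if total_seconds > 0 else 0.0
    return {
        "requests": requests,
        "tokens": total_tokens,
        "seconds": total_seconds,
        "tokens_per_second": rate,
    }
```

Feeding it prompts drawn from real workloads (support tickets, contract excerpts, product data) gives a more useful number than a synthetic prompt, because output length and prompt length both affect measured throughput.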
DiffusionGemma redefines the economics of AI automation for mid-sized businesses. Until now, enterprise-grade inference speed required expensive cloud infrastructure or API agreements billed per generated token. With this release, Google DeepMind, in direct collaboration with NVIDIA, democratizes access to near-instantaneous text generation that runs on an on-premise server or an advanced workstation. The Apache 2.0 license permits commercial use without royalties. Day-zero support in vLLM and Hugging Face Transformers means technical teams can deploy it today using tools they already know. For SMBs that generate large volumes of text (proposals, customer communications, document analysis, automated reporting), this release marks the moment local AI became a faster, cheaper, and more private option than per-token cloud APIs.
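The economics can be checked with simple break-even arithmetic. The sketch below is illustrative only: the $6,000 hardware cost and $100 monthly running cost are hypothetical placeholders, not figures from the announcement, and should be replaced with your own quotes. The monthly API spend values are the $500 and $3,000 bounds cited earlier in this article.

```python
import math
from typing import Optional


def break_even_months(
    hardware_cost: float,
    monthly_api_spend: float,
    monthly_running_cost: float = 0.0,
) -> Optional[int]:
    """Months until a local server pays for itself, or None if it never does."""
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return None  # local running costs exceed the API bill being replaced
    return math.ceil(hardware_cost / monthly_saving)


if __name__ == "__main__":
    HARDWARE_COST = 6000.0  # HYPOTHETICAL workstation price in USD
    RUNNING_COST = 100.0    # HYPOTHETICAL power and maintenance per month in USD
    for api_spend in (500.0, 3000.0):
        months = break_even_months(HARDWARE_COST, api_spend, RUNNING_COST)
        print(f"API spend ${api_spend:,.0f}/month: break-even in {months} months")
```

With these placeholder inputs, the payback period ranges from 15 months at the low end of API spend to 3 months at the high end, which shows how strongly the case for local inference depends on current usage volume.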
At Davarion Group & Labs we design and deploy autonomous AI agents for small and medium-sized businesses in Houston, TX and throughout Latin America. With the release of DiffusionGemma, we can now build high-speed text automation solutions that run entirely on the client's own infrastructure — eliminating recurring API costs and ensuring full data privacy. If your business processes large volumes of text — proposal generation, customer service, document analysis, or automated reporting — reach out at davarion.com to explore how DiffusionGemma can reduce your AI operating cost by up to 90% while multiplying processing speed.