DiffusionGemma: Google DeepMind's Open-Source Model Generates Text 4x Faster with Zero Per-Token Cost
AI Automation | 7 min | 668 words | June 11, 2026

Google DeepMind released DiffusionGemma on June 10, 2026 — an Apache 2.0 open-weights 26B MoE model that generates text in parallel 256-token blocks, reaching 1,000 tokens/sec on a single NVIDIA H100 and 700+ on an RTX 5090, with no cloud dependency and no per-token cost.

On June 10, 2026, Google DeepMind published DiffusionGemma, the first large-scale text model based on a diffusion architecture to be released with fully open weights under the Apache 2.0 license. The model departs from the token-by-token (autoregressive) generation used by mainstream language models since GPT-2. Instead of predicting one token at a time, DiffusionGemma generates entire 256-token blocks in parallel, reaching up to 4x the generation speed of an equivalent autoregressive model in single-user workloads. This makes it the fastest freely available model for local hardware deployment, including consumer GPUs such as the NVIDIA GeForce RTX 5090. NVIDIA simultaneously published its own technical blog post confirming optimization for RTX AI Garage, DGX Spark, and RTX PRO, a formal collaboration that underscores the model's readiness for production deployment without cloud dependency.
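
To make the speed argument concrete, the following sketch counts model forward passes under the two decoding styles. Only the 256-token block size comes from the announcement. The number of denoising steps per block (`steps_per_block`) is a hypothetical parameter, since the post does not state it, and treating every forward pass as equally expensive is a simplifying assumption rather than a description of how DiffusionGemma actually behaves.

```python
import math

BLOCK_SIZE = 256  # tokens per parallel block, as stated in the announcement


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, steps_per_block: int) -> int:
    """Block-parallel decoding: each block is refined for a fixed number of steps.

    steps_per_block is a hypothetical value chosen by the caller; the
    announcement does not say how many denoising steps the model uses.
    """
    blocks = math.ceil(num_tokens / BLOCK_SIZE)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 1024
    ar = autoregressive_passes(n)
    # 64 steps per block is an illustrative assumption, not a published figure.
    bd = block_diffusion_passes(n, steps_per_block=64)
    print(f"autoregressive: {ar} passes, block-parallel: {bd} passes")
    print(f"pass-count ratio: {ar / bd:.1f}x")
```

The point of the sketch is the shape of the trade-off: the speedup depends on how few refinement steps a block needs relative to its length, so the real figure could differ considerably from this toy ratio.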

What Did Google DeepMind Announce with DiffusionGemma?

DiffusionGemma is a 26-billion-parameter Mixture-of-Experts (MoE) model built on the Gemma 4 architecture that activates only 3.8B parameters per inference step. Its text diffusion mechanism replaces sequential token prediction with parallel block generation, which is where the throughput gains come from. The published benchmarks are concrete: 150 tokens/second on an NVIDIA DGX Spark, 700+ tokens/second on an NVIDIA GeForce RTX 5090, 1,000 tokens/second on a single NVIDIA H100 Tensor Core GPU, and up to 2,000 tokens/second on an NVIDIA DGX Station, or approximately 4x faster than a comparable autoregressive model. When quantized, the model requires only 18 GB of VRAM, putting it within reach of high-end consumer cards. Availability is immediate: the weights are on Hugging Face with day-zero support for Transformers, vLLM, Unsloth, MLX, and NVIDIA NeMo. The Apache 2.0 license permits commercial use, so any company can integrate the model into products and services without royalties, provided it meets the license's attribution and notice requirements.
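
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this post. The bits-per-parameter estimate is an assumption-laden upper bound: it treats the full 18 GB as weight storage and 1 GB as 10^9 bytes, and the post does not say which quantization scheme is used.

```python
TOTAL_PARAMS = 26e9      # total parameters, from the announcement
ACTIVE_PARAMS = 3.8e9    # parameters active per inference step
QUANTIZED_VRAM_GB = 18   # quoted footprint when quantized

# Published throughput figures in tokens per second ("700+" is treated as 700).
THROUGHPUT = {
    "DGX Spark": 150,
    "RTX 5090": 700,
    "H100": 1000,
    "DGX Station": 2000,
}


def active_fraction() -> float:
    """Share of the model's parameters used on each inference step."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


def implied_bits_per_param(vram_gb: float = QUANTIZED_VRAM_GB) -> float:
    """Upper bound on bits per weight, assuming all quoted VRAM holds weights."""
    return vram_gb * 1e9 * 8 / TOTAL_PARAMS


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a given number of tokens at a steady throughput."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    print(f"implied bits per parameter: {implied_bits_per_param():.2f}")
    for device, tps in THROUGHPUT.items():
        print(f"{device}: {seconds_for(1000, tps):.2f} s per 1,000 tokens")
```

Roughly one parameter in seven is active per step, which helps explain why a 26B model can be this fast, while the memory footprint still reflects all 26B parameters.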

"An enterprise-grade AI model that runs on your own GPU — no per-token bill, no data leaving your premises — is exactly the scenario SMBs have been waiting for. DiffusionGemma makes it real today."

Davarion Group & Labs

Real Impact for SMBs

  • 01. Zero per-token cost: running locally on an RTX 4090 or RTX 5090, a business can process millions of tokens without paying API fees; the remaining costs are hardware and electricity. For text-heavy workloads, a DiffusionGemma automation server can replace $500–$3,000/month in OpenAI or Anthropic API spend.
  • 02. 4x speed boost for agentic workflows: customer service chatbots, bulk product description generation, contract summarization, and email processing can all run in real time. At 1,000 tokens/sec on an H100 and 700 on an RTX 5090, a 500-word response (roughly 650 tokens, assuming about 1.3 tokens per English word) takes about 0.65 seconds on the H100 and just under 1 second on the RTX 5090.
  • 03. Complete data privacy: with no external API calls, confidential business data such as invoices, contracts, and client emails never leaves your server. This is critical for regulated industries such as legal, healthcare, and finance.
  • 04. Recommended next step: download DiffusionGemma from Hugging Face, benchmark it with vLLM on an RTX GPU, and map which internal text-generation workflows can migrate from a cloud API to local inference within the next 30 days.
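
For the benchmarking step, a small backend-agnostic harness is often enough to get a first tokens-per-second number. The sketch below is a minimal example under stated assumptions: `measure_throughput` and `fake_generate` are hypothetical names introduced here, and `fake_generate` is a stand-in that sleeps and returns dummy token ids. The model's Hugging Face repository name and the exact vLLM invocation are not given in this post, so they are deliberately left out; in practice you would replace `fake_generate` with a function that calls your own inference stack and returns the generated token ids.

```python
import time
from typing import Callable, List


def measure_throughput(
    generate: Callable[[str], List[int]], prompts: List[str]
) -> float:
    """Return generated tokens per second, measured across all prompts."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt: str) -> List[int]:
    """Hypothetical stand-in backend: waits briefly, returns 256 dummy token ids."""
    time.sleep(0.01)
    return list(range(256))


if __name__ == "__main__":
    prompts = ["Summarize this contract.", "Draft a reply to this email."]
    tps = measure_throughput(fake_generate, prompts)
    print(f"measured throughput: {tps:.0f} tokens/sec")
```

Running the same harness against the current cloud API and the local deployment, with prompts taken from real workflows, gives a like-for-like comparison before any migration decision.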

DiffusionGemma redefines the economics of AI automation for small and mid-sized businesses. Until now, enterprise-grade inference speed required expensive cloud infrastructure or API agreements billed per generated token. With this release, Google DeepMind, in direct collaboration with NVIDIA, democratizes access to near-instantaneous text generation that runs on an on-premise server or an advanced workstation. The Apache 2.0 license permits commercial use without royalties. Day-zero support in vLLM and Hugging Face Transformers means technical teams can deploy it today using tools they already know. For SMBs that generate large volumes of text, such as proposals, customer communications, document analysis, and automated reporting, this release marks the moment local AI became a faster, cheaper, and more private option than cloud alternatives.
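
Whether local inference is actually cheaper depends on how quickly the hardware pays for itself. The sketch below is a simple break-even calculation. The $500 and $3,000 monthly API figures come from this post; the hardware price and monthly running cost are hypothetical placeholders that you should replace with your own quotes.

```python
import math


def breakeven_months(
    hardware_cost: float,
    monthly_api_spend: float,
    monthly_running_cost: float = 0.0,
) -> float:
    """Months until avoided API fees cover the hardware purchase."""
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return math.inf  # local inference never pays back under these inputs
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    HARDWARE_COST = 4000.0   # hypothetical workstation price, not from the post
    RUNNING_COST = 50.0      # hypothetical monthly electricity and upkeep
    for api_spend in (500.0, 3000.0):  # range quoted in the post
        months = breakeven_months(HARDWARE_COST, api_spend, RUNNING_COST)
        print(f"${api_spend:,.0f}/month API spend: break-even in {months:.1f} months")
```

The calculation ignores staff time for setup and maintenance, which can dominate for small teams, so treat the result as a lower bound on the payback period.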

At Davarion Group & Labs we design and deploy autonomous AI agents for small and medium-sized businesses in Houston, TX and throughout Latin America. With the release of DiffusionGemma, we can now build high-speed text automation solutions that run entirely on the client's own infrastructure — eliminating recurring API costs and ensuring full data privacy. If your business processes large volumes of text — proposal generation, customer service, document analysis, or automated reporting — reach out at davarion.com to explore how DiffusionGemma can reduce your AI operating cost by up to 90% while multiplying processing speed.

#DiffusionGemma, #Google DeepMind, #local AI, #NVIDIA RTX, #open source AI model

Davarion Group & Labs

DAVARION

THE HARMONY OF CREATION

Transforming businesses through intelligent automation and cutting-edge technology solutions.

Contact

  • ceo@davarion.com
  • +1 (346) 865-6734
  • Global • Remote First

© 2026 Davarion Group and Labs. All rights reserved.