Google Gemma 4: The Most Capable Open AI Model You Can Run Yourself
On April 2, Google DeepMind released Gemma 4: four open models under the Apache 2.0 license, ranging from Raspberry Pi to datacenter scale. The 2.3B model beats its 27B predecessor. Here is what matters for developers and businesses.

Introduction
A 2.3 billion parameter model that outperforms its 27 billion parameter predecessor. That is the headline number from Google Gemma 4, released on April 2, 2026. But the real story is not a single benchmark. It is that Google just open-sourced a family of four AI models, from Raspberry Pi scale to datacenter scale, under Apache 2.0. No restrictions on commercial use. No special agreements. Download and deploy.
The Gemma 4 family comes from the same research and technology stack as Gemini 3, Google's flagship closed model. That makes this the closest Google has come to giving away its best work. For businesses exploring local AI deployment, self-hosted inference, or agentic workflows that need to run on-premise, this release changes the math.
Four Models, From Phone to Server Rack
Gemma 4 ships as four distinct models, each targeting a different hardware profile. E2B has 2.3 billion effective parameters, supports 128K token context, and handles text, images, and audio. It runs on smartphones, IoT devices, and Raspberry Pis. E4B doubles the parameters to 4.5 billion with the same 128K context and multimodal support, targeting edge devices and laptops.
The 26B model uses a Mixture-of-Experts (MoE) architecture with only 3.8 billion active parameters at any given time, despite its 26 billion total. This gives it the intelligence of a much larger model at the inference cost of a small one. It supports 256K token context. The flagship 31B dense model packs 30.7 billion parameters with 256K context and ranks third among all open models on the LMArena leaderboard with a score of 2150.
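To make the MoE economics concrete, here is a rough per-token compute sketch using the parameter counts above. The 2-FLOPs-per-active-parameter rule of thumb ignores attention and routing overhead, so treat it as an order-of-magnitude estimate, not a benchmark:

```python
# Rough per-token compute comparison: 26B MoE vs 31B dense.
# For a decoder transformer, per-token forward-pass FLOPs scale with the
# parameters actually used, roughly 2 * active_params (a simplification
# that ignores attention and router overhead).

def per_token_gflops(active_params_billions: float) -> float:
    """Approximate forward-pass GFLOPs per generated token."""
    return 2 * active_params_billions  # 2 FLOPs per active parameter

moe_active = 3.8     # billions of active params (26B MoE)
dense_active = 30.7  # billions of active params (31B dense)

print(f"26B MoE:   ~{per_token_gflops(moe_active):.1f} GFLOPs/token")
print(f"31B dense: ~{per_token_gflops(dense_active):.1f} GFLOPs/token")
print(f"MoE uses ~{dense_active / moe_active:.1f}x less compute per token")
```

Under this simplification, the MoE model does roughly an eighth of the dense model's per-token work, which is why it can run on far more modest hardware.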
All four models handle text, images, video, and audio inputs natively. All support function-calling for agentic workflows out of the box. And all are released under Apache 2.0, which means you can modify them, fine-tune them, and ship them in commercial products without licensing fees or usage restrictions.
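What function-calling actually looks like can be sketched in a few lines. The JSON tool-call format and the model stub below are illustrative placeholders, not the real Gemma 4 chat template; the part that carries over is the loop structure, where the model emits a call, the host executes it, and the result goes back into context:

```python
import json

# Minimal sketch of a function-calling loop. The tool-call schema and
# fake_model are hypothetical stand-ins for a local Gemma 4 generation
# call; consult the published chat template for the real format.

TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def fake_model(messages):
    """Stand-in for a local model generation call."""
    if messages[-1]["role"] == "user":
        # Model decides to call a tool, expressed as JSON.
        return json.dumps({"tool": "get_order_status", "args": {"order_id": "A-17"}})
    return "Your order A-17 has shipped."

def run_agent(user_msg):
    messages = [{"role": "user", "content": user_msg}]
    reply = fake_model(messages)
    call = json.loads(reply)                      # parse the tool call
    result = TOOLS[call["tool"]](**call["args"])  # execute it host-side
    messages.append({"role": "tool", "content": json.dumps(result)})
    return fake_model(messages)                   # model sees the result

print(run_agent("Where is my order A-17?"))
```

Chaining several such iterations is what turns a chat model into an agent, which is why the tool-use benchmarks below matter.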
The Benchmarks That Matter
Numbers without context are noise. So here are the comparisons that tell an actual story. On GPQA Diamond, a graduate-level science reasoning benchmark, the 31B model scores 85.7% and the 26B model scores 79.2%. On AIME 2026 math, the 31B scores 89.2% and the 26B hits 88.3%. Compare that to Gemma 3 27B at 20.8% on the same test. The generational improvement is not incremental. It is a category shift.
Tool use tells a similar story. On the retail benchmark from the tau-2 suite, the 31B model scores 86.4%. Gemma 3 27B scored 6.6% on the same test. This matters because tool use is the core capability of agentic AI: a model that can call functions, query APIs, and chain actions together to solve multi-step problems.
The E2B model deserves its own highlight. At 2.3 billion effective parameters, it beats Gemma 3 27B on most benchmarks despite being roughly one-tenth the size. Google CEO Sundar Pichai described it as packing "an incredible amount of intelligence per parameter." In multilingual performance, the models outperform Qwen 3.5 in German, Arabic, Vietnamese, and French, which is relevant for businesses operating across Europe and beyond.
What the Community Found After 24 Hours
No launch is complete without real-world testing. Within 24 hours of release, the developer community identified both strengths and limitations. The E2B model's efficiency received widespread praise. Running a capable multimodal model on a basic laptop or Raspberry Pi was previously not feasible. Now it is, and the practical use cases for edge deployment expand significantly.
The concerns centered on the MoE model. Community benchmarks showed it running at roughly 11 tokens per second versus 60-plus for Qwen 3.5's equivalent model. That speed gap matters for interactive applications. The dense 31B model clocked 18 to 25 tokens per second on dual consumer GPUs, acceptable for most use cases but short of the faster closed alternatives.
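A quick sanity check shows why that gap matters for interactive UX. Assuming a typical 400-token reply (the reply length is our assumption, the throughput figures are from the community benchmarks above):

```python
# Time to stream a typical reply at the reported throughputs.
response_tokens = 400  # assumed typical chat reply length

for name, tokens_per_sec in [("Gemma 4 26B MoE", 11), ("Qwen 3.5 equivalent", 60)]:
    seconds = response_tokens / tokens_per_sec
    print(f"{name}: {seconds:.1f}s for a {response_tokens}-token reply")
```

A half-minute wait versus under seven seconds is the difference between a chat interface that feels responsive and one that does not.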
VRAM consumption was also flagged as higher than expected, particularly for long context windows. And developers attempting to fine-tune the models with QLoRA reported tooling friction with Google's new training configuration requirements. These are launch-day issues that tend to improve rapidly, but they are worth noting for teams planning immediate deployments.
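The long-context VRAM pressure is easy to see in a back-of-the-envelope estimate: total memory is roughly weights plus KV cache, and the cache grows linearly with context length. The layer count, KV-head count, and head dimension below are assumed values for illustration; substitute the published Gemma 4 config for a real estimate:

```python
# Rough VRAM estimate: model weights plus fp16 KV cache.
# Architecture numbers below are hypothetical placeholders, not the
# actual Gemma 4 31B config.

def vram_gb(params_b, bytes_per_weight, layers, kv_heads, head_dim, ctx_tokens):
    weights = params_b * 1e9 * bytes_per_weight
    # One K and one V tensor per layer, 2 bytes per fp16 element.
    kv_cache = 2 * layers * kv_heads * head_dim * 2 * ctx_tokens
    return (weights + kv_cache) / 1e9

# Hypothetical 31B setup: 4-bit weights, 48 layers, 8 KV heads of dim 128.
estimate = vram_gb(30.7, 0.5, 48, 8, 128, 256_000)
print(f"~{estimate:.1f} GB at the full 256K context")
```

Under these assumptions, the KV cache at full context costs several times more memory than the quantized weights themselves, which is consistent with the community's surprise at long-context VRAM usage.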
Why Apache 2.0 Changes Everything
Previous Gemma versions shipped under a more restrictive license that limited certain commercial applications. Gemma 4 ships under Apache 2.0, the same license used by Kubernetes, Airflow, and most of the modern open-source infrastructure stack.
The practical impact is immediate. You can download Gemma 4, fine-tune it on your proprietary data, embed it in your product, and sell that product without paying Google or signing an agreement. You can modify the model weights, create derivative works, and distribute them. The only requirement is attribution.
For businesses that have been wary of closed AI APIs because of vendor lock-in, data privacy, or unpredictable pricing, this is the strongest alternative yet. Run it on your own servers. Keep your data on-premise. Pay for compute, not per-token API fees. For many workloads, the total-cost-of-ownership math tips in favor of self-hosting once model quality reaches this level.
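That break-even arithmetic is simple enough to sketch. The prices below are placeholder assumptions, not quotes from any provider; the point is the structure of the calculation, not the specific numbers:

```python
# Back-of-the-envelope break-even: per-token API pricing vs a self-hosted
# GPU server. Both prices are assumed placeholders for illustration.

api_price_per_m_tokens = 3.00  # USD per million tokens (assumed)
gpu_cost_per_month = 1500.00   # amortized server + power, USD (assumed)

def monthly_cost_api(tokens_millions):
    """API spend for a given monthly token volume."""
    return tokens_millions * api_price_per_m_tokens

break_even_m = gpu_cost_per_month / api_price_per_m_tokens
print(f"Self-hosting breaks even at ~{break_even_m:.0f}M tokens/month")
```

Above that volume, a fixed-cost server beats metered API pricing; below it, the API stays cheaper, before accounting for ops effort on either side.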
What We See at MG Software
At MG Software, we currently use a mix of cloud API models for different tasks. Gemma 4 does not replace that strategy, but it adds a powerful new option for specific scenarios.
The E2B model is interesting for on-device features in mobile and progressive web apps. Classification, intent detection, and simple summarization tasks that currently require an API call could run locally, eliminating latency and API costs entirely. For progressive web apps that need offline AI capabilities, this was previously not realistic.
The 26B MoE model hits a sweet spot for businesses that want self-hosted AI but cannot justify datacenter-grade hardware. A single consumer GPU running a 256K context window model with function-calling support opens the door to local code assistants, document analysis, and customer-facing chat that never leaves your infrastructure. For clients with strict data residency requirements, especially in healthcare, legal, and government sectors, this is the answer to the question "can we use AI without sending our data to a third party?"
If your team is evaluating whether local or self-hosted AI makes sense for your use case, get in touch. The cost and capability threshold shifted this week.
Conclusion
Google Gemma 4 is not just another open model release. It is the point where open-source AI reaches genuine production quality across multiple scales, from edge devices to server deployments, with no licensing strings attached. The benchmarks speak for themselves. A 2.3B model outperforming last generation's 27B model is the kind of efficiency gain that reshapes what is possible.
For development teams, the takeaway is practical: test Gemma 4 against your current workloads. For classification, function-calling, and multilingual tasks, it may already be good enough to replace API calls. For self-hosted deployment, the Apache 2.0 license removes the last barrier. The open-source AI gap is closing faster than most people expected.

Jordan
Co-Founder