Gemma 4: Google's Multimodal Leap in Open-Weight AI

Google has returned to the open-weight model arena with Gemma 4, building on the success of its predecessors. Previous versions such as Gemma 3 were praised for their density and performance; Gemma 4 raises the bar by bringing multimodal capabilities to even its smaller models. The new release can process images, video, and audio, and offers a context window of a quarter of a million tokens.

Enhanced Capabilities and Architecture

Gemma 4 brings a range of improvements, including stronger reasoning, a mix of efficient architectures, and solid coding and semantic abilities. The models come in three dense variants, optimized for smaller devices and capable of audio and video inference. A Mixture of Experts (MoE) model is also available and promises much faster inference.

Google's internal benchmarks show a dramatic improvement over previous versions. For instance, the score on the Big Bench Hard benchmark has risen from around 20% to 89% with the new models. Even the smaller 2-billion-parameter text model, when enhanced with image embeddings, outperforms Gemma 3. The models also support function calling and cover 35 languages.

Multimodal Performance: Image and Audio Analysis

The real innovation in Gemma 4 is its multimodal support. In a test with a CT scan of a brain tumor, the largest model correctly flagged the serious finding, stating it was a "ring like lesion surrounding edema of serious finding that requires urgent medical attention." The MoE edition, given the same scan, identified a "metastatic tumor," with token inspector data showing high confidence scores for brain abscess and high-grade glioma.
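Confidence scores of the kind a token inspector displays are usually softmax probabilities computed from the model's output logits. The sketch below illustrates that calculation only; the logits and token labels are made up for the example and are not Gemma 4 output.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for three candidate next tokens; not real model output.
candidate_logits = {"abscess": 2.0, "glioma": 1.5, "metastasis": 3.0}

probabilities = softmax(candidate_logits)

# Show the candidates from most to least likely.
for token, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {p:.1%}")
```

A high score for one token means the model weighted it heavily against the other candidates at that position. It does not mean the statement is medically correct.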

The smaller 4-billion-parameter model also showed promise, identifying "acute subacute hemorrhage bleeding" in the brain. Even the 2-billion-parameter model, which doubles to 4 billion with embeddings, managed to suggest the presence of a tumor in a brain scan, showing that it can process visual information effectively.

Beyond medical imaging, Gemma 4 excels at general image analysis. When presented with an image containing text like "local AI bot turbo quant open claw 100% working," the 2 billion parameter model accurately extracted all the text, demonstrating its OCR capabilities.

The smaller models also unlock audio analysis, a feature not typically available in larger models of this class. When fed an audio clip, Gemma 4 identified the speaker's gender, reporting 100% certainty. It could also transcribe speech and follow commands embedded in the audio. For example, when asked "What is the meaning of life?" via audio, the model provided a response, though the smaller version tried first did not apply the requested LaTeX formatting. The larger MoE model did generate its response with LaTeX.

Code Generation and Application Replication

Gemma 4's coding and generative capabilities are equally impressive. When presented with a screenshot of an application, the model could generate Swift code to replicate its functionality. The 31-billion-parameter model produced a comprehensive SwiftUI implementation, including a sidebar, chat scroll view, and message sending functionality, even identifying the model used (Gemma 4 1B at 9 bits).

Switching to HTML generation for web browser compatibility, the faster MoE model generated a complete web application structure, including a sidebar and chat interface, in just over 3,000 tokens. While the generated HTML included a placeholder image that was not part of the original application, the overall structure and UI elements were remarkably well-recreated.

Logical Reasoning and Problem Solving

Gemma 4 demonstrates strong logical reasoning abilities. When asked whether to drive or walk 50 meters to a car wash, the model correctly concluded that driving is the right choice, since the car itself has to get to the car wash. The smaller models also handled this simple logic well, though the very smallest model suggested walking might be more efficient for that distance, indicating a slight loss of nuance at the lowest tier.

The models also handled classic riddles accurately. When presented with a modified trolley problem in which the five people were already dead, the model correctly advised against pulling the lever, showing that it picked up on the changed premise.

Tool Use and Web Interaction

The ability to interact with external tools is another significant advancement. Gemma 4 successfully made tool calls to an AI image generator. When asked to retrieve information from the Wikipedia article on Google, the larger model effectively used a "get web page content" tool, paginating through results to find the answer. The smaller model, however, needed a more specific article title before it would initiate the tool call, highlighting the difference in capability between model sizes.
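The pagination behavior described above can be sketched as a loop that requests one page at a time until the answer turns up. Everything here is an assumption made for illustration: the tool signature, the `has_more` flag, the page size, and the canned article text are hypothetical, and a plain loop stands in for the model's decision to call the tool again.

```python
PAGE_SIZE = 40  # characters per page; hypothetical, kept small for the demo

# Stand-in for a fetched article; a real tool would download the page.
FAKE_PAGES = {
    "Example article": "Intro text. " * 5 + "The answer is 42. " + "Closing text. " * 5,
}

def get_web_page_content(title: str, page: int) -> dict:
    """Hypothetical tool: return one page of the article plus a has_more flag."""
    text = FAKE_PAGES[title]
    start = page * PAGE_SIZE
    chunk = text[start:start + PAGE_SIZE]
    return {"content": chunk, "has_more": start + PAGE_SIZE < len(text)}

def find_in_article(title: str, needle: str, max_pages: int = 20):
    """Request pages in order until the needle appears or the pages run out.

    Returns the index of the page on which the needle became visible,
    or None if it was never found.
    """
    seen = ""  # accumulate text so a match spanning two pages is not missed
    for page in range(max_pages):
        result = get_web_page_content(title, page)
        seen += result["content"]
        if needle in seen:
            return page
        if not result["has_more"]:
            break
    return None

print(find_in_article("Example article", "answer is 42"))
```

Note that the tool is keyed by an exact title. A lookup like this fails outright on a vague name, which is one plausible reason a smaller model would need a more specific article title before its call succeeds.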

Coding Prowess: Web Page Generation

In a final coding challenge, Gemma 4 was tasked with creating a high-fidelity interactive web page. Both the dense and MoE versions produced substantial code. The smaller, faster MoE model generated a visually appealing solar system simulation, complete with interactive elements like a controllable asteroid that could form a ring or a moon. The larger dense model, after an initial runtime error, was able to fix its own code and render a similar simulation, even adding a controllable spaceship with fluid movement and a follow-on camera. This self-correction capability is particularly noteworthy.
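The self-correction pattern, in which a runtime error is fed back to the model so it can produce a fixed draft, can be sketched as a small retry loop. This is a minimal illustration under stated assumptions, not the workflow used in the test: `fake_model` is a hypothetical stand-in that returns canned drafts, and the generated code is Python rather than the web page code from the challenge.

```python
import traceback
from typing import Optional

def fake_model(prompt: str, error: Optional[str] = None) -> str:
    """Hypothetical stand-in for a model call; returns Python source code."""
    if error is None:
        # First draft contains a bug: 'radius' is never defined.
        return "def orbit_length(r):\n    return 2 * 3.14159 * radius\n"
    # Second draft, returned once an error message is supplied.
    return "def orbit_length(r):\n    return 2 * 3.14159 * r\n"

def generate_with_repair(prompt: str, max_attempts: int = 3):
    """Ask for code, run it, and pass any runtime error back for another try."""
    error = None
    for attempt in range(1, max_attempts + 1):
        source = fake_model(prompt, error)
        namespace = {}
        try:
            exec(source, namespace)          # define the generated function
            namespace["orbit_length"](1.0)   # smoke-test it with one call
            return source, attempt
        except Exception:
            error = traceback.format_exc()   # becomes input to the next draft
    raise RuntimeError("no working draft within the attempt budget")

source, attempts = generate_with_repair("compute the length of a circular orbit")
print(f"working draft after {attempts} attempt(s)")
```

Running generated code with `exec` is acceptable for a toy like this. Anything real should execute drafts in a sandbox.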

Conclusion: A Powerful Open-Weight Contender

Gemma 4 represents a significant step forward for Google in the open-weight AI landscape. The integration of multimodal capabilities across model sizes, coupled with improved reasoning, coding, and tool use, makes it a formidable contender. While the release of larger models is still anticipated, the current offerings provide a compelling glimpse into the future of accessible, powerful AI. The team's commitment to open releases is commendable, and ongoing development promises further advances.

Key Takeaways

Gemma 4 introduces multimodal capabilities (image, video, audio) to even smaller open-weight models.
Significant performance improvements are seen across benchmarks compared to previous Gemma versions.
The models demonstrate strong capabilities in image and audio analysis, including medical imaging and transcription.
Gemma 4 can generate code for applications and web pages, and even self-correct errors.
Logical reasoning and tool use are well-supported, with larger models showing more advanced capabilities.
The open-weight nature of Gemma 4 fosters accessibility and further development in the AI community.