A Mid-Sized Open Model That Skips the Encoder Entirely
Google DeepMind put out Gemma 4 12B, and here's the part that makes it interesting: it's an open-source multimodal model that handles text, images, and audio without leaning on dedicated encoders. That's a first for a mid-sized open-weight model. The whole thing weighs in at 12 billion parameters and fits inside 16GB of VRAM or unified memory, which means you can run multimodal AI inference on regular consumer hardware. Not a server farm. Not a rented cluster. The machine you probably already own.
It also fills a hole in the Gemma 4 lineup. The family arrived in April with four variants: the edge-friendly E2B and E4B models on the small end, and the heftier 26B Mixture of Experts and 31B Dense configurations on the large end. Those earlier models all relied on vision transformer layers and conformer-based audio encoders to make sense of non-text inputs. The new 12B variant throws both out and uses what Google calls a "Unified" architecture instead.
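To put the 16GB figure in context, here is a rough back-of-the-envelope calculation. It counts weight storage only and assumes a few common bits-per-weight formats. The article does not say which precision Google's 16GB claim refers to, so treat the formats below as illustrative assumptions rather than shipped configurations.

```python
# Back-of-the-envelope weight memory for a 12-billion-parameter model.
# Assumption: only weights are counted. KV cache, activations, and runtime
# overhead are ignored, so real usage will be higher.

PARAMS = 12_000_000_000
BUDGET_GB = 16  # the memory figure quoted in the article, read as decimal GB

# Bits per weight for a few common storage formats (illustrative choices,
# not formats the article says Google ships).
FORMATS = {"bf16": 16, "int8": 8, "int4": 4}


def weight_gb(params: int, bits_per_weight: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in FORMATS.items():
    gb = weight_gb(PARAMS, bits)
    verdict = "fits" if gb <= BUDGET_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} in {BUDGET_GB} GB")
```

Under these assumptions, 16-bit weights alone come to 24 GB, so a 16GB machine would most likely be running a quantized build at 8 bits per weight or fewer.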
How the Encoder-Free Design Actually Works
Think about how a typical multimodal model handles things. You've got separate encoder modules that chew on images and audio first, then hand off their representations to the language model backbone. It's a relay race. Gemma 4 12B skips a few legs of that race.
Projecting Pixels and Sound Straight Into the Model
Instead of a full vision encoder, which usually runs somewhere between 15 and 27 transformer layers, the 12B model uses a lightweight 35-million-parameter embedding module. That module projects raw pixel patches directly into the LLM's token space with a single matrix multiplication, and factorized 2D positional embeddings tell the model where each patch sits. Audio takes a similar shortcut. Raw 16 kHz waveforms get sliced into 40-millisecond frames (640 samples each) and projected straight into the same dimensional space as text tokens, with no separate speech recognition encoder in between.
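The idea can be sketched in a few lines of plain Python. This is a toy illustration, not Google's implementation. The sample rate and frame length come from the description above. Everything else is an assumption made for the example: the tiny model width, the grayscale 8x8 image, the non-overlapping frames, the random weights, and the reading of "factorized" as a row vector plus a column vector.

```python
import random

SAMPLE_RATE_HZ = 16_000  # from the article
FRAME_MS = 40            # from the article
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 640 samples per frame
D_MODEL = 8              # toy width; the real hidden size is not stated
PATCH = 4                # toy patch size (assumption)


def frame_waveform(samples, frame_len=FRAME_LEN):
    """Split a waveform into non-overlapping frames, dropping any short tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def project(vector, weights):
    """One matrix multiply: (n_in,) times (n_in, d_model) gives (d_model,)."""
    return [sum(v * w for v, w in zip(vector, col)) for col in zip(*weights)]


def patch_tokens(image, patch, w_patch, row_emb, col_emb):
    """Flatten each patch, project it, then add row and column position vectors."""
    tokens = []
    for r in range(len(image) // patch):
        for c in range(len(image[0]) // patch):
            flat = [image[r * patch + y][c * patch + x]
                    for y in range(patch) for x in range(patch)]
            emb = project(flat, w_patch)
            tokens.append([e + a + b
                           for e, a, b in zip(emb, row_emb[r], col_emb[c])])
    return tokens


rng = random.Random(0)


def rand_matrix(n_in, n_out):
    """Random stand-in for learned weights, shape (n_in, n_out)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


# One second of noise stands in for real audio.
audio = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE_HZ)]
frames = frame_waveform(audio)
audio_tokens = [project(f, rand_matrix(FRAME_LEN, D_MODEL)) for f in frames[:1]]

# An 8x8 grayscale grid stands in for a real image: a 2x2 grid of 4x4 patches.
image = [[rng.random() for _ in range(8)] for _ in range(8)]
image_tokens = patch_tokens(image, PATCH, rand_matrix(PATCH * PATCH, D_MODEL),
                            rand_matrix(2, D_MODEL), rand_matrix(2, D_MODEL))

print(len(frames), "audio frames of", FRAME_LEN, "samples each")
print(len(image_tokens), "image tokens of width", len(image_tokens[0]))
```

Both paths end in vectors of the same width as a text token embedding, which is what lets the language model consume them without a separate encoder stack in front.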
Why This Matters for Speed and Tuning
The payoff is latency. Because there's no encoder pipeline to wait on, the LLM can start working on inputs sooner. And there's a second benefit that's easy to overlook: fine-tuning gets simpler. A single LoRA pass can update vision, audio, and text weights all at once, rather than wrangling each modality separately.
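A toy sketch shows why one adapter can cover every modality. It uses the standard LoRA formulation, where a frozen weight W is adjusted by a low-rank product scaled by alpha / rank. The 4x4 identity weight, the rank of 1, and the example token vectors are all made up for illustration and say nothing about how Gemma's layers are actually shaped or tuned.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged LoRA weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def apply(w, x):
    """Apply weight matrix w to a single token vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


# Frozen backbone weight (toy 4x4 identity) shared by every token.
W = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]

# Rank-1 adapter: A is (rank x d_in), B is (d_out x rank). Values are made up.
A = [[1, 0, 0, 0]]
B = [[0.5], [0], [0], [0]]

merged = lora_weight(W, A, B, alpha=2, rank=1)

# Hypothetical token vectors. In an encoder-free design, image and audio
# tokens flow through the same backbone weights as text tokens.
tokens = {
    "text": [1, 0, 0, 0],
    "image": [1, 1, 0, 0],
    "audio": [2, 0, 1, 0],
}

for modality, vec in tokens.items():
    print(modality, apply(W, vec), "->", apply(merged, vec))
```

Because every token passes through the same backbone weights, the single low-rank update changes the output for text, image, and audio tokens alike. With separate encoders, each encoder would need its own adapter or its own tuning run.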
Performance That Punches Above Its Size
Here's where it gets surprising. Google says the 12B model gets close to the performance of the bigger 26B MoE variant on standard benchmarks while using less than half the memory. The reported scores for the 12B are 77.2% on MMLU Pro and 78.8% on GPQA Diamond. For a model you can run locally, those are figures worth paying attention to.
Licensing and Day-One Tooling Support
The model ships under an Apache 2.0 license, which is the commercially permissive kind. Google first adopted that license for the Gemma family with the April Gemma 4 release. On availability, the 12B variant landed with day-one support across a wide spread of tools: llama.cpp, vLLM, MLX, Ollama, LM Studio, and Unsloth.
A Local-First Experience on Apple Silicon
This release lines up with Google's continued push into local-first tooling for macOS. There's an open-source Electron app called Gemma Chat that runs Gemma 4 models locally on Apple Silicon Macs through Apple's MLX framework, and it now supports the new 12B variant alongside the earlier ones. The app gives you two ways to work — a coding agent mode and a conversational mode with voice input powered by local speech-to-text. Everything stays on the machine. Your prompts, the generated content, all of it lives on-device.

