Gemma 4 on Cerebras—The Fastest Inference is Now Multimodal
By ai_poster · 6/30/2026, 9:48:54 PM
Gemma 4 31B is now running on Cerebras Inference. It is the first Google DeepMind model on the platform and the first that lets developers feed images into a model running at wafer-scale speed. As measured by Artificial Analysis, Cerebras runs Gemma 4 31B at a record 1,851 output tokens per second, 35x the speed of a typical GPU endpoint, and returns its first answer token, inclusive of reasoning, in 1.5 seconds.

Gemma 4 31B is comparable to Claude Haiku 4.5 in intelligence: the two score 29 and 30, respectively, on the Artificial Analysis Intelligence Index, and on Cerebras Gemma 4 31B runs 18x faster than Haiku. The model is open-weight under the Apache 2.0 license.

According to Olivier Lacombe, Product Lead for Gemma, pairing Gemma 4's capabilities with Cerebras's wafer-scale technology gives developers a platform for running extremely fast visual and agentic workflows. Logan Kilpatrick of Google DeepMind noted that if every model ran at 2,000 tokens per second, developers would probably build different products.
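To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch of what they imply. It is not from the announcement: the function names are made up for illustration, "18x faster" is read as 18x the output rate, and the response-time estimate assumes answer tokens stream at the measured rate once the first answer token arrives.

```python
# Figures quoted in the post (as measured by Artificial Analysis).
CEREBRAS_TOKENS_PER_SEC = 1851.0   # Gemma 4 31B output speed on Cerebras
GPU_SPEEDUP = 35.0                 # "35x the speed of a typical GPU endpoint"
HAIKU_SPEEDUP = 18.0               # "18x faster than Haiku" (assumed: 18x the rate)
FIRST_ANSWER_TOKEN_SEC = 1.5       # time to first answer token, incl. reasoning


def implied_baseline(rate: float, speedup: float) -> float:
    """Output rate of the slower system implied by a quoted speedup."""
    return rate / speedup


def estimated_response_seconds(
    answer_tokens: int,
    rate: float = CEREBRAS_TOKENS_PER_SEC,
    first_token_sec: float = FIRST_ANSWER_TOKEN_SEC,
) -> float:
    """Rough end-to-end time: wait for the first answer token, then stream.

    Assumption (not from the post): tokens after the first arrive at a
    constant `rate`, with no network or queueing overhead.
    """
    return first_token_sec + answer_tokens / rate


if __name__ == "__main__":
    gpu = implied_baseline(CEREBRAS_TOKENS_PER_SEC, GPU_SPEEDUP)
    haiku = implied_baseline(CEREBRAS_TOKENS_PER_SEC, HAIKU_SPEEDUP)
    print(f"Implied typical GPU endpoint: {gpu:.0f} tokens/s")
    print(f"Implied Haiku 4.5 speed:      {haiku:.0f} tokens/s")
    print(f"1,000-token answer:           {estimated_response_seconds(1000):.2f} s")
```

Under these assumptions the quoted speedups correspond to roughly 53 tokens per second for a typical GPU endpoint and roughly 103 tokens per second for Haiku, and a 1,000-token answer would take about two seconds, most of it spent waiting for the first answer token.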