Gemma 4 is finally stable on llama.cpp


On April 2nd, Google released Gemma 4. llama.cpp support landed on day one, though with many bugs; all of them have now been fixed.
Four variants: E2B, E4B, 26B MoE, and 31B dense.
The 31B ranks third on the Arena AI leaderboard; the 26B ranks sixth.
That puts them in the strongest tier of open-source models.
Use --chat-template-file to load the interleaved chat template; see the sketch below.
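A minimal sketch of loading a template file with llama-cli. The GGUF and template filenames here are hypothetical placeholders; --jinja turns on Jinja template processing, which custom template files rely on:

    # Load a custom Jinja chat template instead of the one baked into the GGUF
    # (filenames are placeholders; substitute your own)
    llama-cli -m gemma-4-e4b-q4_k_m.gguf \
        --jinja \
        --chat-template-file gemma4-interleaved.jinja \
        -p "Hello"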
Enabling --cache-ram 2048 is recommended.
Choose your context length based on available VRAM; a combined launch example follows.
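Putting the flags together, here is a sketch of a llama-server launch. The model filename, context size, and GPU layer count are assumptions to tune for your hardware, and I'm assuming --cache-ram takes a value in MiB:

    # Serve Gemma 4 31B locally. KV-cache memory grows with context
    # length, so lower -c if VRAM is tight.
    # --cache-ram caps the host-RAM prompt cache (assumed MiB);
    # -ngl 99 offloads all layers to the GPU.
    llama-server -m gemma-4-31b-q5_k_m.gguf \
        --jinja \
        --chat-template-file gemma4-interleaved.jinja \
        --cache-ram 2048 \
        -c 16384 \
        -ngl 99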
A year ago, the best local option was a quantized Llama 3.1 70B, and it was barely usable.
Now Gemma 4 31B at Q5 runs smoothly on a Mac Studio, approaching GPT-4 level.
AI applications that don't depend on third-party APIs are starting to be commercially viable: data stays on the local machine, per-call cost is zero, and latency is extremely low.
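For instance, llama-server exposes an OpenAI-compatible HTTP endpoint, so an app can point at localhost instead of a paid cloud API. The port below is llama-server's default; no API key is needed:

    # Query the local server's OpenAI-compatible chat endpoint
    # (no vendor round-trip, no per-token billing)
    curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "messages": [
            {"role": "user", "content": "Summarize llama.cpp in one sentence."}
          ]
        }'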
For a one-person business, local models are the real infrastructure: while competitors pay API fees, your marginal cost is just electricity.
Gemma 4 + llama.cpp is the optimal stack for local inference, and it's ready for production.