On February 12th, Zhipu released GLM-5, stunning the industry. Ten days later, a technical report followed, offering a first real look inside the model.
What’s interesting isn’t just climbing another leaderboard, but the shift in mindset: no longer comparing parameter sizes, but focusing on system engineering capabilities.
The three core achievements of GLM-5 are quite practical: 1. The model can now carry out complex tasks, not just write a few lines of code; 2. Training efficiency has improved significantly, so large models are no longer purely a money-burning game; 3. It is fully adapted to domestic chips, from the bottom layer up to the inference framework, and this last point is the most critical.
If before it was “China catching up,” now it has begun building its own technical ecosystem.
From “Providing Code” to “Building Systems”
The report introduces a conceptual shift: from Vibe Coding to Agentic Engineering. In the former, you give a prompt and the model returns code; in the latter, you set a goal and the model plans and decomposes it on its own, writes code, calls tools, debugs, and iterates until the entire system is complete.
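The loop described above can be sketched in a few lines. This is a hypothetical illustration of the agentic pattern, not Zhipu's actual implementation; `plan`, `execute`, and `passes` are stand-in callables for model calls and tool invocations.

```python
# Hypothetical sketch of an agentic-engineering loop: the model plans,
# acts through tools, observes results, and iterates until the goal is
# met or a step budget runs out.

def agent_loop(goal, plan, execute, passes, max_steps=8):
    """Drive plan -> act -> check cycles for a single goal.

    `plan`, `execute`, and `passes` are assumed callables standing in
    for model calls and tool invocations.
    """
    history = []
    for _ in range(max_steps):
        action = plan(goal, history)      # decompose the goal given past feedback
        result = execute(action)          # run code / call a tool
        history.append((action, result))  # feed the observation back in
        if passes(goal, result):          # e.g. tests green, task verified
            return result, history
    return None, history
```

The key contrast with prompt-in/code-out usage is the `history` that accumulates across iterations: each attempt conditions the next plan.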
The focus of GLM-5 is no longer just scoring individual tasks, but on:
Context length around 200K tokens (equivalent to hundreds of pages of documents)
Cross-file software engineering tasks
Continuous planning and adjustment over long-term projects
For example, Vending-Bench 2 asks the model to simulate running a vending machine business for a year, then checks the final account balance. GLM-5 ranks first among open-source models, close behind Claude Opus 4.5. This tests long-horizon decision-making, not just Q&A.
The model is beginning to demonstrate “engineering-grade intelligence.”
Sparse Attention: No More Mindless Computation
GLM-5 has 744 billion parameters (with 40 billion active), trained on 285 trillion tokens. Using traditional architecture, the computational cost would explode.
The core innovation is DSA (DeepSeek Sparse Attention). Traditional attention mechanisms “look at everything,” with quadratic complexity; DSA dynamically determines “which tokens are truly important,” computing only the critical parts.
At a context length of 200K, DSA cuts attention computation by a factor of 1.5 to 2.
And it does so without loss. Other efficient-attention methods often sacrifice accuracy, but DSA preserves performance through continued pretraining and smooth transitions, with no degradation.
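The core idea can be shown with a toy top-k selection. This is a deliberate simplification of the DSA concept (a cheap scorer picks the important keys, and full attention runs only over them), not the paper's algorithm:

```python
# Toy top-k sparse attention for a single query: score every key
# cheaply, keep only the k most relevant, and softmax over those,
# so the expensive step is O(k) instead of O(n).
import math

def sparse_attention(q, keys, values, k=2):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [dot(q, key) for key in keys]   # cheap relevance scores
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over survivors only
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top))
            for d in range(dim)]
```

At 200K context the savings compound: full attention touches 200,000 keys per query, while a sparse variant touches only the selected handful.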
The results are:
Same compute → longer context
Same cost → higher inference capability
Same hardware → larger models
For China, efficiency innovation is far more important than simply stacking more compute.
Reconstruction of Reinforcement Learning Architecture
GLM-5’s RL system has been thoroughly overhauled.
Generation and training are decoupled: rollout workers produce trajectories while training runs asynchronously on a separate system. Previously, a training step had to wait for the slowest task to finish; now whichever trajectory finishes first is trained on first, greatly increasing throughput. This matters most for long-horizon agent tasks.
The asynchronous agent RL algorithm targets the tasks that run for hours in real software engineering: it introduces mechanisms that keep learning stable in complex environments, so training does not collapse as the policy shifts during long rollouts.
In simple terms, it solves the problem of “how to enable large models to continuously self-improve on real tasks.”
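The decoupling idea can be demonstrated with a minimal producer/consumer sketch. This illustrates the scheduling pattern only, not GLM-5's actual training system; the sleep durations stand in for episodes of very different lengths:

```python
# Minimal sketch of decoupled rollout/training: workers push finished
# trajectories onto a queue, and the trainer consumes them in
# completion order instead of waiting for the slowest episode.
import queue
import threading
import time

def rollout(task_id, duration, out_q):
    time.sleep(duration)   # stands in for an hours-long episode
    out_q.put(task_id)     # hand the finished trajectory to the trainer

def train_async(durations):
    out_q = queue.Queue()
    workers = [threading.Thread(target=rollout, args=(i, d, out_q))
               for i, d in enumerate(durations)]
    for w in workers:
        w.start()
    order = [out_q.get() for _ in durations]  # train on whoever finishes first
    for w in workers:
        w.join()
    return order
```

With synchronous batching, total wall time is set by the slowest rollout; with the queue, the trainer starts consuming as soon as the fastest one lands.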
The truly critical step: adapting to domestic computing power
The most important part of the report for China’s AI development is here.
GLM-5 is natively compatible with domestic GPU ecosystems, with support already in place for Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlun Chip, Tiannanshi, and Suiyuan.
It’s not just “able to run,” but involves:
Optimized KV cache scheduling
Communication mechanism adaptation
Hybrid precision training matching
INT4 quantization-aware training alignment
Distributed parallel strategy reconstruction
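Of the items above, INT4 quantization-aware training is the easiest to illustrate. The sketch below is a generic symmetric fake-quantization round-trip, the standard building block of QAT, and not the report's specific recipe:

```python
# Generic symmetric INT4 fake quantization: quantize floats to 4-bit
# signed integers and immediately dequantize. QAT runs this in the
# forward pass so the model learns to tolerate INT4 rounding error.

def fake_quant_int4(xs):
    qmax = 7                                  # signed int4 range is [-8, 7]
    scale = max(abs(x) for x in xs) / qmax or 1.0
    q = [max(-8, min(7, round(x / scale))) for x in xs]  # quantize + clamp
    return [v * scale for v in q]             # dequantize back to floats
```

The point of aligning this with the inference kernel is that training and deployment then see the same rounding behavior, which is exactly the "quantization kernel alignment" the report emphasizes.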
Many of the obstacles in domestic chip ecosystems are software problems, not compute problems.
The significance of GLM-5 lies in its system-level adaptation to multiple domestic hardware platforms, rather than designing around a single overseas architecture.
This is a qualitative leap — Chinese large models are beginning to optimize engineering around native hardware ecosystems, no longer passively migrating.
Thanks to this aggressive software and hardware co-optimization, GLM-5 running on a single domestic compute node can now rival a cluster of two mainstream international GPUs, and in long-sequence scenarios its deployment cost drops by over 50%.
A closed loop of hardware and software is forming
Breaking down GLM-5’s technical pathway reveals a complete closed loop:
Model architecture innovation (DSA) → Training efficiency improvements (asynchronous RL) → Memory and communication compression (ZeRO, activation offloading) → Low-precision alignment (INT4 QAT) → Deep adaptation to domestic chips
This forms a full Chinese AI engineering chain.
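One link in that chain, ZeRO-style optimizer-state sharding, can be sanity-checked with back-of-the-envelope arithmetic. The byte counts below are the conventional mixed-precision Adam assumptions (fp16 weights and gradients, fp32 optimizer states), not GLM-5's real configuration:

```python
# Rough per-GPU memory for ZeRO-1-style training: weights and
# gradients stay replicated, while Adam optimizer states (fp32 param
# copy + two moments = 12 bytes/param) are sharded across N devices.

def per_gpu_gib(params_b, n_gpus, bytes_weights=2, bytes_grads=2,
                bytes_opt=12):
    p = params_b * 1e9                             # params in units of billions
    unsharded = p * (bytes_weights + bytes_grads)  # replicated on every GPU
    sharded = p * bytes_opt / n_gpus               # optimizer states split N ways
    return (unsharded + sharded) / 2**30
```

The arithmetic makes the motivation obvious: the optimizer-state term dominates at small device counts and shrinks linearly as the cluster grows, which is why sharding plus activation offloading is what makes very large models trainable at all.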
China’s AI advantage, previously at the application layer, is now expanding into architecture innovation, algorithm engineering, training systems, chip adaptation, and inference frameworks.
The true significance of this technical report isn’t just benchmark scores but the first demonstration of China’s AI competitiveness through “systemic capability.”
From Showcasing to Maturity
The GLM-5 report doesn’t overly emphasize “how much better we are,” but details the training process, algorithm choices, engineering trade-offs, and ablation experiments. This itself reflects maturity.
When a model begins discussing GPU utilization, tail latency, KV cache reuse, quantization kernel alignment, and catastrophic forgetting control — it’s no longer just showcasing capability, but building industrial-grade systems.
For China, GLM-5 is more like a declaration: we can build large models, develop our own hardware adaptation, and connect the two.
This is the real leap.
Zhipu releases GLM-5 technical details: engineering-grade intelligence, compatible with domestic computing power