Zhipu Releases GLM-5.1 High-Speed API, Setting Global Speed Record at 400 Tokens/s

According to monitoring by Dongcha Beating, Zhipu has launched a high-speed GLM-5.1 API for select enterprise clients. It reaches a model output speed of 400 tokens/s, a new global record for end-to-end output speed among official large-model APIs. The high-speed version retains the capabilities of the original flagship model and is powered by a high-performance inference engine developed jointly by Zhipu and the TileRT team.
The engine completely restructures the GPU's execution scheduling. During the compilation phase, the model is statically arranged into a persistent Engine Kernel that stays resident on the GPU. In single-card inference, computation, asynchronous I/O, and communication are all decomposed into tile-level micro-tasks, and the kernel is launched only once. Intermediate results are passed between operators directly through registers and shared caches, eliminating the latency that traditional inference incurs from frequent kernel launches and memory reads and writes.
In multi-card setups, TileRT extends this specialized-parallelism approach across an 8-card NVL topology, turning originally homogeneous GPU nodes into heterogeneous Workers responsible for different tasks. For the attention-layer computations of GLM-5.1, the system assigns GPU 0 to run a sparse index Worker dedicated to sparse index construction and routing decisions, while GPUs 1 to 7 run MLA Workers that handle the computation-intensive phases. Communication is fully integrated into the tile-level task pipeline, so computation and inter-card communication overlap deeply.
The high-speed service is currently available to select enterprise clients on the Zhipu MaaS platform.
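To make the figures above concrete, here is a minimal Python sketch, not TileRT code. It reproduces the GPU role split described in the report (GPU 0 as the sparse index Worker, GPUs 1 to 7 as MLA Workers) and converts 400 tokens/s into a per-token time budget. The names `Worker`, `assign_roles`, `per_token_budget_ms`, and `launch_overhead_us` are hypothetical, and the launch count and per-launch cost in the usage note are made-up placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Worker:
    """Hypothetical record pairing a GPU index with its specialized role."""
    gpu_id: int
    role: str  # "sparse_index" or "mla"


def assign_roles(num_gpus: int = 8) -> List[Worker]:
    """Mirror the split in the report: GPU 0 builds the sparse index and
    makes routing decisions; every other GPU runs an MLA Worker."""
    if num_gpus < 2:
        raise ValueError("need at least one sparse index GPU and one MLA GPU")
    return [
        Worker(gpu_id=i, role="sparse_index" if i == 0 else "mla")
        for i in range(num_gpus)
    ]


def per_token_budget_ms(tokens_per_second: float) -> float:
    """Average wall-clock time available per output token, in milliseconds."""
    return 1000.0 / tokens_per_second


def launch_overhead_us(num_launches: int, cost_per_launch_us: float) -> float:
    """Total launch overhead under a simple linear cost model (an assumption)."""
    return num_launches * cost_per_launch_us


if __name__ == "__main__":
    for worker in assign_roles():
        print(worker)
    print("budget per token (ms):", per_token_budget_ms(400))
```

At 400 tokens/s the average budget is 2.5 ms per token. Under the placeholder assumption of 500 separate kernel launches per token at 5 µs each, `launch_overhead_us(500, 5.0)` gives 2,500 µs, which would consume that entire budget; a single persistent kernel launched once avoids paying this cost per token. The real launch counts and costs for GLM-5.1 are not given in the report.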
Going forward, the technology will be further optimized for FP8 inference and ultra-long-context production environments, providing more deterministic performance for latency-sensitive scenarios such as AI programming, real-time interaction, and real-time voice.