Zhipu publishes postmortem on the GLM-5 "Gibberish Gate": hundreds of millions of Coding Agent calls per day, and two race condition bugs hidden in the KV Cache

According to Beating Monitoring, Zhipu has published a postmortem on the garbled text, repetition, and rare-character anomalies that affected the GLM-5 series models in Coding Agent scenarios. Since March, some users had reported that the anomalies triggered only in high-concurrency, long-context (averaging over 70K tokens) Coding Agent tasks and could not be reproduced in standard inference environments.

Zhipu states that its inference system handles hundreds of millions of Coding Agent calls daily.

After several weeks of investigation, the team identified two independent underlying race condition bugs. The first occurred in the PD separation architecture (a deployment scheme that places the prefill and decode phases on different nodes): when the decode side timed out and aborted a request, it reclaimed that request's KV Cache (which stores computed attention states to avoid recomputation) while the prefill side's RDMA write was still in flight. The freed GPU memory was then assigned to a new request, and the late write overwrote the new request's data with the old request's. The fix adds explicit synchronization before reclamation, ensuring the write completes before the cache is released. After deployment, the anomaly rate dropped from over 10 per 10,000 requests to below 3 per 10,000.
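The failure mode can be illustrated with a minimal sketch (all names here are hypothetical stand-ins, not Zhipu's actual serving code): the decode side must wait for the prefill side's in-flight write to land before returning KV Cache blocks to the allocator, otherwise the allocator may hand the same memory to a new request while the old write is still pending.

```python
import threading

class KVBlockAllocator:
    """Hypothetical KV Cache block pool shared by prefill and decode."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.lock = threading.Lock()

    def allocate(self):
        with self.lock:
            return self.free_blocks.pop()

    def release(self, block_id):
        with self.lock:
            self.free_blocks.append(block_id)

class TransferHandle:
    """Tracks an in-flight prefill-to-decode RDMA write (hypothetical)."""
    def __init__(self):
        self.done = threading.Event()

    def mark_complete(self):
        # Called by the prefill side once the write has landed.
        self.done.set()

def abort_request_buggy(allocator, block_id, transfer):
    # BUG: blocks return to the pool while the RDMA write may still be
    # in flight; a new request can be handed the same memory, and the
    # late write then corrupts its KV Cache.
    allocator.release(block_id)

def abort_request_fixed(allocator, block_id, transfer):
    # FIX: explicit synchronization -- wait until the prefill side's
    # write has completed before releasing the blocks.
    transfer.done.wait()
    allocator.release(block_id)
```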

The second bug was in HiCache (a multi-level KV Cache): when data was asynchronously swapped from CPU memory into the GPU cache, a synchronization point was missing between the loading and computation pipelines, so the computation side could start reading before the load had finished. After this was fixed, the remaining anomalies disappeared entirely, and the patch was submitted to the SGLang community (PR #22811).
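The pattern behind this fix is a classic GPU pipelining hazard. A minimal PyTorch sketch (using CUDA streams as stand-ins for HiCache's loading and computation pipelines, not the actual SGLang patch) shows the missing synchronization point:

```python
import torch

assert torch.cuda.is_available()

load_stream = torch.cuda.Stream()      # asynchronously swaps KV data in
compute_stream = torch.cuda.Stream()   # runs attention over the cache

host_kv = torch.randn(1024, 128, pin_memory=True)  # KV data in CPU memory
gpu_kv = torch.empty(1024, 128, device="cuda")     # destination cache slot

load_done = torch.cuda.Event()

with torch.cuda.stream(load_stream):
    gpu_kv.copy_(host_kv, non_blocking=True)  # async host-to-device copy
    load_done.record(load_stream)             # mark the end of loading

with torch.cuda.stream(compute_stream):
    # FIX: without this wait, the compute stream may read gpu_kv before
    # the copy has finished -- exactly the missing sync point in the bug.
    compute_stream.wait_event(load_done)
    out = gpu_kv.sum()  # stands in for attention over the loaded cache

torch.cuda.synchronize()
```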

During the investigation, the team made an unexpected discovery: the acceptance rate of speculative sampling (a technique that drafts tokens with a small model and verifies them with the large model to accelerate inference) can serve as an anomaly detection signal. During garbled output, draft tokens are almost entirely rejected; during degenerate repetition, the acceptance rate is abnormally high. Based on this, the team built online monitoring: when a threshold is crossed, generation is automatically aborted and retried.
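A monitor of this kind is simple to sketch. Assuming per-step draft-acceptance counts are available from the speculative decoding engine, a rolling acceptance rate with two thresholds (the window size and threshold values below are illustrative, not Zhipu's) flags both failure modes:

```python
from collections import deque

class AcceptanceRateMonitor:
    """Flags garbled output (acceptance rate collapses toward 0) and
    degenerate repetition (rate abnormally high) from speculative
    decoding statistics. All parameters are hypothetical."""

    def __init__(self, window=64, low=0.05, high=0.98):
        self.steps = deque(maxlen=window)  # (accepted, drafted) per step
        self.low, self.high = low, high

    def update(self, num_accepted, num_drafted):
        self.steps.append((num_accepted, num_drafted))

    def status(self):
        accepted = sum(a for a, _ in self.steps)
        drafted = sum(d for _, d in self.steps)
        if drafted < 32:                  # not enough evidence yet
            return "ok"
        rate = accepted / drafted
        if rate < self.low:
            return "abort_and_retry"      # near-total rejection -> gibberish
        if rate > self.high:
            return "abort_and_retry"      # near-total acceptance -> repetition
        return "ok"
```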

After fixing the bugs, the team also optimized a bottleneck with LayerSplit, a layered KV Cache storage scheme in which each GPU stores only part of the layers' KV Cache instead of a full copy, with the GPUs coordinating via broadcast during computation. At a 90% cache hit rate, across request lengths from 40K to 120K tokens, throughput improved by 10% to 132%, with longer contexts benefiting the most.
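The layered-storage idea can be sketched with torch.distributed (the sharding scheme and names below are an assumption inferred from the description, not Zhipu's implementation): each rank owns the KV Cache for a contiguous slice of layers and broadcasts it to the other ranks when that layer is computed, so no GPU has to hold the full set.

```python
import torch
import torch.distributed as dist

NUM_LAYERS = 32  # illustrative model depth

def owner_of(layer, world_size):
    # Each rank stores the KV Cache for a contiguous slice of layers.
    layers_per_rank = (NUM_LAYERS + world_size - 1) // world_size
    return layer // layers_per_rank

def get_layer_kv(layer, local_cache, shape, rank, world_size):
    """Fetch one layer's cached KV: read it locally if this rank owns
    the layer, otherwise receive it via broadcast from the owner.
    Assumes dist.init_process_group(...) has already been called."""
    src = owner_of(layer, world_size)
    if rank == src:
        kv = local_cache[layer]                 # stored on this GPU
    else:
        kv = torch.empty(shape, device="cuda")  # receive buffer
    dist.broadcast(kv, src=src)  # owner sends; all other ranks receive
    return kv
```

The trade-off is bandwidth for capacity: each layer's KV is broadcast once per step, but per-GPU memory drops roughly by the world size, which is what allows much longer requests to stay cached.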
