Grok 4.20 Beta achieved a 97% accuracy rate in the τ²-Bench evaluation, ranking second.

ME News, April 5 (UTC+8): Grok 4.20 Beta recently achieved 97% accuracy in the τ²-Bench evaluation, ranking second. τ²-Bench builds on Sierra's original τ-bench framework and is known for its rigor. The evaluation tests not only whether an AI can answer questions, but also whether agents can successfully complete navigation tasks. (Source: InfoQ)
