Chinese models previously swept the top ten on SWE-bench and were mocked for "benchmark gaming"; this time, on SWE-rebench, they occupy four seats.

BlockBeatNews

According to 1M AI News monitoring, SWE-rebench is a real-time benchmark that extracts fresh software engineering tasks (issues + PRs) from GitHub every month, so models cannot be optimized in advance for specific tasks. Maintainer Ibragim announced a leaderboard update on March 23, removing the previous sample demonstrations and the 80-step operation limit, and adding auxiliary evaluation tasks.

Latest top ten rankings:

  1. Claude Opus 4.6: 65.3%
  2. GPT-5.2 medium: 64.4%
  3. GLM-5: 62.8%
  4. GPT-5.4 medium: 62.8%
  5. Gemini 3.1 Pro Preview: 62.3%
  6. DeepSeek-V3.2: 60.9%
  7. Claude Sonnet 4.6: 60.7%
  8. Claude Sonnet 4.5: 60.0%
  9. Qwen3.5-397B-A17B: 59.9%
  10. Step-3.5-Flash: 59.6%

Zhipu AI’s open-source model GLM-5 (MIT License) ranks third at 62.8%, the highest score among open-source models on the list. Four Chinese models appear in the top ten, the others being DeepSeek-V3.2 (6th), Alibaba’s Tongyi Qianwen Qwen3.5-397B-A17B (9th), and Step-3.5-Flash (10th). Li Zixuan, global head at Zhipu’s Z.ai, noted that in the previous SWE-rebench update, Chinese models all fell outside the top ten and were criticized for “benchmaxing” (score boosting).
