Former Qianwen lead Lin Junyang publishes first long-form article since leaving: the AI industry is shifting from "training models" to "training agents"

BlockBeats News

According to 1M AI News monitoring, Lin Junyang, former chief technology officer of Alibaba's Tongyi Qianwen, has published a long article on X systematically laying out his view that the AI industry is shifting from "reasoning-style thinking" to "agentic thinking." It is his first public technical piece since he left the Qianwen team in early March.

Lin Junyang argues that the defining problem of the first half of 2025 was reasoning-style thinking: how to make models spend more compute at inference time, how to train them with stronger reward signals, and how to control reasoning depth. The next stage's answer, he says, is agentic thinking: models will no longer just "think longer" but "think for action," continuously adjusting their plans through interaction with the environment.
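In pseudocode, the loop he describes looks roughly like the sketch below: short bouts of reasoning interleaved with tool calls, each observation revising the plan. Every name here (the model's plan/next_action/revise_plan methods, the environment's execute) is a hypothetical placeholder for illustration, not any real API.

```python
# Hypothetical sketch of "thinking for action": reasoning is interleaved with
# tool use, and each observation updates the plan, instead of one long
# internal monologue followed by a single answer.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str       # brief reasoning before acting
    action: str        # tool call chosen from the current plan
    observation: str   # environment feedback that informs the next step

def run_agent(model, env, task: str, max_steps: int = 10) -> list[Step]:
    trajectory: list[Step] = []
    plan = model.plan(task)                          # cheap initial plan
    for _ in range(max_steps):
        thought, action = model.next_action(task, plan, trajectory)
        if action == "FINISH":                       # model decides it is done
            break
        observation = env.execute(action)            # browser, terminal, API, ...
        trajectory.append(Step(thought, action, observation))
        plan = model.revise_plan(plan, observation)  # adjust the plan, don't restart
    return trajectory
```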

In the article, he candidly reviews the Qwen team's technical choices. Qwen3 tried to unify reasoning and instruction modes within a single model, with an adjustable reasoning budget. In practice, however, the data distributions and behavioral goals of the two modes diverge sharply: instruction mode aims for concision, low latency, and format compliance, while reasoning mode wants to spend more tokens on hard problems and preserve intermediate reasoning structure. Unless data curation is extremely fine-grained, the result tends to be mediocre at both ends. The Qwen 2507 series therefore ultimately shipped separate Instruct and Thinking versions (including 30B and 235B variants), each optimized independently. Anthropic took the opposite approach: Claude 3.7 Sonnet treats reasoning as an integrated capability rather than a standalone model, letting users set their own thinking budgets.
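For reference, this is how the two designs surface to users, assuming the interfaces documented at release (model names and flags may have changed since):

```python
# Qwen3's unified design: one model, a per-request thinking switch exposed
# through the transformers chat template (per the Qwen3 model card).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Why is the sky blue?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # False yields a plain instruct-style reply
)

# Anthropic's integrated design: Claude 3.7 Sonnet with a user-set token
# budget for reasoning (per Anthropic's extended-thinking documentation).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
```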

Lin Junyang argues that the infrastructure for agent reinforcement learning is harder to build than for conventional reasoning RL. Rollouts in reasoning RL are usually self-contained trajectories that a static verifier can score; agent RL instead requires the model to be embedded in a complete toolchain (browsers, terminals, sandboxes, APIs, memory systems). Training and inference must be decoupled, or rollout throughput collapses. He stresses that environment design matters as much as model architecture: "Environment construction is shifting from a side project to a true entrepreneurial category."
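The decoupling he describes is essentially a producer/consumer split, sketched below with the rollout workers and the learner assumed to run in separate threads or processes. Everything here (policy, env_factory, publish_weights) is a hypothetical placeholder, not any particular RL framework.

```python
# Hypothetical sketch of decoupled agent RL: rollout workers drive the full
# toolchain at their own (slow, bursty) pace and stream finished trajectories
# to the learner, instead of lock-stepping generation with gradient updates.
import queue

trajectories: queue.Queue = queue.Queue(maxsize=1024)

def rollout_worker(policy_snapshot, env_factory) -> None:
    env = env_factory()  # sandboxed toolchain: browser, terminal, APIs, memory
    while True:
        obs, done, traj = env.reset(), False, []
        while not done:
            action = policy_snapshot.act(obs)     # served by an inference node
            obs, reward, done = env.step(action)  # real tool call, may be slow
            traj.append((obs, action, reward))
        trajectories.put(traj)                    # hand off; never block training

def learner(policy, batch_size: int = 32) -> None:
    while True:
        batch = [trajectories.get() for _ in range(batch_size)]
        policy.update(batch)         # RL update on complete trajectories
        policy.publish_weights()     # periodically refresh the rollout snapshots
```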

He predicts that agentic thinking will become the dominant form of reasoning, potentially replacing the long, isolated internal monologues of today's static reasoning. The biggest risk, though, is reward hacking: once models have access to real tools, they may learn during RL training to look up answers directly, exploit future information stored in repositories, or find shortcuts that bypass the task. The article concludes that competitive advantage will shift from better RL algorithms to better environment design, tighter training-inference integration, and the systems engineering needed for multi-agent collaboration.
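One concrete defense against the "future information" leak he mentions: when training tasks are mined from a repository's history, the sandbox can pin the checkout to the task's cutoff date so the agent cannot read the eventual fix. The snippet below is illustrative of that idea; the git commands are real, but the cutoff-based setup itself is an assumption, not a method described in the article.

```python
# Environment-side guard against one form of reward hacking: hide any commit
# made after a task's cutoff so the agent cannot read the future solution.
import subprocess

def checkout_at_cutoff(repo_path: str, cutoff_iso: str) -> None:
    """Pin the sandbox's working tree to the last commit before the cutoff."""
    sha = subprocess.check_output(
        ["git", "rev-list", "-1", f"--before={cutoff_iso}", "HEAD"],
        cwd=repo_path, text=True,
    ).strip()
    subprocess.check_call(["git", "checkout", "--detach", sha], cwd=repo_path)
```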
