Pioneering work in autonomous AI agents of the GPT-4 era. As one of the earliest implementations of a truly autonomous agent (May 2023), it established a new paradigm for agent self-learning. Through its breakthrough automatic curriculum and skill-library accumulation mechanism, it achieved continuous agent growth in an open world for the first time and is widely regarded as a milestone in the field of autonomous agents. Its exploration capability and skill-acquisition efficiency in Minecraft remain a reference point for the field.
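A minimal sketch of the automatic-curriculum plus skill-library loop described above, assuming this entry refers to the Voyager agent. `propose_task`, `write_skill_code`, and `run_in_env` are hypothetical placeholders for the LLM calls and the Minecraft execution environment, not the paper's actual API.

```python
# Sketch of a curriculum + skill-library loop.
# propose_task, write_skill_code, and run_in_env are hypothetical stand-ins.

def propose_task(completed, failed):
    """Automatic curriculum: pick the next task given past successes/failures."""
    return f"task_{len(completed) + len(failed)}"  # placeholder

def write_skill_code(task, skill_library):
    """Ask an LLM to write a program for the task, reusing retrieved skills."""
    return f"def {task}(): pass"  # placeholder program

def run_in_env(program):
    """Execute the program in the environment; return (success, feedback)."""
    return True, "ok"  # placeholder result

def agent_loop(iterations=5):
    skill_library = {}            # task description -> working program
    completed, failed = [], []
    for _ in range(iterations):
        task = propose_task(completed, failed)
        program = write_skill_code(task, skill_library)
        success, feedback = run_in_env(program)
        if success:
            skill_library[task] = program    # accumulate a reusable skill
            completed.append(task)
        else:
            failed.append((task, feedback))  # the curriculum sees the failure
    return skill_library

if __name__ == "__main__":
    print(agent_loop())
```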
The GPT moment in robotics. As the first open-source general-purpose humanoid robot foundation model, GR00T N1 opens a new era of robot foundation models. Its innovative dual-system architecture (a vision-language model paired with a diffusion transformer) achieves end-to-end generation of fluid motion from vision and language for the first time, hailed as robotics' GPT breakthrough. Its strong performance on real robot platforms marks the true arrival of the era of general-purpose robots.
Agent breakthrough in medical AI. As the first therapeutic reasoning agent integrating 211 medical tools, TxAgent opens a new era of precision medicine. Its multi-step reasoning and real-time biomedical knowledge retrieval enable AI, for the first time, to reason in a way that resembles a clinical physician's thought process. It reaches 92.1% accuracy in drug interaction analysis and personalized treatment recommendation, surpassing GPT-4o, and is hailed as a milestone in precision medicine.
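A minimal sketch of the multi-step reasoning loop with tool retrieval that the entry describes. The tool registry, `retrieve_tools`, and `llm_decide` are hypothetical placeholders, not TxAgent's actual tool universe or API.

```python
# Therapeutic-reasoning agent sketch: retrieve relevant tools, let the model
# decide which one to call, and fold results back into the reasoning context.

TOOLS = {
    "drug_interactions": lambda drug: f"known interactions for {drug}",
    "patient_genetics":  lambda pid:  f"pharmacogenomic profile of {pid}",
    "trial_lookup":      lambda cond: f"active trials for {cond}",
}

def retrieve_tools(query):
    """Keyword-based stand-in for semantic tool retrieval."""
    return [name for name in TOOLS if any(w in query.lower() for w in name.split("_"))]

def llm_decide(context, tool_names):
    """Placeholder for the LLM step: pick a tool call or produce a final answer."""
    if tool_names and "RESULT" not in context:
        return ("call", tool_names[0], "warfarin")
    return ("answer", "Recommend dose adjustment; see retrieved evidence.", None)

def treatment_agent(question, max_steps=5):
    context = question
    for _ in range(max_steps):
        tools = retrieve_tools(context)
        action, payload, arg = llm_decide(context, tools)
        if action == "answer":
            return payload
        result = TOOLS[payload](arg)                  # execute the chosen tool
        context += f"\nRESULT[{payload}]: {result}"   # multi-step reasoning trace
    return "No recommendation reached."

print(treatment_agent("Check drug interactions for a patient on warfarin"))
```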
The GPT moment in life sciences. As the first biological foundation model trained on 9.3 trillion DNA base pairs, Evo 2 opens a new era of genome design. Its breakthrough million-token context window lets AI model whole genomes with precision for the first time. The model autonomously learns biological features and accurately predicts the impact of genetic variations, and is hailed as a milestone in bio-intelligence. This breakthrough marks the official start of the AI-driven era of life-science design.
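A minimal sketch of variant-effect prediction with a genomic language model, in the spirit of the capability described above: compare sequence log-likelihoods for the reference and the mutated sequence. `sequence_log_likelihood` is a hypothetical placeholder (a toy Markov score here), not the real model's scoring API.

```python
import math

def sequence_log_likelihood(dna: str) -> float:
    """Hypothetical stand-in for a genomic foundation model's scoring function.
    Here: a toy 2nd-order score so the example runs end to end."""
    score = 0.0
    for i in range(2, len(dna)):
        p = 0.4 if dna[i] == dna[i - 1] else 0.2  # toy probability model
        score += math.log(p)
    return score

def variant_effect(reference: str, position: int, alt_base: str) -> float:
    """Delta log-likelihood of the variant vs. the reference sequence.
    Strongly negative deltas suggest the variant disrupts learned patterns."""
    mutated = reference[:position] + alt_base + reference[position + 1:]
    return sequence_log_likelihood(mutated) - sequence_log_likelihood(reference)

ref = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(f"delta log-likelihood: {variant_effect(ref, 10, 'A'):+.3f}")
```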
Digital twin breakthrough in physical AI. As the first open-source world foundation model platform, Cosmos has pioneered a new paradigm for physical AI training. Through innovative video processing pipelines and pre-trained world models, it achieves complete digital twin training of AI systems for the first time. This breakthrough allows developers to customize dedicated world models for any physical AI system, hailed as a milestone in the field of digital twins, marking the entry of physical AI development into the era of industrial-grade infrastructure.
Milestone breakthrough in reasoning capability. Through the GRPO training algorithm (Group Relative Policy Optimization), DeepSeek-R1 demonstrates for the first time at scale that strong reasoning can be trained primarily through reinforcement learning, reaching performance comparable to OpenAI's o1 on math, code, and reasoning benchmarks. Its companion model DeepSeek-R1-Zero goes further, acquiring powerful reasoning behaviors through pure RL with no supervised fine-tuning data at all, with accuracy on benchmarks such as AIME and MATH rising dramatically over the base model. This largely automated training recipe not only reduces reliance on costly human annotation but also opens a new era for AI reasoning capability.
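A minimal sketch of GRPO's core idea, group-relative advantage estimation: sample a group of responses per prompt, score them with a rule-based reward, and normalize each reward against the group mean and standard deviation instead of using a learned value model. The reward function and responses here are toy placeholders.

```python
from statistics import mean, pstdev

def reward(response: str, reference_answer: str) -> float:
    """Rule-based reward placeholder: 1 if the final answer matches, else 0."""
    return 1.0 if response.strip().endswith(reference_answer) else 0.0

def group_relative_advantages(responses, reference_answer):
    """GRPO-style advantages: normalize each response's reward by the
    group mean and standard deviation (no learned critic)."""
    rewards = [reward(r, reference_answer) for r in responses]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# One prompt, a sampled group of 4 candidate chains of thought (toy strings).
group = [
    "... therefore the answer is 42",
    "... so the result must be 41",
    "... hence the answer is 42",
    "... I conclude the answer is 40",
]
print(group_relative_advantages(group, "42"))
```

In the full algorithm these advantages weight a PPO-style clipped policy objective with a KL penalty toward a reference policy.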
The most comprehensive prompt engineering resource. Master 41 prompt engineering techniques in one paper, from basic to advanced, from principles to practices, from unimodal to multimodal. Highly practical: Each technique comes with detailed analysis of pros and cons, applicable scenarios, code examples, and dataset recommendations, helping you quickly find the most suitable solution. Considered the most time-saving Prompt Engineering learning guide of 2024.
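To make a couple of the catalogued techniques concrete, here is a sketch of few-shot prompting and zero-shot chain-of-thought prompt construction; `call_llm` is a hypothetical placeholder for whatever model API you use.

```python
# Two of the most common prompting techniques, expressed as plain templates.

def few_shot_prompt(examples, question):
    """Few-shot prompting: prepend worked input/output pairs."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\n\nQ: {question}\nA:"

def zero_shot_cot_prompt(question):
    """Zero-shot chain-of-thought: append a reasoning trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."

examples = [("What is 2 + 2?", "4"), ("What is 7 * 6?", "42")]
print(few_shot_prompt(examples, "What is 12 * 12?"))
print(zero_shot_cot_prompt("If I have 3 boxes of 8 apples, how many apples?"))
# call_llm(prompt) would go here -- hypothetical model API.
```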
The tool to eliminate manual prompt debugging. The first systematic summary of Automatic Prompt Optimization (APO) techniques, proposing a revolutionary five-step optimization framework. Comprehensive and practical framework: From seed prompt initialization, evaluation feedback, candidate generation to selection iteration, each link provides best practices and code implementation. Let AI automatically help you find the optimal prompts, greatly improving development efficiency.
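A minimal sketch of the five-step optimization loop the entry describes (seed initialization, evaluation feedback, candidate generation, selection, iteration). `evaluate_prompt` and `mutate_prompt` are hypothetical placeholders for a dev-set metric and an LLM rewriter.

```python
import random

def evaluate_prompt(prompt: str) -> float:
    """Placeholder evaluation: normally scores the prompt on a small dev set.
    Here: a toy heuristic so the loop runs end to end."""
    words = prompt.split()
    return len(set(words)) / (len(words) + 1)

def mutate_prompt(prompt: str) -> str:
    """Placeholder candidate generation: normally an LLM rewrites the prompt
    using evaluation feedback; here we append a random instruction."""
    tweaks = ["Answer step by step.", "Be concise.", "Cite your reasoning."]
    return f"{prompt} {random.choice(tweaks)}"

def optimize(seed_prompt: str, rounds: int = 5, beam: int = 4) -> str:
    best = seed_prompt                                         # 1. seed initialization
    for _ in range(rounds):
        score = evaluate_prompt(best)                          # 2. evaluation feedback
        candidates = [mutate_prompt(best) for _ in range(beam)]  # 3. candidate generation
        top_score, top_prompt = max((evaluate_prompt(c), c) for c in candidates)  # 4. selection
        if top_score > score:                                  # 5. iterate if improved
            best = top_prompt
    return best

print(optimize("Classify the sentiment of the following review:"))
```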
Complete guide to RAG technology. From basic RAG to advanced RAG (multimodal, memory-enhanced, Agentic), it systematically sorts out the entire process of knowledge acquisition, integration, and utilization. Outstanding practical value: In-depth analysis of retrieval mechanisms, generation processes, and optimal integration solutions between the two, with rich evaluation benchmarks and dataset recommendations. Master how to equip AI models with reliable external knowledge acquisition capabilities in one paper.
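A minimal sketch of the retrieve-then-generate flow described above, using token-overlap scoring in place of a real embedding index; `call_llm` is a hypothetical placeholder.

```python
# Basic RAG: retrieve top-k relevant passages, then condition generation on them.

DOCS = [
    "RAG augments a language model with passages retrieved from an external corpus.",
    "Vector databases index document embeddings for fast similarity search.",
    "Agentic RAG adds planning and reflection steps around retrieval.",
]

def retrieve(query: str, k: int = 2):
    """Token-overlap retrieval (stand-in for embedding similarity search)."""
    q = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

def rag_answer(query: str) -> str:
    prompt = build_prompt(query, retrieve(query))
    return f"[would call LLM with]:\n{prompt}"   # call_llm(prompt) -- hypothetical

print(rag_answer("How does RAG give a model external knowledge?"))
```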
The agent revolution in RAG. The first systematic discussion of how to integrate autonomous agents into the RAG process, breaking through the limitations of traditional RAG static workflows. Comprehensive architecture design: Detailed introduction to single-agent, multi-agent, and hierarchical architecture implementation schemes, as well as practical applications in medicine, finance, education, and other fields. Master how to build a new generation of RAG systems with reflection, planning, and collaboration capabilities in one paper.
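A minimal single-agent sketch of the agentic pattern: the agent decides whether to retrieve again, reflect, or answer, instead of following one fixed retrieve-then-generate pass. The knowledge base and `agent_policy` are hypothetical placeholders for a retriever and an LLM controller.

```python
# Agentic RAG sketch: loop over (decide -> retrieve/reflect -> answer).

KB = {
    "metformin": "Metformin is a first-line therapy for type 2 diabetes.",
    "interactions": "Known interactions: metformin requires caution with iodinated contrast agents.",
}

def retrieve(query: str) -> str:
    return next((v for k, v in KB.items() if k in query.lower()), "no passage found")

def agent_policy(question, evidence, step):
    """Decide the next action. A real system would ask an LLM to choose."""
    if step == 0:
        return ("retrieve", question)             # plan: gather initial evidence
    if "interaction" not in " ".join(evidence).lower():
        return ("retrieve", "interactions")       # reflect: evidence is incomplete
    return ("answer", "Metformin is appropriate; watch contrast-agent interactions.")

def agentic_rag(question, max_steps=4):
    evidence = []
    for step in range(max_steps):
        action, payload = agent_policy(question, evidence, step)
        if action == "answer":
            return payload, evidence
        evidence.append(retrieve(payload))        # tool call: retrieval
    return "unable to answer", evidence

print(agentic_rag("Is metformin suitable for this type 2 diabetes patient?"))
```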
Authoritative guide to agent evaluation. The first comprehensive review of LLM agent evaluation methods, covering a complete evaluation system from basic capabilities (planning, tool use, self-reflection, memory) to professional domains (web, software engineering, science, dialogue). Complete evaluation framework: In-depth analysis of the advantages and disadvantages of existing benchmarks, revealing evaluation trends (continuous updates, real scenarios), and pointing out key challenges (cost-effectiveness, security, robustness). Helps you quickly find the most suitable agent evaluation solution.
Guide to the AI revolution in software engineering. The first systematic comparison of LLMs and agents in six core software engineering tasks: requirements engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. Outstanding practical value: In-depth analysis of benchmarks, evaluation metrics, and best practices for each task, revealing LLM limitations (context length, hallucinations, tool use) and agent advantages (autonomy, self-improvement). Helps you make optimal technology selection in software development.
The first comprehensive, systematic LLM evaluation benchmark. As a milestone project of Stanford CRFM, HELM established the first complete taxonomy for language model evaluation. Through matrix evaluation over 16 core scenarios and 7 key metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency), it delivered a comprehensive comparison of 30 mainstream models. This pioneering work shifted model evaluation from 'comparing accuracy only' to 'holistic, transparent evaluation,' and is hailed as an important cornerstone of AI governance.
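A minimal sketch of the scenario-by-metric matrix idea: run every model on every scenario and report several metrics side by side rather than accuracy alone. The models, scenarios, and metric scores are toy placeholders, not HELM's actual harness.

```python
import random

MODELS = ["model_a", "model_b"]
SCENARIOS = ["qa_open", "summarization", "toxicity_probe"]
METRICS = ["accuracy", "calibration", "toxicity"]

def run_metric(model, scenario, metric):
    """Placeholder: a real harness would run prompts through the model and score outputs."""
    return round(random.uniform(0.0, 1.0), 2)

def evaluate():
    """Fill the full model x scenario x metric matrix."""
    return {(m, s, x): run_metric(m, s, x)
            for m in MODELS for s in SCENARIOS for x in METRICS}

def report(results):
    for model in MODELS:
        for scenario in SCENARIOS:
            scores = ", ".join(f"{x}={results[(model, scenario, x)]:.2f}" for x in METRICS)
            print(f"{model} | {scenario:<16} | {scores}")

report(evaluate())
```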
The first expert-level multi-disciplinary multimodal evaluation benchmark. As the first multimodal benchmark aimed at expert-level AGI, MMMU collected 11.5K high-quality questions from university exams, quizzes, and textbooks. It covers 6 core disciplines (art & design, business, science, health & medicine, humanities & social science, and technology & engineering), 30 subjects, and 183 sub-fields, and includes 30 highly heterogeneous image types. Its results show that even the most advanced GPT-4V and Gemini Ultra reach only 56% and 59% accuracy, highlighting the enormous challenges on the road to expert-level AGI.
The first real-world software engineering evaluation benchmark. As the first evaluation benchmark based on real GitHub issues, SWE-bench collected 2,294 actual issues and corresponding PRs from 12 popular Python repositories. These issues require models to understand and coordinate changes between multiple functions, classes, or even files, interact with execution environments, and handle extra-long contexts, far exceeding the complexity of traditional code generation tasks. This benchmark has become the gold standard for measuring the actual engineering capabilities of AI agents, and any intelligent agent claiming to have software engineering capabilities needs to be tested on this benchmark. The performance of the strongest models on this benchmark has improved from less than 2% to over 70%, marking a major breakthrough of AI in actual software engineering scenarios.
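A simplified sketch of how a SWE-bench-style check works: apply the model-generated patch to the repository at the issue's base commit, then run the issue's previously failing tests (and a regression set) to see whether they now pass. The real evaluation harness runs inside per-repo containers; the paths and test IDs below are hypothetical.

```python
import subprocess

def run(cmd, cwd):
    """Run a shell command in cwd; True on exit code 0."""
    return subprocess.run(cmd, cwd=cwd, shell=True).returncode == 0

def evaluate_instance(repo_dir, base_commit, model_patch, fail_to_pass, pass_to_pass):
    """Simplified SWE-bench-style check for a single task instance."""
    run(f"git checkout {base_commit}", repo_dir)            # issue's base commit
    with open(f"{repo_dir}/model.patch", "w") as f:
        f.write(model_patch)
    if not run("git apply model.patch", repo_dir):          # patch must apply cleanly
        return False
    resolved = all(run(f"python -m pytest {t}", repo_dir) for t in fail_to_pass)
    no_regressions = all(run(f"python -m pytest {t}", repo_dir) for t in pass_to_pass)
    return resolved and no_regressions

# Hypothetical usage with placeholder values:
# evaluate_instance("/tmp/somerepo", "abc123", patch_text,
#                   ["pkg/tests/test_fix.py::test_issue"],
#                   ["pkg/tests/test_core.py"])
```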
The first high-quality mathematical reasoning verification benchmark. As the first mathematical evaluation benchmark to introduce verifier training, GSM8K collected 8.5K linguistically diverse grade school math word problems. Its innovative verifier training method trains a model to judge whether a solution is correct; at test time, many candidate solutions are sampled and the highest-ranked one is selected. Experiments show that this verification approach significantly improves performance on math word problems and, as the data scale increases, outperforms a traditional fine-tuning baseline.
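A minimal sketch of verifier-based selection at test time: sample N candidate solutions, score each with the trained verifier, and return the highest-scoring one. `generate_solution` and `verifier_score` are hypothetical placeholders for the generator and the trained verifier.

```python
import random

def generate_solution(problem: str) -> str:
    """Placeholder for sampling one candidate solution from the generator."""
    answer = random.choice([18, 20, 24])
    return f"Step-by-step work for '{problem}' ... Final answer: {answer}"

def verifier_score(problem: str, solution: str) -> float:
    """Placeholder for the verifier's estimated probability that the solution is correct."""
    return 0.9 if "Final answer: 20" in solution else random.uniform(0.0, 0.5)

def best_of_n(problem: str, n: int = 16) -> str:
    """Best-of-N selection: generate N candidates and keep the top-ranked one."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda s: verifier_score(problem, s))

print(best_of_n("Each basket holds 5 apples; how many apples in 4 baskets?"))
```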
The first competition-level mathematical reasoning benchmark. As the first evaluation benchmark for mathematics competition difficulty, MATH collected 12.5K high school and university level math problems covering algebra, geometry, calculus, and other fields. Each problem is equipped with complete step-by-step solutions that can be used to train models to generate answer derivations and explanations. Research has found that even the largest Transformer models still have low accuracy on this benchmark, and breakthroughs cannot be achieved by simply increasing model parameters and computational budgets, highlighting the essential challenge of mathematical reasoning abilities.
Humanity's Last Exam. As the ultimate evaluation benchmark created by top global research institutions, HLE collected 3,000 high-difficulty multimodal questions at the frontier of human knowledge, covering dozens of disciplines including mathematics, the humanities, and the natural sciences. Each question is carefully designed not only to challenge the limits of AI capability but also to provide a key yardstick for tracking technology trajectories and governing risk. Even the latest top models perform poorly on this benchmark: DeepSeek-R1 reaches only 9.4% accuracy, o1 9.1%, Gemini 2.0 Flash Thinking only 6.2%, and Claude 3.5 Sonnet an even lower 4.3%. Such low scores highlight the still huge gap between AI and human expert knowledge and sound an alarm for the safe development of AI, which is why the benchmark is named 'Humanity's Last Exam.'