Breakthrough Papers

Voyager: An Open-Ended Embodied Agent with Large Language Models

Pioneering work in autonomous AI agents of the GPT-4 era. As one of the earliest papers (May 2023) to implement a truly autonomous agent, it established a new paradigm for agent self-learning. Through its automatic curriculum and continuously accumulated skill library, it achieved sustained, open-ended growth of an agent in an open world, and is widely regarded as a milestone for autonomous agents. Its exploration ability and skill-acquisition efficiency in Minecraft remain the benchmark in the field. A minimal sketch of the curriculum-and-skill-library loop follows the entry.

Authors: Wang et al. 2023 arXiv:2305.16291
Example Analysis
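
A minimal sketch of the automatic-curriculum-plus-skill-library loop described above, written under loose assumptions: `llm`, `embed`, and `run_in_minecraft` are hypothetical placeholders, not the paper's actual implementation (Voyager itself drives GPT-4 and the Mineflayer JavaScript API).

```python
# Hypothetical Voyager-style loop: curriculum proposal, code generation,
# environment feedback, and skill-library accumulation.

skill_library = []  # list of (task description, embedding, verified code)

def retrieve_skills(task, embed, top_k=5):
    """Return the stored code of the top-k skills most similar to the task."""
    query = embed(task)
    ranked = sorted(
        skill_library,
        key=lambda s: sum(q * v for q, v in zip(query, s[1])),  # dot-product similarity
        reverse=True,
    )
    return [code for _, _, code in ranked[:top_k]]

def voyager_loop(llm, embed, run_in_minecraft, agent_state, iterations=10):
    for _ in range(iterations):
        # Automatic curriculum: ask the LLM for the next task given current progress.
        task = llm(f"Given the agent state {agent_state}, propose the next task.")
        # Write control code for the task, conditioning on relevant stored skills.
        context = retrieve_skills(task, embed)
        code = llm(f"Task: {task}\nReusable skills: {context}\nWrite the control code.")
        success, feedback = run_in_minecraft(code)
        if success:
            # Skill library: keep verified code for later retrieval and reuse.
            skill_library.append((task, embed(task), code))
        else:
            agent_state["last_error"] = feedback  # fed back into the next round
```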

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

The GPT moment in robotics. As the first open-source general-purpose humanoid robot foundation model, GR00T N1 opens a new era of robot foundation models. Its dual-system architecture couples a vision-language module with a diffusion-transformer action module, achieving end-to-end training from vision and language to fluid motion for the first time. Its strong performance on real robot platforms marks the true arrival of the era of general-purpose robots. A conceptual sketch of the dual-system control loop follows the entry.

Authors: NVIDIA et al. 2025 arXiv:2503.14734
Example Analysis
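
A conceptual sketch of the dual-system idea, under stated assumptions: `vlm`, `action_head`, `get_observation`, and `send_commands` are hypothetical placeholders standing in for the paper's vision-language module and diffusion-transformer action module; the released model's actual interfaces are not reproduced here.

```python
def control_loop(vlm, action_head, get_observation, send_commands,
                 instruction, steps=1000, replan_every=10):
    """Alternate a slow deliberative module with a fast action module."""
    plan_latent = None
    for t in range(steps):
        obs = get_observation()  # e.g. {"image": ..., "proprio": ...}
        if t % replan_every == 0:
            # System 2: slow vision-language understanding of scene + instruction.
            plan_latent = vlm(obs["image"], instruction)
        # System 1: fast generation of a short chunk of continuous actions
        # (a diffusion transformer in the paper).
        action_chunk = action_head(plan_latent, obs["proprio"])
        send_commands(action_chunk)
```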

TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

Agent breakthrough in medical AI. As the first therapeutic reasoning agent to integrate 211 biomedical tools, TxAgent opens a new chapter for precision medicine. Its multi-step reasoning and real-time biomedical knowledge retrieval let the agent work through cases in a way that resembles a clinician's reasoning process. It reaches 92.1% accuracy on drug interaction analysis and personalized treatment recommendation tasks, surpassing GPT-4o, and is regarded as a milestone for medical AI agents. A generic sketch of its tool-calling loop follows the entry.

Authors: Gao et al. 2025 arXiv:2503.10970
Example Analysis
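
A hedged sketch of a TxAgent-style multi-step tool loop. `llm` and the `tools` registry are hypothetical placeholders; the paper's agent selects among 211 biomedical tools (its ToolUniverse), which are not reproduced here.

```python
import json

def tool_agent(llm, tools, question, max_steps=8):
    """Iteratively call tools until the model commits to a final answer."""
    transcript = [f"Question: {question}"]
    instruction = ('\nRespond with JSON, either {"tool": "<name>", "args": {...}} '
                   'to call a tool or {"answer": "<text>"} to finish.')
    for _ in range(max_steps):
        step = llm("\n".join(transcript) + instruction)
        decision = json.loads(step)
        if "answer" in decision:
            return decision["answer"]
        # e.g. a drug-interaction lookup or a trial-eligibility check
        result = tools[decision["tool"]](**decision["args"])
        transcript.append(f"Tool {decision['tool']} returned: {result}")
    return "No answer within the step budget."
```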

Genome modeling and design across all domains of life with Evo 2

The GPT moment in life sciences. As the first biological foundation model trained on 9.3 trillion DNA base pairs, Evo 2 opens a new era of genome design. Its one-million-token context window enables sequence modeling at unprecedented genomic scale. The model learns biological features without supervision and predicts the impact of genetic variants, and is regarded as a milestone for AI-driven life-science design. A sketch of likelihood-based variant scoring follows the entry.

Authors: Brixi et al. 2025 bioRxiv:2025.02.18.638918
Example Analysis
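
A sketch of likelihood-based variant-effect scoring with a genomic language model, as a way to read the claims above. `seq_log_likelihood` is a hypothetical scoring function; the released Evo 2 model exposes its own interfaces, which are not reproduced here.

```python
def variant_effect_score(seq_log_likelihood, reference: str,
                         position: int, alt_base: str) -> float:
    """Delta log-likelihood of the mutated sequence vs. the reference.

    Strongly negative scores suggest the variant disrupts sequence patterns
    the model has learned, a common proxy for functional impact.
    """
    mutated = reference[:position] + alt_base + reference[position + 1:]
    return seq_log_likelihood(mutated) - seq_log_likelihood(reference)
```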

Cosmos World Foundation Model Platform for Physical AI

Digital twin breakthrough in physical AI. As the first open-source world foundation model platform, Cosmos establishes a new paradigm for training physical AI. Through its video processing pipelines and pre-trained world models, it allows physical AI systems to be trained and evaluated against learned digital twins of the real world. Developers can customize dedicated world models for any physical AI system, which is why the platform is seen as a milestone for digital twins and a sign that physical AI development is entering the era of industrial-grade infrastructure.

Authors: NVIDIA et al. 2025 arXiv:2501.03575
Example Analysis

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Milestone breakthrough in reasoning capability. Using the GRPO training algorithm (Group Relative Policy Optimization), DeepSeek-R1 shows that strong reasoning can be trained largely through reinforcement learning. Its precursor DeepSeek-R1-Zero is trained with pure RL and no supervised fine-tuning data, yet develops long chain-of-thought behavior and posts strong scores on reasoning benchmarks such as AIME 2024 and MATH-500, improving dramatically over the base model during training. This largely automated training recipe not only reduces reliance on human-annotated reasoning data but also opens a new era for AI reasoning. A sketch of GRPO's group-relative advantage follows the entry.

Authors: DeepSeek-AI et al. 2025 arXiv:2501.12948
Example Analysis
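
A minimal sketch of the group-relative advantage at the heart of GRPO, assuming rule-based rewards (e.g., a checker that scores a math answer 1 if correct, 0 otherwise); the surrounding clipped policy-gradient update and KL penalty are omitted.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled completion's reward against its own group.

    For one prompt, G completions are sampled and rewarded; the advantage of
    completion i is (r_i - mean) / std, which removes the need for a learned
    value model.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers, only the last two judged correct.
print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))  # [-1.0, -1.0, 1.0, 1.0]
```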

Survey Papers

A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

The most comprehensive prompt engineering resource. Master 41 prompt engineering techniques in one paper, from basic to advanced, from principles to practice, and from unimodal to multimodal. Highly practical: each technique comes with an analysis of pros and cons, applicable scenarios, code examples, and dataset recommendations, helping you quickly find the most suitable approach. Widely considered the most time-saving prompt engineering study guide of 2024. A toy illustration of two basic techniques follows the entry.

Authors: Sahoo et al. 2024 arXiv:2402.07927
Example Analysis
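
An illustrative toy example of two basic techniques the survey catalogues, zero-shot chain-of-thought and few-shot prompting; the prompt wording is an assumption for illustration, not text from the paper.

```python
def chain_of_thought_prompt(question: str) -> str:
    # Zero-shot CoT: appending a reasoning cue elicits intermediate steps.
    return f"{question}\nLet's think step by step."

def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    # Few-shot prompting: prepend worked input/output pairs as demonstrations.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

print(chain_of_thought_prompt(
    "A train travels 60 km in 40 minutes. What is its speed in km/h?"))
```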

A Systematic Survey of Automatic Prompt Optimization Techniques

A guide to eliminating manual prompt debugging. The first systematic summary of Automatic Prompt Optimization (APO) techniques, organized around a unified five-part optimization framework. Comprehensive and practical: from seed prompt initialization and evaluation feedback to candidate generation and iterative selection, each stage comes with best practices and pointers to implementations. Let AI find the optimal prompts for you and greatly improve development efficiency. A sketch of such an optimization loop follows the entry.

Authors: Ramnath et al. 2025 arXiv:2502.16923
Example Analysis
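
A hedged sketch of an APO loop in the spirit of the survey's framework (seed prompts, evaluation, feedback, candidate generation, selection and iteration). `llm`, `score`, and `dev_set` are hypothetical placeholders, not the survey's reference implementation.

```python
def optimize_prompt(llm, score, dev_set, seed_prompt, rounds=5, beam=4):
    candidates = [seed_prompt]                                   # 1. seed prompt(s)
    for _ in range(rounds):
        ranked = sorted(candidates, key=lambda p: score(p, dev_set),
                        reverse=True)                            # 2. evaluate on a dev set
        survivors = ranked[:beam]                                # 5. select the best
        new_candidates = []
        for parent in survivors:
            critique = llm(f"Critique this prompt's failure modes: {parent}")   # 3. feedback
            new_candidates.append(
                llm("Rewrite the prompt to address the critique.\n"
                    f"Critique: {critique}\nPrompt: {parent}"))                  # 4. new candidates
        candidates = survivors + new_candidates                  # iterate
    return max(candidates, key=lambda p: score(p, dev_set))
```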

A Survey on Knowledge-Oriented Retrieval-Augmented Generation

A complete guide to RAG technology. From basic RAG to advanced variants (multimodal, memory-augmented, agentic), it systematically maps out the whole pipeline of knowledge acquisition, integration, and utilization. Outstanding practical value: in-depth analysis of retrieval mechanisms, generation processes, and how best to integrate the two, plus rich evaluation benchmarks and dataset recommendations. Learn in one paper how to equip models with reliable access to external knowledge. A minimal retrieve-then-generate sketch follows the entry.

Authors: Cheng et al. 2025 arXiv:2503.10677
Example Analysis
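
A minimal retrieve-then-generate sketch of the basic RAG pattern the survey builds on. `embed` and `llm` are hypothetical placeholders, and the brute-force similarity search stands in for the vector database and chunking a production system would use.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rag_answer(llm, embed, corpus: list[str], question: str, top_k: int = 3) -> str:
    q_vec = embed(question)
    # Retrieve the documents most similar to the question.
    ranked = sorted(corpus, key=lambda doc: cosine(embed(doc), q_vec), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    # Generate grounded in the retrieved context.
    return llm(f"Answer using only the context below.\n\n"
               f"Context:\n{context}\n\nQuestion: {question}")
```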

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

The agent revolution in RAG. The first systematic treatment of how to integrate autonomous agents into the RAG pipeline, breaking through the static workflows of traditional RAG. Comprehensive architecture coverage: detailed introductions to single-agent, multi-agent, and hierarchical designs, along with practical applications in medicine, finance, education, and other fields. Learn in one paper how to build the next generation of RAG systems with reflection, planning, and collaboration capabilities. A short sketch of an agentic retrieval loop follows the entry.

Authors: Singh et al. 2025 arXiv:2501.09136
Example Analysis
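
A short sketch of the agentic twist: the model decides whether retrieved evidence suffices and rewrites the query if not. `llm` and `retrieve` are hypothetical placeholders; real agentic RAG systems add planning, multiple agents, and tool use on top of this loop.

```python
def agentic_rag(llm, retrieve, question: str, max_rounds: int = 3) -> str:
    query = question
    evidence: list[str] = []
    for _ in range(max_rounds):
        evidence += retrieve(query)
        # Reflection step: is the evidence enough, or should we search again?
        verdict = llm(f"Question: {question}\nEvidence: {evidence}\n"
                      "Reply SUFFICIENT, or reply with a better search query.")
        if verdict.strip().upper() == "SUFFICIENT":
            break
        query = verdict  # refined query for the next retrieval round
    return llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")
```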

Survey on Evaluation of LLM-based Agents

Authoritative guide to agent evaluation. The first comprehensive review of LLM agent evaluation methods, covering a complete evaluation system from basic capabilities (planning, tool use, self-reflection, memory) to professional domains (web, software engineering, science, dialogue). Complete evaluation framework: In-depth analysis of the advantages and disadvantages of existing benchmarks, revealing evaluation trends (continuous updates, real scenarios), and pointing out key challenges (cost-effectiveness, security, robustness). Helps you quickly find the most suitable agent evaluation solution.

Authors: Yehudai et al. 2025 arXiv:2503.16416
Example Analysis

From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future

Guide to the AI revolution in software engineering. The first systematic comparison of LLMs and agents in six core software engineering tasks: requirements engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. Outstanding practical value: In-depth analysis of benchmarks, evaluation metrics, and best practices for each task, revealing LLM limitations (context length, hallucinations, tool use) and agent advantages (autonomy, self-improvement). Helps you make optimal technology selection in software development.

Authors: Jin et al. 2024 arXiv:2408.02479
Example Analysis

Benchmark Papers

Holistic Evaluation of Language Models

The first comprehensive, systematic LLM evaluation benchmark. As a milestone project from Stanford CRFM, HELM established the first complete taxonomy for language model evaluation. Through matrix evaluation of 16 core scenarios against 7 key metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency), it delivered a head-to-head comparison of 30 mainstream models. This pioneering work moved model evaluation from 'comparing accuracy only' to 'comprehensive, transparent evaluation' and is regarded as an important cornerstone of AI governance. A toy sketch of the scenario-by-metric matrix follows the entry.

Authors: Liang et al. 2022 arXiv:2211.09110
Example Analysis
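
A toy sketch of HELM's matrix idea: every model is scored on every scenario under several metrics rather than a single accuracy number. The scenario loader, model runner, and metric functions are hypothetical placeholders supplied by the caller.

```python
def helm_style_matrix(models, scenarios, metrics, load_scenario, run_model):
    """Score every model on every scenario under every metric."""
    results = {}
    for model in models:
        for scenario in scenarios:
            examples, references = load_scenario(scenario)
            predictions = [run_model(model, example) for example in examples]
            for metric_name, metric_fn in metrics.items():
                results[(model, scenario, metric_name)] = metric_fn(predictions, references)
    # A (model x scenario x metric) grid, e.g. 30 models x 16 scenarios x 7 metrics.
    return results
```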

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

The first expert-level, multi-disciplinary multimodal evaluation benchmark. Aimed at expert-level AGI, MMMU collects 11.5K high-quality questions from university exams, quizzes, and textbooks. It spans 6 core disciplines (art & design, business, science, health & medicine, humanities & social science, and tech & engineering), 30 subjects, and 183 subfields, with 30 highly heterogeneous image types. Even the most advanced models at release, GPT-4V and Gemini Ultra, reach only 56% and 59% accuracy, highlighting the enormous challenges on the road to expert-level AGI.

Authors: Yue et al. 2023 arXiv:2311.16502
Example Analysis

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

The first software engineering benchmark built from real-world repositories. Based on actual GitHub issues, SWE-bench collects 2,294 issues and their corresponding pull requests from 12 popular Python repositories. Resolving them requires models to understand and coordinate changes across multiple functions, classes, or even files, interact with execution environments, and handle very long contexts, far beyond the complexity of traditional code generation tasks. The benchmark has become the gold standard for measuring the practical engineering ability of AI agents, and any agent claiming software engineering capability is expected to be tested on it. Resolution rates of the strongest models have climbed from under 2% at release to over 70% on the human-validated SWE-bench Verified subset, marking a major step for AI in real software engineering. A hedged sketch of the apply-patch-and-run-tests check follows the entry.

Authors: Jimenez et al. 2023 arXiv:2310.06770
Example Analysis
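
A hedged sketch of the apply-patch-and-run-tests check behind a SWE-bench-style evaluation. This is not the official harness (which runs per-instance containerized environments and also checks that previously passing tests still pass); arguments such as `fail_to_pass` follow the benchmark's terminology but are assumptions here.

```python
import subprocess

def resolves_issue(repo_dir: str, base_commit: str, model_patch: str,
                   fail_to_pass: list[str]) -> bool:
    # Check out the repository state the issue was filed against.
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    # Apply the model-generated patch from stdin.
    apply = subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                           input=model_patch, text=True, capture_output=True)
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly
    # Run the tests that the reference fix is expected to turn from failing to passing.
    tests = subprocess.run(["python", "-m", "pytest", "-x", *fail_to_pass],
                           cwd=repo_dir, capture_output=True)
    return tests.returncode == 0  # resolved only if these tests now pass
```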

GSM8K: Training Verifiers to Solve Math Word Problems

The first high-quality benchmark for verifier-based mathematical reasoning. GSM8K collects 8.5K linguistically diverse grade-school math word problems and is the first math benchmark to introduce verifier training. Its method improves performance by sampling many candidate solutions and letting a trained verifier select the one most likely to be correct. Experiments show that this verification approach significantly outperforms plain fine-tuning on math word problems, and that its advantage grows as the amount of data increases. A sketch of best-of-n selection with a verifier follows the entry.

Authors: Cobbe et al. 2021 arXiv:2110.14168
Example Analysis
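
A minimal sketch of the verification idea: sample many candidate solutions and let a trained verifier pick the one it rates most likely to be correct. `generate_solutions` and `verifier_score` are hypothetical placeholders for the paper's generator and learned verifier.

```python
def best_of_n(generate_solutions, verifier_score, problem: str, n: int = 100) -> str:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates = generate_solutions(problem, n)
    return max(candidates, key=lambda solution: verifier_score(problem, solution))
```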

MATH: Measuring Mathematical Problem Solving With the MATH Dataset

The first competition-level mathematical reasoning benchmark. MATH collects 12,500 competition mathematics problems spanning algebra, geometry, number theory, counting and probability, precalculus, and more. Every problem comes with a complete step-by-step solution that can be used to train models to generate derivations and explanations. The paper finds that even very large Transformer models achieve low accuracy on this benchmark and that simply scaling parameters and compute budgets is unlikely to close the gap, underscoring the essential difficulty of mathematical reasoning.

Authors: Hendrycks et al. 2021 arXiv:2103.03874
Example Analysis

Humanity's Last Exam

Humanity's last closed-book exam. Created by a global coalition of researchers, HLE collects 3,000 highly difficult, multimodal questions at the frontier of human knowledge, covering dozens of disciplines including mathematics, the humanities, and the natural sciences. Each question is carefully designed both to probe the limits of AI capability and to provide a yardstick for tracking technological trajectories and informing risk governance. Even the latest frontier models perform poorly: DeepSeek-R1 reaches only 9.4% accuracy, o1 9.1%, Gemini 2.0 Flash Thinking 6.2%, and Claude 3.5 Sonnet just 4.3%. Such low scores highlight the gap that remains between AI and human expert knowledge and serve as a warning for the safe development of future systems.

Authors: Phan et al. 2025 arXiv:2501.14249
Example Analysis