SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

Code and data for the paper "SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development".

1Shanghai Jiao Tong University, 2Beijing University of Aeronautics and Astronautics,
3Soochow University, 5University of Michigan

Abstract

Large Language Models (LLMs) have shown strong capability across diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionality for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely pairs every instance with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set makes a 7B model comparable to GPT-4o on the hard split, underscoring the value of its high-quality training data. Read more about SWE-Dev in our paper!


Features

1 Realistic scale and complexity
SWE-Dev requires substantial code modifications (avg. 190 LOC across 3 files), challenging models with the cross-file dependencies, large contexts, and significant implementation scope characteristic of real-world feature development.
2 Robust and grounded evaluation
Each SWE-Dev sample is grounded in a real open-source repository, guided by a well-defined Project Requirement Document (PRD), and evaluated with executable test cases that check the functional correctness of the proposed implementation. This design keeps task objectives and evaluation aligned, enabling robust assessment and model supervision.
3 Verifiable training set with executable test suites
Uniquely, all 14,000 training instances are paired with runnable environments and executable unit tests, providing crucial execution-based feedback that supports the following (see the sketch after this list):
Supervised Fine-Tuning (SFT)
Reinforcement Learning (RL) with accurate rewards
Multi-Agent System (MAS) training
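
As a concrete illustration, the snippet below sketches how one sample's developer-authored tests could be executed inside its runnable environment to obtain such an execution-based signal. The pytest invocation and output parsing are illustrative assumptions, not the official SWE-Dev harness.

```python
# A minimal sketch of execution-based feedback: run one sample's developer-authored
# tests inside its runnable environment and return the fraction that pass.
# The pytest invocation and output parsing are assumptions, not the official harness.
import re
import subprocess


def test_pass_rate(repo_dir: str, test_files: list[str], timeout: int = 300) -> float:
    cmd = ["python", "-m", "pytest", "-q", "--tb=no", *test_files]
    try:
        proc = subprocess.run(
            cmd, cwd=repo_dir, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 0.0  # a hung test suite counts as a failure
    # Parse pytest's summary line, e.g. "4 passed, 2 failed in 1.23s".
    counts = {
        name: int(num)
        for num, name in re.findall(r"(\d+) (passed|failed|errors?)", proc.stdout)
    }
    total = sum(counts.values())
    return counts.get("passed", 0) / total if total else 0.0
```

Either the raw pass rate or a binary pass/fail (pass rate equal to 1.0) can then serve as the SFT filter, RL reward, or MAS supervision signal listed above.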

Leaderboard

| # | Model | Org | Params | Date | Pass@1 Easy (%) | Pass@1 Hard (%) | Pass@3 Easy (%) | Pass@3 Hard (%) |
|---|-------|-----|--------|------|-----------------|-----------------|-----------------|-----------------|
| 1 | Qwen3-8B | Qwen | 8B | 2025-04-29 | 34.04 | 12.09 | 39.26 | 13.33 |
| 2 | Qwen3-8B 🧠 | Qwen | 8B | 2025-04-29 | 19.47 | 6.36 | 25.91 | 9.22 |
| 3 | Qwen3-30B-A3B | Qwen | 30B | 2025-04-29 | 35.84 | 12.76 | 39.45 | 15.20 |
| 4 | Qwen3-30B-A3B 🧠 | Qwen | 30B | 2025-04-29 | 23.63 | 8.30 | 31.00 | 11.60 |
| 5 | o3 🧠 | OpenAI | - | 2025-04-16 | 51.21 | 21.86 | 59.05 | 28.98 |
| 6 | QwQ-32B-Preview 🧠 | Qwen | 32B | 2025-03-06 | 4.50 | 0.70 | 8.90 | 1.22 |
| 7 | Claude-3.7-Sonnet | Anthropic | - | 2025-02-25 | 53.09 | 19.74 | 56.35 | 24.25 |
| 8 | Claude-3.7-Sonnet 🧠 | Anthropic | - | 2025-02-25 | 49.47 | 22.51 | 56.58 | 29.28 |
| 9 | grok-3-beta 🧠 | xAI | - | 2025-02-19 | 53.63 | 18.97 | 59.08 | 22.26 |
| 10 | DeepSeek-R1-Distill-Qwen2.5-7B 🧠 | deepseek-ai | 7B | 2025-01-20 | 6.30 | 1.29 | 10.30 | 1.95 |
| 11 | DeepSeek-R1-Distill-Qwen2.5-32B 🧠 | deepseek-ai | 32B | 2025-01-20 | 24.25 | 9.79 | 40.53 | 19.04 |
| 12 | DeepSeek-R1-Distill-Llama-70B 🧠 | deepseek-ai | 70B | 2025-01-20 | 32.73 | 8.19 | 45.72 | 11.33 |
| 13 | DeepSeek-R1 🧠 | deepseek-ai | - | 2025-01-20 | 28.55 | 12.84 | 37.62 | 17.72 |
| 14 | DeepSeek-V3 | deepseek-ai | - | 2024-12-26 | 41.95 | 16.22 | 56.79 | 21.62 |
| 15 | Phi-4 | microsoft | 14B | 2024-12-12 | 21.99 | 5.57 | 27.89 | 8.56 |
| 16 | Llama-3.3-70B-Instruct | meta-llama | 70B | 2024-12-06 | 33.84 | 12.85 | 39.57 | 14.95 |
| 17 | o1 🧠 | OpenAI | - | 2024-12-05 | 36.36 | 11.09 | 43.77 | 14.27 |
| 18 | Qwen2.5-Coder-14B-Instruct | Qwen | 14B | 2024-11-12 | 39.51 | 14.82 | 52.49 | 18.44 |
| 19 | Qwen2.5-1.5B-Instruct | Qwen | 1.5B | 2024-09-19 | 8.05 | 1.23 | 10.76 | 2.22 |
| 20 | Qwen2.5-3B-Instruct | Qwen | 3B | 2024-09-19 | 15.93 | 5.27 | 21.99 | 7.47 |
| 21 | Qwen2.5-7B-Instruct | Qwen | 7B | 2024-09-19 | 25.74 | 6.68 | 33.35 | 7.73 |
| 22 | Qwen2.5-14B-Instruct | Qwen | 14B | 2024-09-19 | 38.08 | 13.16 | 46.32 | 15.89 |
| 23 | Qwen2.5-32B-Instruct | Qwen | 32B | 2024-09-19 | 43.64 | 10.15 | 51.24 | 11.69 |
| 24 | Qwen2.5-72B-Instruct | Qwen | 72B | 2024-09-19 | 49.01 | 10.62 | 57.20 | 12.33 |
| 25 | Llama-3.1-8B-Instruct | meta-llama | 8B | 2024-07-23 | 26.43 | 7.94 | 33.01 | 10.24 |
| 26 | GPT-4o-mini | OpenAI | - | 2024-07-18 | 34.47 | 11.09 | 41.94 | 13.84 |
| 27 | DeepSeek-Coder-V2-Lite-Instruct | deepseek-ai | - | 2024-06-17 | 21.53 | 8.19 | 29.68 | 11.33 |
| 28 | GPT-4o | OpenAI | - | 2024-05-13 | 54.37 | 19.13 | 68.70 | 21.91 |

Multi-Agent Systems (MAS):

| # | Method | Date | Easy Pass@1 (%) | Easy Calls | Easy Price ($) | Hard Pass@1 (%) | Hard Calls | Hard Price ($) |
|---|--------|------|-----------------|------------|----------------|-----------------|------------|----------------|
| 1 | Reflexion | 2023-10-10 | 39.77 | 2.12 | 0.83 | 13.32 | 2.18 | 1.35 |
| 2 | Self Refine | 2023-05-26 | 40.02 | 5.00 | 5.78 | 20.03 | 5.00 | 5.80 |
| 3 | Self Consistency | 2023-03-08 | 37.62 | 6.00 | 4.30 | 18.55 | 6.00 | 7.08 |
| 4 | LLM Debate | 2023-05-24 | 38.48 | 7.00 | 5.95 | 14.56 | 7.00 | 9.35 |
| 5 | MAD | 2024-10-09 | 31.50 | 7.00 | 2.48 | 15.31 | 7.00 | 3.40 |
| 6 | AgentVerse | 2023-10-23 | 38.67 | 4.52 | 1.40 | 13.42 | 4.83 | 2.90 |
| 7 | EvoMAC | 2024-10-22 | 34.59 | 7.98 | 3.20 | 13.60 | 8.30 | 4.65 |
| 8 | MetaGPT | 2024-12-01 | 29.56 | 9.69 | 2.20 | 9.25 | 10.37 | 4.95 |
| 9 | MapCoder | 2024-05-19 | 24.55 | 21.01 | 6.05 | 5.87 | 23.41 | 10.55 |
| 10 | ChatDev | 2024-06-05 | 35.13 | 26.61 | 3.53 | 11.70 | 30.87 | 6.10 |

📍 We use 🧠 to denote reasoning models.


We compare Pass@1 and Pass@3 scores for 17 chatbot LLMs and 10 reasoning LLMs on SWE-Dev.
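
For reference, Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); the snippet below is a minimal sketch under the assumption that SWE-Dev follows this standard protocol (the exact sampling setup is not restated here).

```python
# Minimal sketch of the standard unbiased Pass@k estimator (Chen et al., 2021).
# Whether SWE-Dev uses exactly this estimator is an assumption.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n: completions sampled per task, c: completions that pass all tests, k: budget."""
    if n - c < k:
        return 1.0  # too few failures to draw a k-sized sample with no passing completion
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 3 samples per task, 1 passing -> Pass@1 ~ 0.33, Pass@3 = 1.0
print(pass_at_k(3, 1, 1), pass_at_k(3, 1, 3))
```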

1 SWE-Dev Poses High Challenge
SWE-Dev presents substantial challenges for current LLMs, with even the best-performing Claude-3.7-Sonnet achieving only 22.45% Pass@3 on the Hard split, revealing a clear gap between existing AI coding capabilities and real-world software engineering demands.
2 Model Performance Patterns
All LLMs perform better on the Easy split than the Hard split, and performance generally scales with model size within the same family, demonstrating that SWE-Dev can effectively distinguish different model capabilities and aligns with our understanding of LLM capability scaling.
3 Reasoning Model Performance
Reasoning models generally underperform their corresponding base chatbot models, with Claude-3.7-Sonnet being the exception, indicating that current reasoning strategies do not consistently translate into gains for complex, repository-level generation tasks.

Training Support

1️⃣ Single Agent Training

SFT

RL

We evaluate SWE-Dev's support for different single-agent training methods, including SFT and RL.
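
As an illustration, a SWE-Dev training sample could be turned into a chat-style SFT example roughly as sketched below. The field names (`prd`, `incomplete_files`, `gt_functions`) are hypothetical placeholders rather than the released schema; for RL, the test pass rate from the harness sketch above can serve directly as the reward.

```python
# A sketch of turning one SWE-Dev training sample into a chat-style SFT example.
# Field names ("prd", "incomplete_files", "gt_functions") are hypothetical; check
# the released data schema for the actual keys.
def build_sft_example(sample: dict) -> dict:
    context = "\n\n".join(
        f"# {path}\n{code}" for path, code in sample["incomplete_files"].items()
    )
    prompt = (
        "You are implementing a new feature in an existing repository.\n\n"
        f"Project Requirement Document:\n{sample['prd']}\n\n"
        f"Relevant (incomplete) source files:\n{context}\n\n"
        "Complete the missing function bodies so that the developer tests pass."
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": sample["gt_functions"]},
        ]
    }
```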

2️⃣ Multi-Agent Training

MAS have shown promising results on SWE-Dev, so we further investigate how MAS can be trained on this dataset. As depicted in Fig. 7, the ground-truth test-case supervision in SWE-Dev enables EvoMAC to improve its performance across multiple rounds of reasoning. This iterative refinement motivates us to adopt EvoMAC as the MAS for training on SWE-Dev: we apply rejection sampling to enhance agent performance via role-wise training.
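
A rough sketch of that rejection-sampling step is shown below, assuming each MAS rollout records per-role (input, output) pairs and the final code's test pass rate; `run_mas` and the trajectory layout are hypothetical, not the paper's exact implementation.

```python
# A minimal sketch of rejection sampling for role-wise MAS training.
# `run_mas` and the trajectory fields ("pass_rate", "steps", "role", ...) are
# hypothetical placeholders for an EvoMAC-style rollout on one SWE-Dev sample.
def collect_role_sft_data(tasks, run_mas, n_samples: int = 8, threshold: float = 1.0):
    """Keep only trajectories whose final code passes the tests, split by agent role."""
    role_datasets: dict[str, list[dict]] = {}
    for task in tasks:
        for _ in range(n_samples):
            traj = run_mas(task)             # one MAS rollout on this sample
            if traj["pass_rate"] < threshold:
                continue                     # reject trajectories that fail the tests
            for step in traj["steps"]:       # accepted: keep each role's (input, output) pair
                role_datasets.setdefault(step["role"], []).append(
                    {"prompt": step["input"], "completion": step["output"]}
                )
    return role_datasets
```

Each role's filtered dataset can then be used to fine-tune that role's agent separately, which is what role-wise training refers to here.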



Dataset Construction


Step 1: We collect real-world repositories with passing test files in Dockerized environments.

Step 2: We trace test executions to construct function-level call trees that link test cases to the source code they invoke.

Step 3: We mask core functions and generate refined PRDs to create tasks. Each sample includes an incomplete repository, a natural-language requirement, and executable test cases, enabling realistic, verifiable feature development.
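
For Step 2, function-level call information can in principle be recovered with Python's built-in tracing hooks; the snippet below is a sketch of one possible approach using `sys.setprofile`, not the pipeline's actual tracer.

```python
# A minimal sketch of Step 2: run a test and record which repository functions it calls.
# This uses Python's built-in sys.setprofile; the real pipeline's tracer may differ.
import sys


def trace_called_functions(test_fn, repo_prefix: str) -> set[tuple[str, str]]:
    """Run `test_fn` and record (file, function) pairs defined under `repo_prefix`."""
    called: set[tuple[str, str]] = set()

    def profiler(frame, event, arg):
        if event == "call":
            filename = frame.f_code.co_filename
            if filename.startswith(repo_prefix):
                called.add((filename, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        test_fn()
    finally:
        sys.setprofile(None)  # always detach the profiler
    return called
```

Aggregating these (file, function) sets across a sample's tests yields the call tree linking each test to the functions it exercises, which Step 3 then masks.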



SWE-Dev consists of 14,000 training samples and 500 test samples derived from over 1,000 open-source projects, with the test set manually curated and split into easy (250 instances) and hard (250 instances) difficulty levels. The dataset demonstrates substantial scale and complexity, with each sample requiring an average of 190 lines of code across 3 files and involving approximately 6 target functions for implementation. The average Project Requirement Document (PRD) contains 1,845 tokens, and each sample includes around 6 unit tests for functional evaluation.


BibTeX

@article{du2025swe,
  title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
  author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
  journal={arXiv preprint arXiv:2505.16975},
  year={2025}
}