Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides every instance with a runnable environment and developer-authored, executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from the executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set brings a 7B model to parity with GPT-4o on the hard split, underscoring the value of its high-quality training data. Read more about SWE-Dev in our paper!
| # | Model | Params | Date | Pass@1 Easy (%) | Pass@1 Hard (%) | Pass@3 Easy (%) | Pass@3 Hard (%) |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3-8B (Qwen) | 8B | 2025-04-29 | 34.04 | 12.09 | 39.26 | 13.33 |
| 2 | Qwen3-8B 🧠 (Qwen) | 8B | 2025-04-29 | 19.47 | 6.36 | 25.91 | 9.22 |
| 3 | Qwen3-30B-A3B (Qwen) | 30B | 2025-04-29 | 35.84 | 12.76 | 39.45 | 15.20 |
| 4 | Qwen3-30B-A3B 🧠 (Qwen) | 30B | 2025-04-29 | 23.63 | 8.30 | 31.00 | 11.60 |
| 5 | o3 🧠 (OpenAI) | - | 2025-04-16 | 51.21 | 21.86 | 59.05 | 28.98 |
| 6 | QwQ-32B-Preview 🧠 (Qwen) | 32B | 2025-03-06 | 4.50 | 0.70 | 8.90 | 1.22 |
| 7 | Claude-3.7-Sonnet (Anthropic) | - | 2025-02-25 | 53.09 | 19.74 | 56.35 | 24.25 |
| 8 | Claude-3.7-Sonnet 🧠 (Anthropic) | - | 2025-02-25 | 49.47 | 22.51 | 56.58 | 29.28 |
| 9 | grok-3-beta 🧠 (xAI) | - | 2025-02-19 | 53.63 | 18.97 | 59.08 | 22.26 |
| 10 | DeepSeek-R1-distill-Qwen2.5-7B 🧠 (deepseek-ai) | 7B | 2025-01-20 | 6.30 | 1.29 | 10.30 | 1.95 |
| 11 | DeepSeek-R1-distill-Qwen2.5-32B 🧠 (deepseek-ai) | 32B | 2025-01-20 | 24.25 | 9.79 | 40.53 | 19.04 |
| 12 | DeepSeek-R1-distill-Llama-70B 🧠 (deepseek-ai) | 70B | 2025-01-20 | 32.73 | 8.19 | 45.72 | 11.33 |
| 13 | DeepSeek-R1 🧠 (deepseek-ai) | - | 2025-01-20 | 28.55 | 12.84 | 37.62 | 17.72 |
| 14 | DeepSeek-V3 (deepseek-ai) | - | 2024-12-26 | 41.95 | 16.22 | 56.79 | 21.62 |
| 15 | Phi-4 (microsoft) | 14B | 2024-12-12 | 21.99 | 5.57 | 27.89 | 8.56 |
| 16 | Llama-3.3-70B-Instruct (meta-llama) | 70B | 2024-12-06 | 33.84 | 12.85 | 39.57 | 14.95 |
| 17 | o1 🧠 (OpenAI) | - | 2024-12-05 | 36.36 | 11.09 | 43.77 | 14.27 |
| 18 | Qwen2.5-Coder-14B-Instruct (Qwen) | 14B | 2024-11-12 | 39.51 | 14.82 | 52.49 | 18.44 |
| 19 | Qwen2.5-1.5B-Instruct (Qwen) | 1.5B | 2024-09-19 | 8.05 | 1.23 | 10.76 | 2.22 |
| 20 | Qwen2.5-3B-Instruct (Qwen) | 3B | 2024-09-19 | 15.93 | 5.27 | 21.99 | 7.47 |
| 21 | Qwen2.5-7B-Instruct (Qwen) | 7B | 2024-09-19 | 25.74 | 6.68 | 33.35 | 7.73 |
| 22 | Qwen2.5-14B-Instruct (Qwen) | 14B | 2024-09-19 | 38.08 | 13.16 | 46.32 | 15.89 |
| 23 | Qwen2.5-32B-Instruct (Qwen) | 32B | 2024-09-19 | 43.64 | 10.15 | 51.24 | 11.69 |
| 24 | Qwen2.5-72B-Instruct (Qwen) | 72B | 2024-09-19 | 49.01 | 10.62 | 57.20 | 12.33 |
| 25 | Llama-3.1-8B-Instruct (meta-llama) | 8B | 2024-07-23 | 26.43 | 7.94 | 33.01 | 10.24 |
| 26 | GPT-4o-mini (OpenAI) | - | 2024-07-18 | 34.47 | 11.09 | 41.94 | 13.84 |
| 27 | DeepSeek-Coder-V2-Lite-Instruct (deepseek-ai) | - | 2024-06-17 | 21.53 | 8.19 | 29.68 | 11.33 |
| 28 | GPT-4o (OpenAI) | - | 2024-05-13 | 54.37 | 19.13 | 68.70 | 21.91 |
| # | Method | Date | Easy Pass@1 (%) | Easy Calls | Easy Price ($) | Hard Pass@1 (%) | Hard Calls | Hard Price ($) |
|---|---|---|---|---|---|---|---|---|
| 1 | Reflexion | 2023-10-10 | 39.77 | 2.12 | 0.83 | 13.32 | 2.18 | 1.35 |
| 2 | Self Refine | 2023-05-26 | 40.02 | 5.00 | 5.78 | 20.03 | 5.00 | 5.80 |
| 3 | Self Consistency | 2023-03-08 | 37.62 | 6.00 | 4.30 | 18.55 | 6.00 | 7.08 |
| 4 | LLM Debate | 2023-05-24 | 38.48 | 7.00 | 5.95 | 14.56 | 7.00 | 9.35 |
| 5 | MAD | 2024-10-09 | 31.50 | 7.00 | 2.48 | 15.31 | 7.00 | 3.40 |
| 6 | Agentverse | 2023-10-23 | 38.67 | 4.52 | 1.40 | 13.42 | 4.83 | 2.90 |
| 7 | EvoMAC | 2024-10-22 | 34.59 | 7.98 | 3.20 | 13.60 | 8.30 | 4.65 |
| 8 | MetaGPT | 2024-12-01 | 29.56 | 9.69 | 2.20 | 9.25 | 10.37 | 4.95 |
| 9 | MapCoder | 2024-05-19 | 24.55 | 21.01 | 6.05 | 5.87 | 23.41 | 10.55 |
| 10 | ChatDev | 2024-06-05 | 35.13 | 26.61 | 3.53 | 11.70 | 30.87 | 6.10 |
📍 We use 🧠 to denote reasoning models.
We compare Pass@3 scores for 17 chatbot LLMs and 10 reasoning LLMs on SWE-Dev.
We evaluate SWE-Dev's support for different training methods, including SFT and RL.
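As a minimal sketch of how the executable unit tests can supply an RL reward, the snippet below runs a pytest file inside a task's repository and returns the fraction of tests that pass. The invocation and reward shaping are illustrative assumptions, not necessarily the exact setup used in the paper.

```python
import re
import subprocess

def unit_test_reward(repo_dir: str, test_file: str, timeout: int = 600) -> float:
    """Run the developer-authored tests and return the fraction that pass.

    The pass rate in [0, 1] can be used directly as an RL reward, or
    binarized (1.0 only when every test passes) for a stricter signal.
    """
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", test_file, "-q", "--tb=no"],
            cwd=repo_dir, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hanging or non-terminating code earns no reward

    # Parse pytest's short summary, e.g. "3 passed, 2 failed in 1.21s".
    counts = {
        label: int(num)
        for num, label in re.findall(r"(\d+) (passed|failed|error)", result.stdout)
    }
    passed = counts.get("passed", 0)
    total = passed + counts.get("failed", 0) + counts.get("error", 0)
    return passed / total if total else 0.0
```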
MAS has shown promising results on SWE-Dev, and we further investigate how MAS can be trained on this dataset. As depicted in Fig. 7, the ground-truth test-case supervision in SWE-Dev enables EvoMAC to improve its performance across multiple rounds of reasoning. This iterative refinement motivates us to adopt EvoMAC as the MAS for training on SWE-Dev. We apply rejection sampling to enhance agent performance via role-wise training.
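A rough sketch of that rejection-sampling loop follows. Here `run_evomac` and `run_unit_tests` are hypothetical stand-ins for the actual EvoMAC pipeline and the Dockerized test harness: rollouts whose final repository fails the ground-truth tests are discarded, and the surviving dialogue turns are grouped by agent role to form role-wise SFT data.

```python
def collect_role_wise_sft_data(tasks, num_samples: int = 8) -> dict:
    """Rejection sampling for role-wise MAS training.

    Keeps only trajectories that pass the ground-truth unit tests, then
    splits the kept turns by agent role so each role can be fine-tuned on
    its own successful behaviour.
    """
    sft_data: dict[str, list] = {}  # role -> list of (prompt, response) pairs
    for task in tasks:
        for _ in range(num_samples):
            # run_evomac / run_unit_tests are hypothetical stand-ins, not
            # real APIs from the EvoMAC codebase.
            trajectory = run_evomac(task)                # one multi-agent rollout
            if not run_unit_tests(task, trajectory.final_repo):
                continue                                 # reject failing rollouts
            for turn in trajectory.turns:                # e.g. coder / tester roles
                sft_data.setdefault(turn.role, []).append(
                    (turn.prompt, turn.response)
                )
    return sft_data
```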
Step 1: Collect real-world repositories whose test files pass inside Dockerized environments.
Step 2: Trace test executions to construct function-level call trees linking each test case to the source code it invokes (a minimal tracing sketch follows these steps).
Step 3: Mask the core functions and generate refined PRDs to create tasks. Each sample includes an incomplete repository, a natural-language requirement, and executable test cases, enabling realistic, verifiable feature development.
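Step 2's dynamic tracing can be approximated with Python's built-in profiling hook. The sketch below records caller-to-callee edges while a test suite runs in-process, which is enough to link each test function to the source functions it exercises; it is an assumed approximation, not necessarily the exact tooling behind SWE-Dev.

```python
import sys
from collections import defaultdict

def trace_test_calls(run_tests):
    """Record caller -> callee edges while `run_tests` executes.

    `run_tests` is any zero-argument callable that runs the test suite
    in-process, e.g. `lambda: pytest.main(["tests/"])`. Walking the returned
    edge map from each test function recovers the call tree of source
    functions it invokes.
    """
    edges = defaultdict(set)

    def profiler(frame, event, arg):
        # "call" fires for every Python-level function call.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code
            callee = frame.f_code
            edges[(caller.co_filename, caller.co_name)].add(
                (callee.co_filename, callee.co_name)
            )

    sys.setprofile(profiler)   # install the profiling hook
    try:
        run_tests()
    finally:
        sys.setprofile(None)   # always remove the hook, even on test errors
    return edges
```

For example, `trace_test_calls(lambda: pytest.main(["tests/"]))` would map each test to the functions it touches; the `tests/` path is illustrative.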
SWE-Dev consists of 14,000 training samples and 500 test samples derived from over 1,000 open-source projects, with the test set manually curated and split into easy (250 instances) and hard (250 instances) difficulty levels. The dataset demonstrates substantial scale and complexity, with each sample requiring an average of 190 lines of code across 3 files and involving approximately 6 target functions for implementation. The average Project Requirement Document (PRD) contains 1,845 tokens, and each sample includes around 6 unit tests for functional evaluation.
@article{du2025swe,
title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
journal={arXiv preprint arXiv:2505.16975},
year={2025}
}