Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides every instance with a runnable environment and developer-authored, executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from the executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set brings a 7B model to parity with GPT-4o on the hard split, underscoring the value of its high-quality training data. Read more about SWE-Dev in our paper!
| # | Model | Params | Date | Pass@1 Easy (%) | Pass@1 Hard (%) | Pass@3 Easy (%) | Pass@3 Hard (%) |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3-8B (Qwen) | 8B | 2025-04-29 | 34.04 | 12.09 | 39.26 | 13.33 |
| 2 | Qwen3-8B 🧠 (Qwen) | 8B | 2025-04-29 | 19.47 | 6.36 | 25.91 | 9.22 |
| 3 | Qwen3-30B-A3B (Qwen) | 30B | 2025-04-29 | 35.84 | 12.76 | 39.45 | 15.20 |
| 4 | Qwen3-30B-A3B 🧠 (Qwen) | 30B | 2025-04-29 | 23.63 | 8.30 | 31.00 | 11.60 |
| 5 | o3 🧠 (OpenAI) | - | 2025-04-16 | 51.21 | 21.86 | 59.05 | 28.98 |
| 6 | QwQ-32B-Preview 🧠 (Qwen) | 32B | 2025-03-06 | 4.50 | 0.70 | 8.90 | 1.22 |
| 7 | Claude-3.7-Sonnet (Anthropic) | - | 2025-02-25 | 53.09 | 19.74 | 56.35 | 24.25 |
| 8 | Claude-3.7-Sonnet 🧠 (Anthropic) | - | 2025-02-25 | 49.47 | 22.51 | 56.58 | 29.28 |
| 9 | grok-3-beta 🧠 (xAI) | - | 2025-02-19 | 53.63 | 18.97 | 59.08 | 22.26 |
| 10 | DeepSeek-R1-distill-Qwen2.5-7B 🧠 (deepseek-ai) | 7B | 2025-01-20 | 6.30 | 1.29 | 10.30 | 1.95 |
| 11 | DeepSeek-R1-distill-Qwen2.5-32B 🧠 (deepseek-ai) | 32B | 2025-01-20 | 24.25 | 9.79 | 40.53 | 19.04 |
| 12 | DeepSeek-R1-distill-Llama-70B 🧠 (deepseek-ai) | 70B | 2025-01-20 | 32.73 | 8.19 | 45.72 | 11.33 |
| 13 | DeepSeek-R1 🧠 (deepseek-ai) | - | 2025-01-20 | 28.55 | 12.84 | 37.62 | 17.72 |
| 14 | DeepSeek-V3 (deepseek-ai) | - | 2024-12-26 | 41.95 | 16.22 | 56.79 | 21.62 |
| 15 | Phi-4 (microsoft) | 14B | 2024-12-12 | 21.99 | 5.57 | 27.89 | 8.56 |
| 16 | Llama-3.3-70B-Instruct (meta-llama) | 70B | 2024-12-06 | 33.84 | 12.85 | 39.57 | 14.95 |
| 17 | o1 🧠 (OpenAI) | - | 2024-12-05 | 36.36 | 11.09 | 43.77 | 14.27 |
| 18 | Qwen2.5-Coder-14B-Instruct (Qwen) | 14B | 2024-11-12 | 39.51 | 14.82 | 52.49 | 18.44 |
| 19 | Qwen2.5-1.5B-Instruct (Qwen) | 1.5B | 2024-09-19 | 8.05 | 1.23 | 10.76 | 2.22 |
| 20 | Qwen2.5-3B-Instruct (Qwen) | 3B | 2024-09-19 | 15.93 | 5.27 | 21.99 | 7.47 |
| 21 | Qwen2.5-7B-Instruct (Qwen) | 7B | 2024-09-19 | 25.74 | 6.68 | 33.35 | 7.73 |
| 22 | Qwen2.5-14B-Instruct (Qwen) | 14B | 2024-09-19 | 38.08 | 13.16 | 46.32 | 15.89 |
| 23 | Qwen2.5-32B-Instruct (Qwen) | 32B | 2024-09-19 | 43.64 | 10.15 | 51.24 | 11.69 |
| 24 | Qwen2.5-72B-Instruct (Qwen) | 72B | 2024-09-19 | 49.01 | 10.62 | 57.20 | 12.33 |
| 25 | Llama-3.1-8B-Instruct (meta-llama) | 8B | 2024-07-23 | 26.43 | 7.94 | 33.01 | 10.24 |
| 26 | GPT-4o-mini (OpenAI) | - | 2024-07-18 | 34.47 | 11.09 | 41.94 | 13.84 |
| 27 | DeepSeek-Coder-V2-Lite-Instruct (deepseek-ai) | - | 2024-06-17 | 21.53 | 8.19 | 29.68 | 11.33 |
| 28 | GPT-4o (OpenAI) | - | 2024-05-13 | 54.37 | 19.13 | 68.70 | 21.91 |
| # | Method | Date | Easy Pass@1 (%) | Easy Calls | Easy Price ($) | Hard Pass@1 (%) | Hard Calls | Hard Price ($) |
|---|---|---|---|---|---|---|---|---|
| 1 | Reflexion | 2023-10-10 | 39.77 | 2.12 | 0.83 | 13.32 | 2.18 | 1.35 |
| 2 | Self Refine | 2023-05-26 | 40.02 | 5.00 | 5.78 | 20.03 | 5.00 | 5.80 |
| 3 | Self Consistency | 2023-03-08 | 37.62 | 6.00 | 4.30 | 18.55 | 6.00 | 7.08 |
| 4 | LLM Debate | 2023-05-24 | 38.48 | 7.00 | 5.95 | 14.56 | 7.00 | 9.35 |
| 5 | MAD | 2024-10-09 | 31.50 | 7.00 | 2.48 | 15.31 | 7.00 | 3.40 |
| 6 | Agentverse | 2023-10-23 | 38.67 | 4.52 | 1.40 | 13.42 | 4.83 | 2.90 |
| 7 | EvoMAC | 2024-10-22 | 34.59 | 7.98 | 3.20 | 13.60 | 8.30 | 4.65 |
| 8 | MetaGPT | 2024-12-01 | 29.56 | 9.69 | 2.20 | 9.25 | 10.37 | 4.95 |
| 9 | MapCoder | 2024-05-19 | 24.55 | 21.01 | 6.05 | 5.87 | 23.41 | 10.55 |
| 10 | ChatDev | 2024-06-05 | 35.13 | 26.61 | 3.53 | 11.70 | 30.87 | 6.10 |
📍 We use 🧠 to denote reasoning models.
We compare Pass@3 scores for 17 chatbot LLMs and 10 reasoning LLMs on SWE-Dev.
We evaluate SWE-Dev's support for different training methods, including SFT and RL.
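As a minimal sketch of how the executable unit tests can supply an RL reward, the snippet below runs a pytest file inside a task's repository and returns the fraction of tests that pass. The invocation and reward shaping are illustrative assumptions, not necessarily the exact setup used in the paper.

```python
import re
import subprocess

def unit_test_reward(repo_dir: str, test_file: str, timeout: int = 600) -> float:
    """Run the developer-authored tests and return the fraction that pass.

    The pass rate in [0, 1] can be used directly as an RL reward, or
    binarized (1.0 only when every test passes) for a stricter signal.
    """
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", test_file, "-q", "--tb=no"],
            cwd=repo_dir, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hanging or non-terminating code earns no reward

    # Parse pytest's short summary, e.g. "3 passed, 2 failed in 1.21s".
    counts = {
        label: int(num)
        for num, label in re.findall(r"(\d+) (passed|failed|error)", result.stdout)
    }
    passed = counts.get("passed", 0)
    total = passed + counts.get("failed", 0) + counts.get("error", 0)
    return passed / total if total else 0.0
```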
MAS has shown promising results on SWE-Dev, and we further investigate how MAS can be trained on this dataset. As depicted in Fig. 7, the ground-truth test-case supervision in SWE-Dev enables EvoMAC to improve its performance across multiple rounds of reasoning. This iterative refinement motivates us to adopt EvoMAC as the MAS for training on SWE-Dev. We apply rejection sampling to enhance agent performance via role-wise training.
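A rough sketch of that rejection-sampling loop follows. Here `run_evomac` and `run_unit_tests` are hypothetical stand-ins for the actual EvoMAC pipeline and the Dockerized test harness: rollouts whose final repository fails the ground-truth tests are discarded, and the surviving dialogue turns are grouped by agent role to form role-wise SFT data.

```python
def collect_role_wise_sft_data(tasks, num_samples: int = 8) -> dict:
    """Rejection sampling for role-wise MAS training.

    Keeps only trajectories that pass the ground-truth unit tests, then
    splits the kept turns by agent role so each role can be fine-tuned on
    its own successful behaviour.
    """
    sft_data: dict[str, list] = {}  # role -> list of (prompt, response) pairs
    for task in tasks:
        for _ in range(num_samples):
            # run_evomac / run_unit_tests are hypothetical stand-ins, not
            # real APIs from the EvoMAC codebase.
            trajectory = run_evomac(task)                # one multi-agent rollout
            if not run_unit_tests(task, trajectory.final_repo):
                continue                                 # reject failing rollouts
            for turn in trajectory.turns:                # e.g. coder / tester roles
                sft_data.setdefault(turn.role, []).append(
                    (turn.prompt, turn.response)
                )
    return sft_data
```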
Step 1: Collect real-world repositories whose test files pass inside Dockerized environments.
Step 2: Trace test executions to construct function-level call trees linking each test case to the source code it invokes (a minimal tracing sketch follows these steps).
Step 3: Mask the core functions and generate refined PRDs to create tasks. Each sample includes an incomplete repository, a natural-language requirement, and executable test cases, enabling realistic, verifiable feature development.
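Step 2's dynamic tracing can be approximated with Python's built-in profiling hook. The sketch below records caller-to-callee edges while a test suite runs in-process, which is enough to link each test function to the source functions it exercises; it is an assumed approximation, not necessarily the exact tooling behind SWE-Dev.

```python
import sys
from collections import defaultdict

def trace_test_calls(run_tests):
    """Record caller -> callee edges while `run_tests` executes.

    `run_tests` is any zero-argument callable that runs the test suite
    in-process, e.g. `lambda: pytest.main(["tests/"])`. Walking the returned
    edge map from each test function recovers the call tree of source
    functions it invokes.
    """
    edges = defaultdict(set)

    def profiler(frame, event, arg):
        # "call" fires for every Python-level function call.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code
            callee = frame.f_code
            edges[(caller.co_filename, caller.co_name)].add(
                (callee.co_filename, callee.co_name)
            )

    sys.setprofile(profiler)   # install the profiling hook
    try:
        run_tests()
    finally:
        sys.setprofile(None)   # always remove the hook, even on test errors
    return edges
```

For example, `trace_test_calls(lambda: pytest.main(["tests/"]))` would map each test to the functions it touches; the `tests/` path is illustrative.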
SWE-Dev consists of 14,000 training samples and 500 test samples derived from over 1,000 open-source projects, with the test set manually curated and split into easy (250 instances) and hard (250 instances) difficulty levels. The dataset demonstrates substantial scale and complexity, with each sample requiring an average of 190 lines of code across 3 files and involving approximately 6 target functions for implementation. The average Project Requirement Document (PRD) contains 1,845 tokens, and each sample includes around 6 unit tests for functional evaluation.
@article{du2025swe,
title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
journal={arXiv preprint arXiv:2505.16975},
year={2025}
}