Research
Papers and publications from the Code|AI team.
DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models↗
2026-01-17 • Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu • NeurIPS 2025 Workshop
DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry, such as API usage and code purpose understanding. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. Nine state-of-the-art models were assessed, revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, a level of diagnostic detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.
SWE-bench Goes Live!↗
2025-05-29 • Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang • NeurIPS 2025
We present SWE-bench-Live, a live-updatable benchmark for issue-resolving tasks where models generate patches for real-world bugs. The initial release contains 1,319 tasks from 93 repositories, each paired with a dedicated Docker image for reproducible execution. The benchmark uses an automated curation pipeline to reduce manual bottlenecks and support continuous updates while mitigating overfitting and contamination risks. Evaluations of modern agent frameworks and LLMs show a substantial performance gap versus static benchmarks, highlighting the challenge of dynamic, real-world software maintenance tasks.
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation↗
2025-01-27 • Xing Zhang, Jiaheng Wen, Fangkai Yang, Pu Zhao, Yu Kang, Junhao Wang, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang • Findings of ACL: EMNLP 2025
We introduce Skeleton-Guided-Translation, a repository-level Java-to-C# code translation benchmarking framework with fine-grained quality evaluation. The framework translates repository skeletons first and then refines full repositories guided by those skeletons. Built on this method, TRANSREPO-BENCH provides high-quality Java repositories paired with C# skeletons, unit tests, and build configurations, alongside adaptive tests that support incremental translation. The paper also proposes fine-grained per-test-case metrics to better diagnose translation quality beyond binary build outcomes, and shows improvements in dependency handling and interface consistency.
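The skeleton-first idea can be illustrated with a minimal sketch: strip function bodies from source code so the structural shell (classes, signatures) can be translated before any implementation detail. This example uses Python's `ast` module on Python source purely for illustration; the paper itself targets Java-to-C# repositories, and this helper is a hypothetical stand-in, not the framework's actual tooling.

```python
import ast


def extract_skeleton(source: str) -> str:
    """Keep classes and function signatures; replace each body with a stub."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Pass()]  # drop the implementation, keep the signature
    return ast.unparse(tree)


src = """
class Stack:
    def push(self, x):
        self.items.append(x)

    def pop(self):
        return self.items.pop()
"""
# The resulting skeleton preserves the class and method signatures only.
print(extract_skeleton(src))
```

Translating this shell first fixes interfaces and dependencies across the repository, so each body can later be filled in against a stable, already-translated structure.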
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale↗
2025-01-23 • Linghao Zhang, Junhao Wang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Jiaheng Wen, Chengxing Xie, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang • Findings of ACL 2025
DI-BENCH is a large-scale benchmark for evaluating LLM capability in repository-level dependency inference, including identifying required internal components and external packages. It contains 581 repositories with testable environments across Python, C#, Rust, and JavaScript. The benchmark combines textual and execution-based metrics and highlights that even strong models achieve limited execution success, such as around 48% pass rate on Python in reported experiments. This work surfaces dependency inference as a major bottleneck for robust end-to-end software synthesis.
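A textual metric for dependency inference can be sketched as set overlap between the packages a model predicts and the repository's ground-truth dependencies. The function below is an illustrative assumption, not DI-BENCH's actual scoring code; it shows one plausible shape such a metric could take.

```python
def dependency_f1(predicted: set[str], ground_truth: set[str]) -> float:
    """F1 score over predicted vs. true external package names."""
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)  # correctly inferred packages
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


truth = {"requests", "numpy", "pandas"}
pred = {"requests", "numpy", "scipy"}  # one missed, one hallucinated
print(round(dependency_f1(pred, truth), 3))  # prints 0.667
```

Execution-based metrics then go a step further: they install the inferred dependencies and run the repository's test suite, which is where the reported ~48% Python pass rate comes from.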
SUT: Active Defects Probing for Transcompiler Models↗
2023-12-01 • Mengnan Qi, Yufan Huang, Maoquan Wang, Yongqiang Yao, Zihan Liu, Bin Gu, Colin Clement, Neel Sundaresan • EMNLP 2023
This paper introduces SUT (Syntactic Unit Tests), an active defects probing suite for evaluating program translation models. It focuses on elementary syntax errors that are often missed by standard metrics such as BLEU, CodeBLEU, and computation accuracy. The proposed evaluation harness provides interpretable test scoring and shows that even strong models like ChatGPT struggle on these syntactic tests, with substantially lower pass rates than on prior translation benchmarks. The benchmark helps expose fine-grained syntactic weaknesses in transcompiler systems.
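The spirit of a syntactic unit test is that each probe targets one elementary construct, so a failure localizes the defect rather than averaging it away as BLEU-style metrics do. Below is a minimal hypothetical sketch: each named probe checks whether a translated snippet parses as valid Python. The probe set and pass criterion are illustrative assumptions, not SUT's actual harness.

```python
def syntax_probe(snippet: str) -> bool:
    """Return True if the snippet parses as valid Python."""
    try:
        compile(snippet, "<translated>", "exec")
        return True
    except SyntaxError:
        return False


# Each probe isolates one construct a translation model must get right.
probes = {
    "for_loop": "for i in range(3):\n    print(i)",
    "ternary": "x = 1 if True else 0",
    "bad_translation": "int x = 5;",  # C-style declaration left untranslated
}
results = {name: syntax_probe(code) for name, code in probes.items()}
print(results)  # the "bad_translation" probe fails
```

Scoring per probe yields an interpretable defect profile (e.g. "fails on declarations but passes loops"), which is the kind of fine-grained signal the paper argues aggregate metrics miss.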
Program Translation via Code Distillation↗
2023-10-17 • Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, Neel Sundaresan • EMNLP 2023
This paper proposes Code Distillation (CoDist), a language-agnostic intermediate representation that acts as a translation pivot for program translation. By distilling source code into a semantic and structural form, the method constructs scalable parallel corpora without requiring manually aligned training data. The approach addresses challenges in noisy snippet alignment and intermediate-representation diversity found in prior unsupervised translation methods. Experiments report state-of-the-art results on CodeXGLUE and TransCoder GeeksForGeeks benchmarks, including substantial average gains over strong baselines.