Empower every developer to help shape the future with AI
A team at Microsoft CoreAI researching and building the future of AI-powered development tools and experiences.
Latest from the Blog
Welcome to Code|AI
Introducing the Code|AI team blog — where we share our journey building AI-powered tools for every developer on the planet.
Building Long-Distance Next Edit Suggestions
How we extended GitHub Copilot's Next Edit Suggestions to predict and suggest edits anywhere in a file — not just near the cursor — using a multi-model approach.
Evolving GitHub Copilot's Next Edit Suggestions Through Custom Model Training
How we evolved next edit suggestions in GitHub Copilot through custom model training techniques.
Recent Research
DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models
DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry, such as API usage and code purpose understanding. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. Nine state-of-the-art models were assessed, revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, a level of detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.
Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu
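To give a rough sense of how the three signal types named in the DevBench abstract above (functional correctness, similarity-based metrics, and LLM-judge ratings) might be blended into one diagnostic number, here is a minimal sketch. The class, field names, and weights are hypothetical illustrations, not DevBench's actual scoring code.

```python
# Illustrative only: a hypothetical scorer combining the three evaluation signals
# described in the abstract. None of these names come from DevBench itself.
from dataclasses import dataclass

@dataclass
class CompletionResult:
    passed_tests: bool        # functional correctness: did the completion pass its tests?
    similarity: float         # similarity-based metric (e.g. normalized edit similarity), 0..1
    judge_usefulness: float   # LLM-judge rating of usefulness, rescaled to 0..1
    judge_relevance: float    # LLM-judge rating of contextual relevance, rescaled to 0..1

def combined_score(r: CompletionResult,
                   w_func: float = 0.5,
                   w_sim: float = 0.2,
                   w_judge: float = 0.3) -> float:
    """Blend the three signals into a single diagnostic score (hypothetical weighting)."""
    judge = (r.judge_usefulness + r.judge_relevance) / 2
    return w_func * float(r.passed_tests) + w_sim * r.similarity + w_judge * judge

# A completion that fails its tests can still earn partial credit from the
# similarity and judge signals, which is what makes per-instance diagnostics useful.
print(combined_score(CompletionResult(False, 0.82, 0.7, 0.9)))
```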
SWE-bench Goes Live!
We present SWE-bench-Live, a live-updatable benchmark for issue-resolving tasks where models generate patches for real-world bugs. The initial release contains 1,319 tasks from 93 repositories, each paired with a dedicated Docker image for reproducible execution. The benchmark uses an automated curation pipeline to reduce manual bottlenecks and support continuous updates while mitigating overfitting and contamination risks. Evaluations of modern agent frameworks and LLMs show a substantial performance gap versus static benchmarks, highlighting the challenge of dynamic, real-world software maintenance tasks.
Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
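For intuition about the reproducible, per-task Docker execution mentioned in the SWE-bench-Live abstract above, the sketch below shows what a container-based evaluation step might look like: apply a model-generated patch inside the task's image and let the task's tests decide pass or fail. The image tag, paths, and commands are placeholders, not the benchmark's actual harness.

```python
# Illustrative only: a minimal, hypothetical harness for checking a candidate patch
# inside a per-task Docker image. Everything named here is a placeholder.
import subprocess
import tempfile
from pathlib import Path

def evaluate_patch(task_image: str, patch_text: str, test_cmd: str = "pytest -q") -> bool:
    """Apply a patch inside the task's container and report whether its tests pass."""
    with tempfile.TemporaryDirectory() as tmp:
        patch_file = Path(tmp) / "candidate.patch"
        patch_file.write_text(patch_text)
        # Mount the patch read-only, apply it to the repository baked into the image,
        # then run the task's test command. Exit code 0 means the issue is resolved.
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{patch_file}:/tmp/candidate.patch:ro",
             task_image,
             "bash", "-lc",
             f"git apply /tmp/candidate.patch && {test_cmd}"],
            capture_output=True,
        )
    return result.returncode == 0

# Usage (placeholder image tag):
# resolved = evaluate_patch("issue-task-00042:latest", patch_text)
```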
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
We introduce Skeleton-Guided-Translation, a repository-level Java-to-C# code translation benchmarking framework with fine-grained quality evaluation. The framework translates repository skeletons first and then refines full repositories guided by those skeletons. Built on this method, TRANSREPO-BENCH provides high-quality Java repositories paired with C# skeletons, unit tests, and build configurations, alongside adaptive tests that support incremental translation. The paper also proposes fine-grained per-test-case metrics to better diagnose translation quality beyond binary build outcomes, and shows improvements in dependency handling and interface consistency.
Xing Zhang, Jiaheng Wen, Fangkai Yang, Pu Zhao, Yu Kang, Junhao Wang, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
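A repository "skeleton" in the sense used above is the structure of the code with the bodies removed. The paper works on Java-to-C# repositories; the self-contained sketch below uses Python's own ast module purely to illustrate the idea of keeping signatures while dropping implementations, and is not part of the framework itself.

```python
# Illustrative only: build a "skeleton" by replacing every function body with `...`,
# keeping classes and signatures intact. Python stand-in for the Java/C# setting.
import ast

SOURCE = """
class Cache:
    def get(self, key: str) -> str | None:
        return self._store.get(key)

    def put(self, key: str, value: str) -> None:
        self._store[key] = value
"""

class SkeletonBuilder(ast.NodeTransformer):
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        # Keep the signature and annotations; drop the implementation.
        node.body = [ast.Expr(value=ast.Constant(value=...))]
        return node

tree = SkeletonBuilder().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # prints the class with method signatures but no bodies
```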
Build the future with us
We're looking for passionate researchers and engineers who want to shape how developers work with AI. Join Code|AI at Microsoft.
View Open Positions