AutoDev Coder

TL;DR:

The first barely usable version of AutoDev Coder 6.7B, a coding LLM for AutoDev, is now available.

PS: AutoDev 1.5.1 is still awaiting approval on the JetBrains Marketplace, because our colleagues abroad are still on vacation after the holidays. Once it is published, the model will perform slightly better on version 1.5.1 than on 1.5.0.

Additionally, once we have more computing power and better completion testing in place, we will reintroduce the original Inlay completion mode.

AutoDev Coder 6.7B v1 Experimental Version

The current version is fine-tuned from the DeepSeek Coder 6.7B Instruct model, which uses the LLaMA architecture.

Note: As an experimental version, its primary purpose is to align the model, data tools, and IDE plugin for better coordination. Generation quality still requires further improvement.

AutoDev Coder 64k Dataset

The instruction composition of AutoDev Coder v1 64k is as follows:

Filename                                  Selected Instructions
java_oss.jsonl                            4000
python_oss.jsonl                          4000
code_bugfix_cleaned_5K.json               4000
codeGPT_CN_cleaned_20K.json               15000
code_summarization_CN_cleaned_10K.json    8000
code_generation_CN_cleaned_5K.json        4000
summary.jsonl                             25000
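
As a quick sanity check, the per-file counts listed above add up to the 64k in the dataset name. The sketch below shows one possible way to assemble such a mix from records that are already loaded in memory. The function name `select_instructions` and the use of random sampling are assumptions made for illustration; the post does not say how the instructions were actually selected.

```python
import random

# Per-file quotas copied from the composition table above.
QUOTAS = {
    "java_oss.jsonl": 4000,
    "python_oss.jsonl": 4000,
    "code_bugfix_cleaned_5K.json": 4000,
    "codeGPT_CN_cleaned_20K.json": 15000,
    "code_summarization_CN_cleaned_10K.json": 8000,
    "code_generation_CN_cleaned_5K.json": 4000,
    "summary.jsonl": 25000,
}


def select_instructions(sources, quotas, seed=0):
    """Pick up to quotas[name] records from each source.

    `sources` maps a file name to a list of already-loaded records.
    Random sampling is an assumption, not the documented selection method.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    selected = []
    for name, quota in quotas.items():
        records = sources.get(name, [])
        take = min(quota, len(records))  # never ask for more than exists
        selected.extend(rng.sample(records, take))
    return selected


# The quotas sum to 64000, matching the "64k" in the dataset name.
print(sum(QUOTAS.values()))
```

Capping each quota at the number of available records keeps the sketch from failing when a source file holds fewer records than its quota.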

The summary.jsonl is generated by our open-source code fine-tuning data framework UnitGen (https://github.com/unit-mesh/unit-gen).

We selected dozens of Java and Kotlin open-source projects, generating instructions based on AutoDev plugin requirements, mainly categorized into three types:

  • Completion (inline, interline, interblock)
  • Documentation generation
  • Comment generation

Detailed documentation can be found in the UnitGen project: https://github.com/unit-mesh/unit-gen.
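
To make the completion category more concrete, here is a minimal sketch of how inline and interline samples could be cut from a snippet. It is not UnitGen's implementation: the function names, the `type`/`input`/`output` keys, and the exact meaning given to "inline" (split within one line) and "interline" (hide a whole line) are assumptions for illustration. Interblock completion is left out.

```python
def inline_sample(line, cursor):
    """Split one line at `cursor`: text before it is the prompt, the rest is the target."""
    return {"type": "inline", "input": line[:cursor], "output": line[cursor:]}


def interline_sample(lines, index):
    """Hide line `index`: the earlier lines are the prompt, the hidden line is the target."""
    return {
        "type": "interline",
        "input": "\n".join(lines[:index]),
        "output": lines[index],
    }


# A small Kotlin function used only as example input.
code = [
    "fun add(a: Int, b: Int): Int {",
    "    return a + b",
    "}",
]

print(inline_sample(code[1], len("    return ")))
print(interline_sample(code, 1))
```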

FAQ: AutoDev Coder Model Evaluation

The evaluation is still being designed. Because we need to combine AutoDev instructions with languages such as Java, Kotlin, and TypeScript, rather than the Python-centric benchmarks commonly used for open-source models, we need to rethink our evaluation approach.

Initially, we used instruction sets such as OSS Instruct to supplement natural-language-to-code generation, but found that ~50,000 of the instructions (about 50%) were Python-related. After filtering, only ~5,000 Java instructions remained, and these gave suboptimal results in AutoDev.
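
The language check described above can be sketched as follows. This assumes each instruction record is a dictionary with a language label under a `lang` key; that key name, the helper names, and the sample records are hypothetical and may not match the actual dataset files.

```python
from collections import Counter


def language_breakdown(records, field="lang"):
    """Count records per language; `field` is an assumed key name."""
    return Counter(str(r.get(field, "unknown")).lower() for r in records)


def keep_language(records, language, field="lang"):
    """Return only the records whose language label matches `language`."""
    wanted = language.lower()
    return [r for r in records if str(r.get(field, "")).lower() == wanted]


# Made-up records, only to show the shape of the data.
sample = [
    {"lang": "python", "problem": "reverse a list"},
    {"lang": "Java", "problem": "parse a date"},
    {"lang": "python", "problem": "read a file"},
    {"lang": "typescript", "problem": "debounce a call"},
]

print(language_breakdown(sample))
print(keep_language(sample, "java"))
```

Running the breakdown before filtering shows how skewed a dataset is toward one language, which is the kind of imbalance that motivated the filtering.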

FAQ: AutoDev Instructions

AutoDev handles instructions with context strategies that differ from those of other tools. Details: https://github.com/unit-mesh/auto-dev