AutoDev Coder
TL;DR:
The first barely usable version of AutoDev Coder 6.7B, a coding LLM for AutoDev, is now available.
- HuggingFace homepage: https://huggingface.co/unit-mesh (temporarily unable to provide direct downloads due to certification requirements 🐶🐶).
- Dataset download: https://huggingface.co/datasets/unit-mesh/autodev-datasets
PS: AutoDev 1.5.1 is still awaiting approval on the JetBrains Marketplace, because colleagues abroad are still on vacation after the holidays. Once it is released, the model should perform slightly better on 1.5.1 than on 1.5.0.
In addition, once we have more computing power and better completion testing in place, we will reintroduce the original Inlay completion mode.
AutoDev Coder 6.7B v1 Experimental Version
The current version is fine-tuned from the DeepSeek Coder 6.7B Instruct model, which uses the LLaMA architecture.
Note: As an experimental version, its primary purpose is to align the model, data tools, and IDE plugin for better coordination. Generation quality still requires further improvement.
AutoDev Coder 64k Dataset
The instruction composition of AutoDev Coder v1 64k is as follows:
Filename | Selected Instructions |
---|---|
java_oss.jsonl | 4000 |
python_oss.jsonl | 4000 |
code_bugfix_cleaned_5K.json | 4000 |
codeGPT_CN_cleaned_20K.json | 15000 |
code_summarization_CN_cleaned_10K.json | 8000 |
code_generation_CN_cleaned_5K.json | 4000 |
summary.jsonl | 25000 |
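
As a rough illustration of how a mix like the one in the table could be assembled, the sketch below samples a fixed quota from each source file and checks the total. The quotas are copied from the table; the `sample_mix` helper, the seed, and the in-memory record lists are assumptions made for this example and are not the project's actual pipeline.

```python
import random

# Per-file quotas copied from the table above.
QUOTAS = {
    "java_oss.jsonl": 4000,
    "python_oss.jsonl": 4000,
    "code_bugfix_cleaned_5K.json": 4000,
    "codeGPT_CN_cleaned_20K.json": 15000,
    "code_summarization_CN_cleaned_10K.json": 8000,
    "code_generation_CN_cleaned_5K.json": 4000,
    "summary.jsonl": 25000,
}


def sample_mix(sources, quotas, seed=42):
    """Hypothetical helper: draw `quota` records from each source, then shuffle.

    `sources` maps a filename to a list of already-loaded records.
    """
    rng = random.Random(seed)
    mixed = []
    for name, quota in quotas.items():
        records = sources[name]
        if len(records) < quota:
            raise ValueError(f"{name} has {len(records)} records, need {quota}")
        # Sample without replacement so no record is duplicated in the mix.
        mixed.extend(rng.sample(records, quota))
    rng.shuffle(mixed)
    return mixed


print(sum(QUOTAS.values()))  # 64000
```

The quotas sum to 64,000, which matches the "64k" in the dataset name.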
summary.jsonl is generated by UnitGen (https://github.com/unit-mesh/unit-gen), our open-source framework for building code fine-tuning data.
We selected dozens of open-source Java and Kotlin projects and generated instructions based on the AutoDev plugin's requirements. The instructions fall into three main types:
- Completion (inline, interline, interblock)
- Documentation generation
- Comment generation
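
To make the completion category more concrete, here is a minimal sketch of how an inline completion sample might be derived from existing code: everything up to a cursor position becomes the prompt, and the rest of that line becomes the expected output. The `CompletionSample` type and `inline_sample` function are hypothetical names for illustration only; UnitGen's real instruction builders and output format are described in its own documentation and may differ.

```python
from dataclasses import dataclass


@dataclass
class CompletionSample:
    kind: str      # e.g. "inline"; label chosen for this sketch
    prompt: str    # code before the cursor
    expected: str  # text the model should produce


def inline_sample(code: str, line_no: int, column: int) -> CompletionSample:
    """Split `code` at (line_no, column); line_no is 0-based (assumption)."""
    lines = code.splitlines()
    target = lines[line_no]
    # Prompt = all earlier lines plus the part of the target line before the cursor.
    prompt = "\n".join(lines[:line_no] + [target[:column]])
    # Expected completion = the remainder of the target line only.
    return CompletionSample("inline", prompt, target[column:])


kotlin_code = "fun add(a: Int, b: Int): Int {\n    return a + b\n}"
sample = inline_sample(kotlin_code, line_no=1, column=11)
print(repr(sample.expected))  # 'a + b'
```

Only the inline case is sketched here; the interline and interblock variants would need their own splitting rules.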
Detailed documentation can be found in the UnitGen project: https://github.com/unit-mesh/unit-gen.
FAQ: AutoDev Coder Model Evaluation
The evaluation is still being designed. AutoDev instructions need to be combined with languages such as Java, Kotlin, and TypeScript, whereas the evaluation systems commonly used for open-source models are Python-centric, so we need to rethink our evaluation approach.
Initially, we used instruction sets such as OSS Instruct to supplement natural-language-to-code generation, but found that ~50,000 instructions (about 50%) were Python-related. After filtering, only ~5,000 Java instructions remained, and these gave suboptimal results in AutoDev.
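
The language breakdown and filtering step described above can be sketched as follows. This is a minimal example under stated assumptions: each record is a dict carrying its language in a `lang` field (a hypothetical field name here), and both helper functions are illustrative rather than the code actually used.

```python
from collections import Counter


def language_share(records, field="lang"):
    """Return the fraction of records per language (case-insensitive)."""
    counts = Counter(r.get(field, "unknown").lower() for r in records)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def keep_languages(records, wanted, field="lang"):
    """Keep only records whose language is in `wanted`."""
    wanted = {w.lower() for w in wanted}
    return [r for r in records if r.get(field, "").lower() in wanted]


# Tiny made-up sample, only to show the calls.
records = [
    {"lang": "Python", "instruction": "..."},
    {"lang": "python", "instruction": "..."},
    {"lang": "Java", "instruction": "..."},
    {"lang": "Kotlin", "instruction": "..."},
]
print(language_share(records)["python"])              # 0.5
print(len(keep_languages(records, ["java", "kotlin"])))  # 2
```

Checking the per-language share before filtering makes it easy to see how little data will remain for the target languages.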
FAQ: AutoDev Instructions
AutoDev employs contextual strategies that differ from other tools in instruction handling. Details: https://github.com/unit-mesh/auto-dev