AutoDev Coder

January 7, 2024 · 2 min read

TL;DR:

The first barely usable version of AutoDev Coder 6.7B, a coding LLM for AutoDev, is now available.

HuggingFace homepage: https://huggingface.co/unit-mesh (direct download links are temporarily unavailable due to certification requirements 🐶🐶).
Dataset download: https://huggingface.co/datasets/unit-mesh/autodev-datasets

PS: AutoDev 1.5.1 is still awaiting approval on the JetBrains Marketplace, because our colleagues abroad are still on holiday. The model performs slightly better with version 1.5.1 than with 1.5.0.

In addition, as computing resources and completion testing improve, we will reintroduce the original Inlay completion mode.

AutoDev Coder 6.7B v1 Experimental Version

The current version is fine-tuned from the DeepSeek Coder 6.7B Instruct model, which uses the LLaMA architecture.

Note: As an experimental version, its primary purpose is to align the model, data tools, and IDE plugin for better coordination. Generation quality still requires further improvement.

AutoDev Coder 64k Dataset

The instruction composition of AutoDev Coder v1 64k is as follows:

Filename	Selected Instructions
java_oss.jsonl	4000
python_oss.jsonl	4000
code_bugfix_cleaned_5K.json	4000
codeGPT_CN_cleaned_20K.json	15000
code_summarization_CN_cleaned_10K.json	8000
code_generation_CN_cleaned_5K.json	4000
summary.jsonl	25000
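
As a rough illustration of how the composition above adds up to 64k, the following sketch records the per-file counts from the table and draws a fixed-size random sample from a list of records. The helper name `sample_instructions` and the seed value are assumptions made for this example; this is not the actual selection script used to build the dataset.

```python
import random

# Per-file instruction counts, copied from the table above.
COMPOSITION = {
    "java_oss.jsonl": 4000,
    "python_oss.jsonl": 4000,
    "code_bugfix_cleaned_5K.json": 4000,
    "codeGPT_CN_cleaned_20K.json": 15000,
    "code_summarization_CN_cleaned_10K.json": 8000,
    "code_generation_CN_cleaned_5K.json": 4000,
    "summary.jsonl": 25000,
}


def sample_instructions(records, count, seed=42):
    """Pick `count` records at random without replacement.

    Hypothetical helper: the real selection procedure is not described
    in this post, so uniform random sampling is an assumption.
    """
    if count > len(records):
        raise ValueError(f"requested {count} records, only {len(records)} available")
    return random.Random(seed).sample(records, count)


# Total number of selected instructions across all files.
total = sum(COMPOSITION.values())

# Share of each file in the final mix, as a percentage.
share = {name: round(100 * n / total, 1) for name, n in COMPOSITION.items()}

if __name__ == "__main__":
    print(f"total instructions: {total}")
    for name, pct in share.items():
        print(f"{name}: {pct}%")
```

The total matches the "64k" in the dataset name, and the UnitGen-generated summary.jsonl is the largest single source.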

The summary.jsonl is generated by our open-source code fine-tuning data framework UnitGen (https://github.com/unit-mesh/unit-gen).

We selected dozens of open-source Java and Kotlin projects and generated instructions based on the AutoDev plugin's requirements. The instructions fall mainly into three types:

Completion (inline, interline, interblock)
Documentation generation
Comment generation

Detailed documentation can be found in the UnitGen project: https://github.com/unit-mesh/unit-gen.

FAQ: AutoDev Coder Model Evaluation

The evaluation is still being designed. Because we need to combine AutoDev instructions with languages such as Java, Kotlin, and TypeScript, rather than rely on the Python-centric evaluation systems commonly used for open-source models, we need to rethink our evaluation approach.

Initially, we used instruction sets such as OSS Instruct to supplement natural-language-to-code generation, but found that ~50,000 instructions (about 50%) were Python-related. After filtering, only ~5,000 Java instructions remained, and they produced suboptimal results in AutoDev.
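
The kind of language filtering described above can be sketched as follows. This is a minimal example that assumes each JSONL record carries a `lang` field naming its programming language. That field name, the helper functions, and the sample records are all assumptions for illustration, not the actual filtering code or data.

```python
import json
from collections import Counter


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def language_counts(records, key="lang"):
    """Count records per language; `key` is an assumed field name."""
    return Counter(str(r.get(key, "unknown")).lower() for r in records)


def keep_languages(records, wanted, key="lang"):
    """Keep only records whose language is in `wanted` (case-insensitive)."""
    wanted = {w.lower() for w in wanted}
    return [r for r in records if str(r.get(key, "")).lower() in wanted]


# Made-up sample records, only to exercise the helpers.
SAMPLE = "\n".join([
    json.dumps({"lang": "python", "problem": "Reverse a list."}),
    json.dumps({"lang": "java", "problem": "Implement a stack."}),
    json.dumps({"lang": "python", "problem": "Parse a CSV file."}),
    json.dumps({"lang": "typescript", "problem": "Debounce a function."}),
])

records = load_jsonl(SAMPLE)
counts = language_counts(records)
java_only = keep_languages(records, ["java"])

if __name__ == "__main__":
    print(counts.most_common())
    print(len(java_only), "Java records kept")
```

Counting by language before filtering makes it easy to see how skewed a source dataset is toward Python and how little remains for the target languages.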

FAQ: AutoDev Instructions

AutoDev employs contextual strategies that differ from other tools in instruction handling. Details: https://github.com/unit-mesh/auto-dev
