DCL Codepilot – LLM Fine Tuning on SDK7

Hey everyone,

Recently, I experimented with ways to get a large language model to generate proper Decentraland SDK7 scene code. I tried GPT-4, LLaMA-2, Gemini, Claude, and Code Llama, but I hit a roadblock.

Strategy 1: Prompting (did not work)
First, I tried prompt engineering. With this approach, you give the AI as much information and guidance as fits within the token limit. The restricted context window caps how much documentation you can include, though. On top of that, the AI often ignores the new coding patterns and reverts to older ones, because the older patterns dominate its training data. This approach only works if you hand-feed the documentation and examples every time you want to generate something.
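To make the context-window problem concrete, here is a minimal Python sketch of packing documentation snippets into a prompt until a token budget runs out. The function names, the "about 4 characters per token" estimate, and the budget value are my own assumptions for illustration; real tokenizers count differently and none of this is part of any SDK or model API.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token. Real tokenizers differ.
    return max(1, len(text) // 4)


def build_prompt(task: str, snippets: list[str], token_budget: int):
    """Pack snippets into the prompt, in order, while they still fit the budget.

    Returns the assembled prompt and the list of snippets that did not fit.
    """
    parts = [task]
    used = estimate_tokens(task)
    dropped = []
    for snippet in snippets:
        cost = estimate_tokens(snippet)
        if used + cost <= token_budget:
            parts.append(snippet)
            used += cost
        else:
            # This snippet never reaches the model, so it cannot influence the output.
            dropped.append(snippet)
    return "\n\n".join(parts), dropped
```

Anything that lands in `dropped` is documentation the model never sees for that request, which is why the hand-feeding has to be redone and re-prioritized for every new task.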

Strategy 2: RAG (did not work)
My next step was to implement a RAG (Retrieval-Augmented Generation) system. ChatGPT uses the same approach for custom GPTs that operate on your own data. You give it the documentation, and the AI retrieves the relevant paragraph through an API. The returned excerpt is added to the context window of the prompt. This method is excellent for retrieving specific knowledge, such as facts, numbers, and explanations, but it is less effective when you need to reference multiple functions. Essentially, it’s an automated form of hand-feeding the documentation. RAG also does not address the main issue, which is learning the new coding patterns of SDK7: the model still tends to revert to the SDK6 style.
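The retrieve-then-augment loop described above can be sketched roughly like this. To keep it dependency-free I score paragraphs by plain word overlap, which is my stand-in assumption; a real RAG system would typically use vector embeddings and a search index. The function names and the sample paragraphs in the usage example are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    # Lowercased word counts. A real system would use embeddings instead.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query: str, paragraphs: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k paragraphs that share the most words with the query."""
    query_words = tokenize(query)

    def score(paragraph: str) -> int:
        # Counter intersection keeps the minimum count of each shared word.
        return sum((query_words & tokenize(paragraph)).values())

    return sorted(paragraphs, key=score, reverse=True)[:top_k]


def augment_prompt(query: str, paragraphs: list[str], top_k: int = 1) -> str:
    """Prepend the retrieved excerpts to the task, as a RAG system would."""
    excerpts = retrieve(query, paragraphs, top_k)
    return "Documentation excerpts:\n" + "\n".join(excerpts) + "\n\nTask: " + query


# Usage with made-up documentation paragraphs:
docs = [
    "Use a transform component to set the position of an entity.",
    "Pointer events let the player click on an entity.",
    "Audio sources play sound clips in the scene.",
]
prompt = augment_prompt("How do I set the position of an entity?", docs)
```

Note that the retrieved excerpt only adds text to the prompt. The model's weights stay untouched, so its habit of writing SDK6-style code is not changed by this step.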

Strategy 3: Fine-Tuning (looks promising)
Then I looked deeper into how to train my own LLM. It turns out that you don’t need to train it from scratch. It is sufficient to take an open-source model like Llama 3 and train only the upper layers. This process is called fine-tuning. With this approach, it is possible to teach the model new patterns, such as a programming language.
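As a toy illustration of "train only the upper layers", here is a two-parameter stand-in in plain Python: a frozen lower weight plays the role of the pretrained layers, and gradient descent updates only the upper weight. This is purely a sketch of the idea under my own assumptions (the model shape, learning rate, and data are invented); real fine-tuning of a model like Llama 3 would use a deep-learning framework and billions of parameters.

```python
def fine_tune_top_layer(w_lower, w_upper, data, lr=0.01, epochs=200):
    """Toy model: prediction = w_upper * (w_lower * x).

    w_lower is treated as a frozen, pretrained layer.
    Only w_upper is updated, by gradient descent on the squared error.
    """
    for _ in range(epochs):
        for x, target in data:
            hidden = w_lower * x                # output of the frozen layer
            error = w_upper * hidden - target
            grad_upper = 2 * error * hidden     # d(error**2) / d(w_upper)
            w_upper -= lr * grad_upper
            # No update to w_lower: the lower layer keeps its pretrained value.
    return w_lower, w_upper


# Usage: the frozen layer doubles its input, and the new task is y = 6 * x,
# so the upper weight has to move from its starting value towards 3.
data = [(1.0, 6.0), (2.0, 12.0), (3.0, 18.0)]
w_lower, w_upper = fine_tune_top_layer(w_lower=2.0, w_upper=1.0, data=data)
```

The point of the sketch is that the new behavior is learned by changing weights, not by adding text to the prompt, while most of the model stays exactly as it was.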

Why am I writing this? I’ve already attempted to fine-tune a model on my Mac laptop. While it’s feasible, it’s not very efficient, and I’m concerned about potential damage to the hardware from the intense load. That’s why I’m thinking of creating a proposal to rent specialized hardware and work on this project full-time.

Additionally, I aim to establish a pipeline that retrieves documentation and SDK7 repositories from GitHub, converts them into synthetic training data, and continuously retrains the DCL-Copilot so that it stays up to date. The final open-source models will be compressed, allowing them to run on consumer-grade hardware.
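One possible shape for the "convert into synthetic training data" step, sketched in Python: take a Markdown documentation page, and turn every level-2 section that contains a code block into an instruction/output pair written as JSONL. The field names, the heading level, and the function names are my assumptions for illustration, not a fixed format, and fetching the files from GitHub is left out.

```python
import json
import re

FENCE = "`" * 3  # built this way so the sketch can itself sit inside a fenced block
CODE_BLOCK = re.compile(FENCE + r"\w*\n(.*?)" + FENCE, re.DOTALL)


def doc_to_examples(markdown: str) -> list[dict]:
    """Turn each '## heading' section that contains a code block into one example."""
    examples = []
    # Split on level-2 headings; the chunk before the first heading is skipped.
    for section in re.split(r"^## ", markdown, flags=re.MULTILINE)[1:]:
        heading, _, body = section.partition("\n")
        match = CODE_BLOCK.search(body)
        if match:
            examples.append({
                "instruction": heading.strip(),    # the section title becomes the prompt
                "output": match.group(1).strip(),  # the first code block becomes the answer
            })
    return examples


def to_jsonl(examples: list[dict]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)
```

Sections without a code block are skipped in this sketch. A fuller pipeline would probably also want to deduplicate examples and filter out SDK6-style snippets before training.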

You can check out what I’ve done so far on GitHub:

To make the model more useful, I am looking for someone with experience building Visual Studio Code plugins who could integrate it into the editor to auto-generate code, similar to GitHub Copilot.

Also, anyone interested in AI and automation is welcome to join 🙂

Any questions, thoughts, or objections? I would love to hear your feedback!