Since the release of Mixtral-8x7B by Mistral AI, there has been renewed interest in mixture of experts (MoE) models. This architecture relies on several expert sub-networks, of which only a few are selected and activated by a router network during inference.
Model merging is a technique that combines two or more LLMs into a single model. It is a relatively new and experimental method to create new models cheaply (no GPU required). Model merging works surprisingly well and has produced many state-of-the-art models on the Open LLM Leaderboard.
The MoE architecture is so simple and flexible that it is easy to build a custom MoE. On the Hugging Face Hub, we can now find several trending LLMs that are custom MoEs, such as mlabonne/phixtral-4x2_8.
Model architecture of mlabonne/phixtral-4x2_8:
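One way to view this architecture is to load the model and print its module tree. A minimal sketch is shown below (it downloads the full checkpoint, so it needs sufficient RAM and disk; the dtype choice is an assumption):

```python
from transformers import AutoModelForCausalLM

# Minimal sketch: load phixtral and print its module tree.
# trust_remote_code is required because phixtral ships custom MoE modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "mlabonne/phixtral-4x2_8",
    torch_dtype="auto",
    trust_remote_code=True,
)
print(model)  # shows the shared attention blocks and the expert MLPs
```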
From the printed architecture, we can see an MoE with four MLP blocks, i.e., four expert sub-networks. Only the MLP modules are specific to each expert.
However, most of them are not traditional MoEs trained from scratch; they simply combine already fine-tuned LLMs as experts. Their creation was made easy with mergekit. For instance, the Phixtral LLMs were made with mergekit by combining several Phi-2 models.
In this tutorial, we will build our own MoE using the mergekit library.
What is a Mixture of Experts (MoE)?
The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
Mixture of Experts enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.
So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements:
Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
A gate network or router, which determines which tokens are sent to which expert. For example, the token “More” might be sent to the second expert, and the token “Parameters” to the first. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network.
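To make the routing idea concrete, here is a toy sketch of a sparse MoE layer in PyTorch (illustrative only, not the actual Mixtral or mergekit implementation): a learned linear router scores the experts for each token, the top-2 experts are selected, and their outputs are combined with the softmaxed router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=16, num_experts=4, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network (FFN), as in a transformer MoE layer.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # The router (gate) is a learned linear layer scoring every expert for each token.
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, dim)
        logits = self.router(x)                          # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # pick the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

tokens = torch.randn(5, 16)
print(ToyMoE()(tokens).shape)  # torch.Size([5, 16])
```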
What is Mergekit?
Mergekit is a free GitHub project that aims to create merges of pre-trained models that “can be run entirely on CPU or accelerated with as little as 8 GB of VRAM. Many merging algorithms are supported.”
Features:
Supports Llama, Mistral, GPT-NeoX, StableLM, and more
Many merge methods
GPU or CPU execution
Lazy loading of tensors for low memory use
Interpolated gradients for parameter values (inspired by Gryphe’s BlockMerge_Gradient script)
Piecewise assembly of language models from layers (“Frankenmerging”)
There is an automated notebook to easily run mergekit: 🥱 LazyMergekit. But here we will execute the code using the paid GPU instances available on RunPod.
Here we will leverage the mergekit-moe configuration to create our own MoE model.
What is mergekit-moe?
mergekit-moe is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script combines the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models. mergekit-moe uses its own YAML configuration syntax, which looks like the example below:
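Here is a minimal illustration of that syntax; the model names and prompts below are placeholders, not recommendations, and our actual configuration is shown in the Code Implementation section.

```yaml
base_model: org/base-model-name        # supplies self-attention and layer norm weights (placeholder)
gate_mode: hidden                      # "hidden", "cheap_embed", or "random" (see Gate Modes below)
dtype: bfloat16                        # output dtype
experts:
  - source_model: org/expert-model-1   # placeholder expert
    positive_prompts:
      - "example prompt that should be routed to this expert"
  - source_model: org/expert-model-2   # placeholder expert
    positive_prompts:
      - "another example prompt for the second expert"
    # negative_prompts can optionally be added to steer tokens away from an expert
```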
We can define the prompts that will help to activate the right expert. In the configuration (above), positive_prompts is a list of prompt examples for which we would like the router network to select the corresponding expert. At inference time, when the user enters prompts semantically close to the positive_prompts, the model’s router network will activate the right expert model.
Gate Modes:
Three methods for populating the MoE gates are implemented:
“hidden”: Uses the hidden state representations of the positive/negative prompts for the MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model, so you might not be able to use this on constrained hardware (depending on the model). You can use --load-in-8bit or --load-in-4bit to reduce VRAM usage.
“cheap_embed”: Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than “hidden”. Can be run on much, much lower-end hardware.
“random”: Randomly initializes the MoE gates. Good if you are going to fine-tune the model afterwards, or maybe if you want something a little unhinged? I won’t judge.
Here we have arbitrarily chosen four Mistral-7B models:
teknium/OpenHermes-2.5-Mistral-7B: This model has been fine-tuned for code generation.
mistralai/Mistral-7B-Instruct-v0.2: The instruct version fine-tuned by Mistral AI.
meta-math/MetaMath-Mistral-7B: This model has been fine-tuned for math.
HuggingFaceH4/zephyr-7b-beta: Another instruct version trained with DPO on UltraFeedback.
We will implement the code on RunPod.
What is RunPod?
RunPod is a cloud computing platform, primarily designed for AI and machine learning applications. RunPod’s key offerings include Pods, Serverless compute, and AI APIs.
Code Implementation
Install required Dependencies
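A sketch of the setup on a fresh RunPod instance; the exact mergekit install source may differ depending on the branch or release you use, so treat the package list as an assumption.

```python
# Install the dependencies from Python (equivalent to running pip in a terminal).
import subprocess
import sys

def pip_install(*packages):
    subprocess.run([sys.executable, "-m", "pip", "install", *packages], check=True)

pip_install("git+https://github.com/cg123/mergekit.git")    # provides the mergekit-moe CLI
pip_install("transformers", "accelerate", "sentencepiece")  # inference dependencies
pip_install("huggingface_hub")                              # pushing the merged model to the Hub
```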
Prepare the config.yaml file
This is based on the mergekit-moe YAML configuration syntax described above.
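A sketch of how the configuration could be written to config.yaml from Python, using the four experts chosen above. We assume mistralai/Mistral-7B-Instruct-v0.2 as the base model, and the positive_prompts are illustrative assumptions; tune them to the behavior you want from each expert.

```python
# Write the mergekit-moe configuration to config.yaml.
# The positive_prompts below are illustrative; adapt them to your use case.
yaml_config = """
base_model: mistralai/Mistral-7B-Instruct-v0.2
gate_mode: hidden        # use hidden-state representations of the prompts for the gates
dtype: bfloat16
experts:
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
      - "programming"
  - source_model: mistralai/Mistral-7B-Instruct-v0.2
    positive_prompts:
      - "chat"
      - "assistant"
      - "tell me"
  - source_model: meta-math/MetaMath-Mistral-7B
    positive_prompts:
      - "math"
      - "solve"
      - "reason about this problem step by step"
  - source_model: HuggingFaceH4/zephyr-7b-beta
    positive_prompts:
      - "explain"
      - "summarize"
      - "write an essay"
"""

with open("config.yaml", "w", encoding="utf-8") as f:
    f.write(yaml_config)
```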
We then run the merge command with the following parameters:
--copy-tokenizer to copy the tokenizer from the base model
--allow-crimes and --out-shard-size to chunk the models into smaller shards that can be computed on a CPU with low RAM
--lazy-unpickle to enable the experimental lazy unpickler for lower memory usage
In addition, some models can require the --trust_remote_code flag (this is not the case with Mistral-7B).
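Putting it together, a sketch of the merge invocation: it assumes the configuration was saved as config.yaml, writes the result to a merge directory, and uses an out-shard size of 1B, which is an arbitrary choice.

```python
# Run the mergekit-moe CLI (equivalent to invoking it from a terminal).
import subprocess

subprocess.run(
    [
        "mergekit-moe",
        "config.yaml",             # the configuration prepared above
        "merge",                   # output directory for the merged model
        "--copy-tokenizer",        # copy the tokenizer from the base model
        "--allow-crimes",          # together with --out-shard-size, allows small shards
        "--out-shard-size", "1B",  # shard size so the merge fits on a low-RAM CPU machine
        "--lazy-unpickle",         # experimental lazy unpickler for lower memory usage
    ],
    check=True,
)
```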
This command will download the weights of all the models listed in the merge configuration and run the selected merge method. The merge itself only requires a CPU, but note that you will need a lot of disk space, since we have to download all the experts.
The model is now merged and saved in the `merge` directory.
Test inference and push the model to the HF Hub
Load the merged model
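A sketch of loading the merged model from the merge directory; the dtype and device settings are assumptions, adjust them to your GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merge_dir = "merge"  # output directory of the mergekit-moe run above

tokenizer = AutoTokenizer.from_pretrained(merge_dir)
model = AutoModelForCausalLM.from_pretrained(
    merge_dir,
    torch_dtype=torch.float16,  # assumption: half precision to fit on a single GPU
    device_map="auto",          # requires accelerate
)
```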
Log in to the Hugging Face Hub
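A sketch of logging in with an access token; it assumes a write-scoped Hugging Face token stored in the HF_TOKEN environment variable.

```python
import os
from huggingface_hub import login

# Assumes a write-scoped token is available in the HF_TOKEN environment variable.
login(token=os.environ["HF_TOKEN"])
```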
Push the model to the Hugging Face Hub
Specify the <repoid/merged-model-name>.
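A sketch of pushing the merged model and tokenizer; replace the repository id placeholder with your own <repoid/merged-model-name>.

```python
# Replace with your own repository id, i.e. "<repoid/merged-model-name>".
repo_id = "<repoid/merged-model-name>"

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```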
Test the Merged Model
Helper Function to generate response
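A sketch of a helper that wraps tokenization and generation; the generation settings are assumptions, and if your base model has a chat template you should apply it to the prompt.

```python
import torch  # model and tokenizer are the ones loaded above

def generate_response(prompt, max_new_tokens=256):
    # Tokenize the prompt and move it to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Return only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Illustrative usage (this query is an example, not one of the article's original prompts):
print(generate_response("Write a Python function that checks whether a number is prime."))
```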
Ask Query 1
Ask Query 2
Ask Query 3
Ask Query 4
Ask Query 5
AutoEvaluate the merged model
Model evaluation is an essential aspect of developing and refining models. Various evaluation frameworks and benchmarks have been designed to assess the different capabilities of these models.
To auto-evaluate the merged model, we will leverage the Colab notebook provided by 🧐 LLM AutoEval.
We need to provide tokens for RunPod and GitHub.
You need to prepay your RunPod account before you can kick off a run. Without prepaying, your workload terminates after a few seconds without warning or message.
AutoEval uses RunPod to execute the model evaluation. Upon clicking the “Run” button in the AutoEval notebook, you will be notified that your pod has started at https://www.runpod.io/console/pods, where you can see your running pod instance.
Under the hood, AutoEval performs three steps:
Automated setup and execution using RunPod.
Customizable evaluation parameters for tailored benchmarking.
Generation of a summary and upload to GitHub Gist for easy sharing and reference.
AutoEval uses the nous benchmark suite, which contains the following tasks:
AGIEval: a human-centric benchmark designed to evaluate foundation models’ general abilities in tasks pertinent to human cognition and problem-solving. AGIEval v1.0 contains 20 tasks, including two cloze tasks (Gaokao-Math-Cloze and MATH) and 18 multi-choice question answering tasks.
GPT4ALL: a benchmark suite for evaluating the factual language understanding of LLMs. It consists of various tasks designed to assess an LLM’s ability to understand and respond to factual queries. Some of the tasks in GPT4All include open-ended question answering, closed-ended question-answering, text summarization, and natural language inference.
TruthfulQA: a benchmark suite for evaluating the truthfulness of LLMs. It consists of various tasks designed to assess an LLM’s ability to distinguish between true and false statements. Some of the tasks in TruthfulQA include multiple-choice questions and textual entailment.
BigBench: an extensive benchmark suite that aims to evaluate and measure the capabilities of models across a wide range of tasks. It includes tests for reasoning, language understanding, problem-solving, and more. The idea behind BigBench is to provide a comprehensive and challenging set of tasks that can reveal the strengths and weaknesses of AI models in various domains.
On executing the auto-evaluation script, the message below will be displayed:
Pod started: https://www.runpod.io/console/pods
Log in to your RunPod account and you can see the pod running.
Click on the logs to see the details of the execution.
The entire process took 1 hr 53 seconds to complete, and the following results were written to the GitHub repository as a .md file.
Comparison with the expert1 model alone
Conclusion
Creating Mixture of Experts (MoEs) is now both easy and affordable. Although mergekit currently supports only a few model types, its popularity suggests that more will be added soon.
In this article, we combined different models and used the resulting model for making predictions. While our new MoE gives good results, it’s important to note that we haven’t fine-tuned it yet. To make the results even better, fine-tuning it with QLoRA would be a good idea.
References
GitHub - cg123/mergekit: Tools for merging pretrained large language models.