Create Your Own Mixture of Experts Model with Mergekit and Runpod

Jan 26, 2024
Since the release of Mixtral-8x7B by Mistral AI, there has been renewed interest in mixture of experts (MoE) models. This architecture uses several expert sub-networks, of which only a few are selected and activated by a router network during inference.

Model merging is a technique that combines two or more LLMs into a single model. It is a relatively new and experimental way to create new models cheaply (no GPU required). Model merging works surprisingly well and has produced many state-of-the-art models on the Open LLM Leaderboard.

The MoE architecture is so simple and flexible that it is easy to build a custom MoE. On the Hugging Face Hub, we can now find several trending LLMs that are custom MoEs, such as mlabonne/phixtral-4x2_8.

Model architecture of mlabonne/phixtral-4x2_8:

PhiForCausalLM(
  (transformer): PhiModel(
    (embd): Embedding(
      (wte): Embedding(51200, 2560)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (h): ModuleList(
      (0-31): 32 x ParallelBlock(
        (ln): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
        (mixer): MHA(
          (rotary_emb): RotaryEmbedding()
          (Wqkv): Linear4bit(in_features=2560, out_features=7680, bias=True)
          (out_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (inner_attn): SelfAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
          (inner_cross_attn): CrossAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
        )
        (moe): MoE(
          (mlp): ModuleList(
            (0-3): 4 x MLP(
              (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
              (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
              (act): NewGELUActivation()
            )
          )
          (gate): Linear4bit(in_features=2560, out_features=4, bias=False)
        )
      )
    )
  )
  (lm_head): CausalLMHead(
    (ln): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
    (linear): Linear(in_features=2560, out_features=51200, bias=True)
  )
  (loss): CausalLMLoss(
    (loss_fct): CrossEntropyLoss()
  )
)

From the above architecture, we can see an MoE module with four MLPs, i.e., four expert sub-networks. Only the MLP modules are specific to each expert.
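
A module dump like the one above can be reproduced simply by loading the model and printing it. The sketch below is one way to do this; the Linear4bit layers in the dump come from loading the model in 4-bit, which assumes a GPU with bitsandbytes and accelerate installed, and note that phixtral ships custom modeling code:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Minimal sketch: load phixtral in 4-bit and print its module tree.
# Assumes a GPU plus bitsandbytes/accelerate, and that you accept running
# the repository's custom modeling code (trust_remote_code=True).
model = AutoModelForCausalLM.from_pretrained(
    "mlabonne/phixtral-4x2_8",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)
print(model)  # prints an architecture dump like the one above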

However, most of these models are not traditional MoEs trained from scratch; they simply use a combination of already fine-tuned LLMs as experts. Their creation was made easy with mergekit. For instance, the Phixtral LLMs were made with mergekit by combining several Phi-2 models.

In this tutorial, we will implement it using the mergekit library.

What is a Mixture of Experts (MoE)?

The scale of a model is one of the most important factors for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.

Mixture of Experts enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size on the same compute budget as a dense model. In particular, an MoE model should reach the same quality as its dense counterpart much faster during pretraining.

So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements:

  • Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!

  • A gate network or router, which determines which tokens are sent to which expert. For example, the token “More” might be sent to the second expert while the token “Parameters” is sent to the first. As we’ll explore later, a token can be sent to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network. A minimal code sketch of such a sparse MoE layer follows this list.
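
Here is a minimal, self-contained sketch of a sparse MoE feed-forward layer with top-2 routing. The class name, dimensions, and expert structure are illustrative (loosely modeled on the phixtral dump above), not taken from any particular library:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative sparse MoE layer: a router picks top_k experts per token."""

    def __init__(self, d_model=2560, d_ff=10240, num_experts=4, top_k=2):
        super().__init__()
        # Each expert is an ordinary FFN; only these weights differ per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router ("gate") is a small learned linear layer.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                            # x: (num_tokens, d_model)
        scores = self.gate(x)                        # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # normalize over selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Example: route 16 token representations through a small instance of the layer.
y = SparseMoE(d_model=64, d_ff=256)(torch.randn(16, 64))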

What is Mergekit?

Mergekit is a free GitHub project that aims to create merges of pre-trained models that “can be run entirely on CPU or accelerated with as little as 8 GB of VRAM. Many merging algorithms are supported.”

Features:

  • Supports Llama, Mistral, GPT-NeoX, StableLM, and more

  • Many merge methods

  • GPU or CPU execution

  • Lazy loading of tensors for low memory use

  • Interpolated gradients for parameter values (inspired by Gryphe’s BlockMerge_Gradient script)

  • Piecewise assembly of language models from layers (“Frankenmerging”)

There is an automated notebook to easily run mergekit: 🥱 LazyMergekit. But here we will execute the code on the paid GPU instances available on RunPod.

Here we will leverage the mergekit-moe configuration to create our own MoE model.

What is mergekit-moe?

mergekit-moe is a script for combining Mistral or Llama models of the same size into Mixtral-style Mixture of Experts models. The script combines the self-attention and layer-normalization parameters from a "base" model with the MLP parameters from a set of "expert" models. mergekit-moe uses its own YAML configuration syntax, which looks like this:

base_model: path/to/self_attn_donor
gate_mode: hidden # one of "hidden", "cheap_embed", or "random"
dtype: bfloat16 # output dtype (float32, float16, or bfloat16)
## (optional)
# experts_per_token: 2
experts:
  - source_model: expert_model_1
    positive_prompts:
      - "This is a prompt that is demonstrative of what expert_model_1 excels at"
    ## (optional)
    # negative_prompts:
    #   - "This is a prompt expert_model_1 should not be used for"
  - source_model: expert_model_2
  # ... and so on

We can define the prompts that will help activate the right expert. In the configuration above, positive_prompts is a list of example prompts for which we would like the router network to select the corresponding expert. At inference time, when the user enters a prompt semantically close to the positive_prompts, the model’s router network will activate the right expert.

Gate Modes:

Three methods for populating the MoE gates are implemented:

“hidden”

Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model so you might not be able to use this on constrained hardware (depending on the model). You can use --load-in-8bit or --load-in-4bit to reduce VRAM usage.

“cheap_embed”

Uses only the raw token embeddings of the prompts, with the same gate parameters for every layer. Distinctly less effective than “hidden”, but can be run on much lower-end hardware.

“random”

Randomly initializes the MoE gates. Good if you are going to fine-tune the model afterwards, or maybe if you want something a little unhinged? I won’t judge.
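
To give a rough intuition for the difference between “hidden” and “cheap_embed”, the sketch below derives a single gate vector for one expert from its positive prompts. This is only a conceptual approximation: mergekit-moe's actual implementation builds per-layer gate parameters, and the model name, pooling choices, and function names here are assumptions made for illustration.

import torch
from transformers import AutoModel, AutoTokenizer

base_id = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(base_id)

def gate_vector(prompts, mode="cheap_embed"):
    # Conceptual only: mergekit-moe builds per-layer gate rows; this sketch
    # pools one vector per expert just to contrast the two modes.
    # (Loading the full model is only for brevity; "cheap_embed" really needs
    # just the embedding table.)
    model = AutoModel.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
    vecs = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(model.device)
            if mode == "cheap_embed":
                # Raw token embeddings, no pass through the transformer blocks.
                hidden = model.get_input_embeddings()(ids.input_ids)
            else:  # "hidden": full forward pass through the base model (costly)
                hidden = model(**ids).last_hidden_state
            vecs.append(hidden.mean(dim=1))          # average over tokens
    return torch.cat(vecs).mean(dim=0)               # average over prompts

math_gate = gate_vector(["You are an assistant good at math."])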

Here we have arbitrarily chosen four Mistral-7B-based models as experts.

We will implement the code on RunPod.

What is RunPod?

RunPod is a cloud computing platform, primarily designed for AI and machine learning applications. RunPod’s key offerings include Pods, Serverless compute, and AI APIs.

Code Implementation

Install the required dependencies

# Clone the mixtral branch of mergekit and install it in editable mode
!git clone -b mixtral https://github.com/cg123/mergekit.git
!cd mergekit && pip install -qqq -e . --progress-bar off
# Libraries needed for merging, quantized loading, and inference
!pip install -qqq -U transformers --progress-bar off
!pip install bitsandbytes accelerate

Prepare the config.yaml file

This follows the mergekit-moe YAML configuration syntax:

merge_config = """
base_model: mistralai/Mistral-7B-Instruct-v0.2
dtype: float16
gate_mode: cheap_embed
experts:
  - source_model: HuggingFaceH4/zephyr-7b-beta
    positive_prompts: ["You are an helpful general-pupose assistant."]
  - source_model: mistralai/Mistral-7B-Instruct-v0.2
    positive_prompts: ["You are helpful assistant."]
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts: ["You are helpful a coding assistant."]
  - source_model: meta-math/MetaMath-Mistral-7B
    positive_prompts: ["You are an assistant good at math."]
"""

with open('config.yaml', 'w') as f:
    f.write(merge_config)

We then run the merge command with the following parameters:

  • --copy-tokenizer to copy the tokenizer from the base model

  • --allow-crimes and --out-shard-size to chunk the models into smaller shards that can be computed on a CPU with low RAM

  • --lazy-unpickle to enable the experimental lazy unpickler for lower memory usage

In addition, some models can require the --trust-remote-code flag (this is not the case with Mistral-7B).

This command will download the weights of all the models listed in the merge configuration and run the selected merge method.

The merge itself only requires a CPU, but note that you will need a lot of disk space, since we have to download all the experts.

!mergekit-moe config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle --trust-remote-code

###### Sample log information ######
.......

pytorch_model-00001-of-00002.bin:  99%|████▉| 9.85G/9.94G [01:23<00:00, 131MB/s]
pytorch_model-00001-of-00002.bin: 100%|█████| 9.94G/9.94G [01:24<00:00, 118MB/s]
Fetching 9 files: 100%|███████████████████████████| 9/9 [01:24<00:00,  9.40s/it]
Warm up loaders: 100%|███████████████████████████| 5/5 [09:48<00:00, 117.62s/it]
100%|█████████████████████████████████████████████| 9/9 [01:34<00:00, 10.44s/it]
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 59150.44it/s]
expert prompts: 100%|█████████████████████████████| 4/4 [00:00<00:00,  4.47it/s]
WARNING:root:ALL layers have degenerate routing parameters - your prompts may be too similar.
WARNING:root:One or more experts will be underutilized in your model

The model is now merged and saved in the `merge` directory. The two warnings above come from the router initialization: with positive prompts as similar as ours, the gates struggle to distinguish the experts, so consider more distinctive prompts (or the “hidden” gate mode) if you want better routing.
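
As a quick optional check, you can list what the merge wrote to disk (the exact shard file names depend on the --out-shard-size used above):

import os

# Optional sanity check: the merge/ directory should now contain the sharded
# weights, tokenizer files, and config written by mergekit-moe.
print(sorted(os.listdir("merge")))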

Test inference and push the model to the HF Hub

Load the merged model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    "merge/", device_map='cpu', torch_dtype=torch.float16
)

Log in to the Hugging Face Hub

from huggingface_hub import notebook_login

notebook_login()

Push the model to the Hugging Face Hub

Specify the repository as <repo_id>/<merged-model-name>:

model.push_to_hub("Plaban81/Moe-4x7b-math-reason-code")
tokenizer.push_to_hub("Plaban81/Moe-4x7b-math-reason-code")
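
Optionally, you can also upload the merge configuration alongside the weights so the merge is reproducible. The sketch below uses huggingface_hub; the target file name mirrors the one used in the phixtral repository and can be anything you like:

from huggingface_hub import HfApi

# Optional: store the merge recipe next to the weights for reproducibility.
# Replace repo_id with your own <repo_id>/<merged-model-name>.
api = HfApi()
api.upload_file(
    path_or_fileobj="config.yaml",
    path_in_repo="mergekit_moe_config.yml",
    repo_id="Plaban81/Moe-4x7b-math-reason-code",
)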

Test the Merged Model

from transformers import AutoTokenizer
import transformers
import torch

model = "Plaban81/Moe-4x7b-math-reason-code" #If you want to test your own model, replace this value with the model directory path

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

Helper function to generate a response

def generate_response(query):
    # Format the query with the model's chat template
    messages = [{"role": "user", "content": query}]
    prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Sample a completion from the merged MoE
    outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    return outputs[0]['generated_text']

Ask Query 1

response = generate_resposne("How to cook pizza.Please provide a step by step procedure")
print(response)

#########Output###############
<s>[INST] How to cook pizza.Please provide a step by step procedure [/INST]Cooking a pizza at home can be a fun and delicious experience. Here's a step-by-step procedure to help you make the perfect pizza:

Ingredients:
- Pizza dough (store-bought or homemade)
- Pizza sauce (store-bought or homemade)
- Shredded mozzarella cheese
- Your choice of toppings (pepperoni, mushrooms, onions, peppers, etc.)
- Olive oil
- Cornmeal or flour for dusting

Equipment:
- Pizza stone or baking sheet
- Parchment paper
- Rolling pin
- Spatula

Instructions:

1. Preheat the oven to 450°F (232°C) with a pizza stone or baking sheet inside. If using a pizza stone, place it on the bottom rack of the oven.

2. Dust a clean work surface with flour or cornmeal. Roll out the pizza dough to your desired thickness using a rolling pin.

3. Drizzle a little olive oil on the work surface and place the pizza dough on top

Ask Query 2

response = generate_resposne("Write a python program to generate fibonnaci series.")
print(response)

############ Response #####################
<s>[INST] Write a python program to generate fibonnaci series. [/INST] Here is a simple Python program that generates the Fibonacci series using a loop:

```python
def fibonacci(n):
    a, b = 0, 1
    for i in range(n):
        yield a
        a, b = b, a + b

numbers = fibonacci(10)

for number in numbers:
    print(number)
```

In this program, we define a generator function `fibonacci` that takes a single argument `n`. Inside the function, we initialize two variables `a` and `b` to 0 and 1, respectively. We then use a loop to generate the Fibonacci series up to the `n`th term. In each iteration of the loop, we yield the current value of `a` (which is the `i`th term of the series) and update `a` and `b` to prepare for the next iteration.

To test the function, we call it with an argument of 10, which generates the first 10 terms of the Fibonacci series. We then use a `for` loop to print each term

Ask Query 3

response = generate_resposne("""Select the prime number from the list of numbers provided below.
[1,3,5,6,123,2000]""")
print(response)

###############Response ####################
<s>[INST] Select the prime number from the list of numbers provided below.
[1,3,5,6,123,2000] [/INST] 
The prime numbers from the given list are 1, 3, and 5. However, 1 is not considered a prime number in mathematics as it is not greater than 1. Therefore, the only prime number from the list is 3. The other numbers, 6, 123, and 2000, are not prime numbers

Ask Query 4

response = generate_resposne("""Write an essay on Indusriatlization within 500 words.""")
print(response)

######## Response #####################
<s>[INST] Write an essay on Indusriatlization within 500 words. [/INST] Indusriatlization refers to the process of transforming a traditional agricultural economy into a modern industrial one. This transition, which began in the 19th century, has had a profound impact on the socio-economic development of India. In this essay, we will examine the historical context of Indusriatlization, its key drivers, and its effects on various sectors of the Indian economy.

Historical Context

The British colonization of India in the 19th century laid the foundation for Indusriatlization in the country. The British, who were interested in India's vast market and resources, introduced modern industries such as textiles, iron and steel, and coal mining. They also established modern institutions such as technical schools, research institutes, and factories. The British also implemented policies such as the Doctrine of Lapse, which enabled them to annex Indian princely states, and the Indian Penal Code, which provided a legal framework for contractual relationships.

Key Drivers

There were several key drivers of Indusriatlization in India. The first was the availability of capital. The British capital, which was invested in India, played a crucial

Ask Query 5

response = generate_resposne("""what is square root of 81 + square root of 9""")
print(response)

############ Response####################
<s>[INST] what is square root of 81 + square root of 9 [/INST] 
The square root of 81 is 9 (since 9^2 = 81), and the square root of 9 is 3 (since 3^2 = 9). 
Therefore, the expression simplifies to 9 + 3, or 12

AutoEvaluate the merged model

Model evaluation is an essential aspect of developing and refining models. Various evaluation frameworks and benchmarks have been designed to assess the different capabilities of these models.

In order to auto-evaluate the merged model, we will leverage the 🧐 LLM AutoEval Colab notebook.

We need to provide tokens for RunPod and GitHub.

You need to prepay your RunPod account before you can kick off a run. Without prepaying, your workload terminates after a few seconds without any warning or message.

AutoEval uses RunPod to execute the model evaluation. Upon clicking the “Run” button in the AutoEval notebook, you will be pointed to https://www.runpod.io/console/pods, where you will see your running pod instance.

Under the hood, AutoEval performs three steps:

  • Automated setup and execution using RunPod.

  • Customizable evaluation parameters for tailored benchmarking.

  • Summary generation and upload to GitHub Gist for easy sharing and reference.

AutoEval uses the nous benchmark suite, which contains the following list of tasks:

  1. AGIEval: a human-centric benchmark designed to evaluate foundation models’ general abilities in tasks pertinent to human cognition and problem-solving. AGIEval v1.0 contains 20 tasks, including two cloze tasks (Gaokao-Math-Cloze and MATH) and 18 multi-choice question answering tasks.

  2. GPT4ALL: a benchmark suite for evaluating the factual language understanding of LLMs. It consists of various tasks designed to assess an LLM’s ability to understand and respond to factual queries. Some of the tasks in GPT4All include open-ended question answering, closed-ended question-answering, text summarization, and natural language inference.

  3. TruthfulQA: a benchmark suite for evaluating the truthfulness of LLMs. It consists of various tasks designed to assess an LLM’s ability to distinguish between true and false statements. Some of the tasks in TruthfulQA include multiple-choice questions and textual entailment.

  4. BigBench: an extensive benchmark suite that aims to evaluate and measure the capabilities of models across a wide range of tasks. It includes tests for reasoning, language understanding, problem-solving, and more. The idea behind BigBench is to provide a comprehensive and challenging set of tasks that can reveal the strengths and weaknesses of AI models in various domains.

On executing the auto-evaluation script, the message below will be displayed:

Pod started: https://www.runpod.io/console/pods

Log in to your RunPod account and you will see the pod running.

Click on the logs to see the details of the execution.

The entire process took 1 hr 53 seconds to complete, and the results were written to the GitHub repository as a .md file.

Comparison with the expert1 model alone

Conclusion

Creating Mixture of Experts (MoEs) is now both easy and affordable. Although mergekit currently supports only a few model types, its popularity suggests that more will be added soon.

In this article, we combined different models and used the resulting model for making predictions. While our new MoE gives good results, it’s important to note that we haven’t fine-tuned it yet. To make the results even better, fine-tuning it with QLoRA would be a good idea.
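
As a pointer in that direction, here is a minimal sketch of preparing the merged model for QLoRA fine-tuning with peft and bitsandbytes. The hyperparameters and target module names are illustrative assumptions, not tuned values:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the merged MoE in 4-bit (QLoRA-style) and attach LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Plaban81/Moe-4x7b-math-reason-code",   # or the local merge/ directory
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable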

References

GitHub - cg123/mergekit: Tools for merging pretrained large language models.

mergekit_moe_config.yml · mlabonne/phixtral-4x2_8 at main

Mixture of Experts Explained