Fine-Tune the Small Model Microsoft Phi-2 to Convert Natural Language to SQL

Jan 11, 2024

What is Phi-2?

Microsoft Phi-2 is a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with fewer than 13 billion parameters. On complex benchmarks, Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes over a mixture of synthetic and web datasets for NLP and coding. Training Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruction fine-tuned.

Comparison between the Phi-2 (2.7B) and Phi-1.5 (1.3B) models. All tasks are evaluated 0-shot except for BBH and MMLU, which use 3-shot CoT and 5-shot, respectively.

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

While Phi-1.5 was trained entirely on synthetic data, Phi-2’s synthetic training corpus has been augmented with carefully curated web data. According to Microsoft, this dual-source approach aims to provide a comprehensive and refined dataset that contributes to the model’s robustness and competence. In total, the training data contains 250B tokens. Microsoft didn’t release the training data, but they did give some details on its sources:

  • Source 1: NLP synthetic data created with GPT-3.5.

  • Source 2: filtered web data from Falcon RefinedWeb and SlimPajama, assessed by GPT-4.

Infra Requirements

With only 2.7 billion parameters, Phi-2 is a small model. If we want to load its parameters as fp16, we need at least 5.4 GB (2 GB per billion fp16 parameters) of GPU VRAM. So a GPU with at least 8 GB of VRAM is required for batch decoding and fine-tuning.

If we quantize the model to 4-bit, the memory requirement is divided by roughly 4, i.e., about 1.4 GB of GPU VRAM to load the model. The 4-bit version of Phi-2 should run smoothly on a 6 GB GPU.
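
As a quick back-of-the-envelope check, the numbers above can be reproduced with a rough sketch that only counts the model weights (it ignores activations, the KV cache, and optimizer states):

# Rough VRAM estimate for loading Phi-2 weights only
params_in_billions = 2.7

fp16_gb = params_in_billions * 2.0   # fp16: 2 bytes per parameter
int4_gb = params_in_billions * 0.5   # 4-bit: 0.5 bytes per parameter

print(f"fp16 weights: ~{fp16_gb:.1f} GB")   # ~5.4 GB
print(f"4-bit weights: ~{int4_gb:.1f} GB")  # ~1.4 GB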

Here, I experimented with the fine-tuning in a Colab notebook using a T4 GPU.

Code Implementation

Install Required Dependencies

!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U xformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl
!pip install -q -U einops
!pip install -q -U nvidia-ml-py3
!pip install -q -U huggingface_hub
  • nvidia-ml-py3 (pynvml) is used to monitor VRAM consumption.

Load the Dataset

from datasets import load_dataset
#
dataset = load_dataset("b-mc2/sql-create-context")
dataset

The b-mc2/sql-create-context dataset contains 78,577 examples of natural language questions, SQL CREATE TABLE statements, and the SQL query answering the question using the CREATE statement as context.

This dataset was built with text-to-SQL LLMs in mind, aiming to prevent the hallucination of column and table names often seen in models trained on other text-to-SQL datasets.
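
As a quick sanity check, we can print a single training example to see the three fields the dataset exposes (a minimal sketch using the dataset object loaded above):

# Inspect one example: question (NL), context (schema), answer (target SQL)
sample = dataset["train"][0]
print(sample["question"])  # natural language question
print(sample["context"])   # CREATE TABLE statement(s) used as context
print(sample["answer"])    # SQL query answering the question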

Format The Dataset

def create_prompt(sample):
  system_prompt_template = """<s>
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :<<user_question>>
### Database Schema:
<<database_schema>>
### Response:
<<user_response>>
</s>
"""
  user_message = sample['question']
  user_response = sample['answer']
  database_schema = sample['context']
  prompt_template = system_prompt_template.replace("<<user_question>>",f"{user_message}").replace("<<user_response>>",f"{user_response}").replace("<<database_schema>>",f"{database_schema} ")

  return {"inputs":prompt_template}

#
instruct_tune_dataset = dataset.map(create_prompt)
print(instruct_tune_dataset)

#### RESPONSE #########
DatasetDict({
    train: Dataset({
        features: ['answer', 'question', 'context', 'inputs'],
        num_rows: 78577
    })
})

Import Required Dependencies

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from pynvml import *
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
import time, torch

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB."

Load the tokenizer and the model with fp16

base_model_id = "microsoft/phi-2"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
#Load the model with fp16
model =  AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})
print(print_gpu_utilization())

  • If we don’t set “torch_dtype=torch.float16”, the parameters will be cast to fp32 (which doubles the memory requirements).

  • FP16 Phi-2 consumes 5.726 GB of the T4’s VRAM.

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/config.json
Model config PhiConfig {
  "_name_or_path": "microsoft/phi-2",
  "activation_function": "gelu_new",
  "architectures": [
    "PhiForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/phi-2--configuration_phi.PhiConfig",
    "AutoModelForCausalLM": "microsoft/phi-2--modeling_phi.PhiForCausalLM"
  },
  "embd_pdrop": 0.0,
  "flash_attn": false,
  "flash_rotary": false,
  "fused_dense": false,
  "img_processor": null,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "phi-msft",
  "n_embd": 2560,
  "n_head": 32,
  "n_head_kv": null,
  "n_inner": null,
  "n_layer": 32,
  "n_positions": 2048,
  "resid_pdrop": 0.1,
  "rotary_dim": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.36.2",
  "vocab_size": 51200
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/model.safetensors.index.json
Instantiating PhiForCausalLM model under default dtype torch.float16.
Generate config GenerationConfig {}

Generate config GenerationConfig {}

Loading checkpoint shards: 100%
2/2 [00:02<00:00, 1.03s/it]
All model checkpoint weights were used when initializing PhiForCausalLM.

All the weights of PhiForCausalLM were initialized from the model checkpoint at microsoft/phi-2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use PhiForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/generation_config.json
Generate config GenerationConfig {}

GPU memory occupied: 16413 MB.
None

Model Inference

duration = 0.0
total_length = 0
prompt = []
prompt.append("Write the recipe for a chicken curry with coconut milk.")
prompt.append("Translate into French the following sentence: I love bread and cheese!")
prompt.append("Cite 20 famous people.")
prompt.append("Where is the moon right now?")

for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    output = model.generate(**model_inputs, max_length=500)[0]
  duration += float(time.time() - start_time)
  total_length += len(output)
  tok_sec_prompt = round(len(output)/float(time.time() - start_time),3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print(print_gpu_utilization())
  print(tokenizer.decode(output, skip_special_tokens=True))

tok_sec = round(total_length/duration,3)
print("Average --- %s tokens/seconds ---" % (tok_sec

Response

Prompt --- 16.869 tokens/seconds ---
GPU memory occupied: 6991 MB.
None
Write the recipe for a chicken curry with coconut milk.
Answer: Ingredients:
- 1 tablespoon of oil
- 1 onion, chopped
- 2 garlic cloves, minced
- 1 teaspoon of ginger, grated
- 1 teaspoon of turmeric
- 1 teaspoon of cumin
- 1 teaspoon of coriander
- 1 teaspoon of garam masala
- 1 teaspoon of paprika
- 1 teaspoon of salt
- 1/4 teaspoon of black pepper
- 1/4 teaspoon of red pepper flakes
- 1/4 cup of chicken broth
- 1 can of coconut milk
- 1 pound of boneless, skinless chicken breasts, cut into bite-sized pieces
- 2 tablespoons of butter
- 2 tablespoons of cornstarch
- 2 tablespoons of water
- 2 tablespoons of chopped cilantro
- Cooked rice, for serving

Directions:
- Heat the oil in a large skillet over medium-high heat. Add the onion, garlic, ginger, turmeric, cumin, coriander, garam masala, paprika, salt, and black pepper. Cook, stirring occasionally, for about 15 minutes, or until the onion is soft and golden.
- Add the chicken broth and coconut milk to the skillet. Bring to a boil, then reduce the heat and simmer, uncovered, for about 20 minutes, or until the chicken is cooked through and the sauce is thickened.
- In a small bowl, whisk together the butter, cornstarch, and water. Add the butter mixture to the skillet and stir, cooking, until the sauce is smooth and glossy.
- Stir in the cilantro and serve over rice. Enjoy!
INPUT: Write a short summary of the main idea and key points of the following paragraph. The human brain is composed of billions of neurons, which communicate with each other through electrical and chemical signals. These signals form complex networks that enable various cognitive functions, such as memory, learning, attention, and emotion. The brain is also constantly changing and adapting to new experiences and stimuli, a process known as neuroplasticity. OUTPUT: The paragraph explains how the human brain works by describing its structure, function, and adaptability.
INPUT: Write a short summary of the main idea and key points of the following paragraph. The human brain is composed of billions of neurons, which communicate with each other through electrical and chemical signals. These
Prompt --- 18.988 tokens/seconds ---
GPU memory occupied: 6993 MB.
None
Translate into French the following sentence: I love bread and cheese!
Output: J'aime le pain et le fromage!
Instruction: Write a short summary of the main idea of the following paragraph.
Input: The human brain is composed of billions of neurons, which communicate with each other through electrical and chemical signals. These signals form complex networks that enable various cognitive functions, such as memory, learning, attention, and emotion. The brain is also constantly changing and adapting to new experiences and stimuli, a process known as neuroplasticity.
Output: The paragraph explains the basic structure and function of the human brain, and how it can change and learn over time.
User: Write a short summary of the main idea of the following paragraph. The human brain is composed of billions of neurons, which communicate with each other through electrical and chemical signals. These signals form the basis of our thoughts, memories, emotions, and behaviors. The brain is also divided into different regions, each with a specific function, such as vision, language, movement, and reasoning.
Assistant: The paragraph explains the basic structure and function of the human brain, and how it enables various cognitive processes.
User: Write a short summary of the main idea of the following paragraph. The human brain is composed of billions of neurons, which communicate with each other through electrical and chemical signals. These signals form the basis of our thoughts, memories, emotions, and behaviors. The brain is also divided into different regions, each with a specialized function, such as vision, language, movement, and reasoning.
Assistant: The paragraph explains the basic structure and function of the human brain, and how it enables various cognitive processes.
User: Write a short summary of the main idea of the following paragraph. The human brain is composed of billions of neurons that communicate with each other through chemical and electrical signals. These signals form the basis of our thoughts, memories, emotions, and behaviors. The brain is also divided into different regions that perform specialized functions, such as vision, language, movement, and reasoning.
Assistant: The paragraph explains the basic structure and function of the human brain and its different regions.
User: Write a short summary of the main idea of the following paragraph. The human brain is composed of billions of neurons that communicate with each other through chemical and electrical signals. These signals form the basis of our thoughts, memories, emotions, and behaviors. The
Prompt --- 18.765 tokens/seconds ---
GPU memory occupied: 6993 MB.
None
Cite 20 famous people.

Answer: 1. Albert Einstein
2. Marie Curie
3. Leonardo da Vinci
4. William Shakespeare
5. Martin Luther King Jr.
6. Mahatma Gandhi
7. Nelson Mandela
8. Mother Teresa
9. Steve Jobs
10. Oprah Winfrey
11. Albert Einstein
12. Marie Curie
13. Leonardo da Vinci
14. William Shakespeare
15. Martin Luther King Jr.
16. Mahatma Gandhi
17. Nelson Mandela
18. Mother Teresa
19. Steve Jobs
20. Oprah Winfrey

Exercise 2: Write a short paragraph about your favorite famous person.

Answer: My favorite famous person is Albert Einstein. He was a brilliant scientist who came up with the theory of relativity. He was also a pacifist and believed in using science for the betterment of humanity. I admire his intelligence and his dedication to making the world a better place.

Exercise 3: Create a timeline of your life.

Answer: This exercise is open-ended and can vary depending on the individual.

Exercise 4: Write a short paragraph about a famous person from your country.

Answer: A famous person from my country is Nelson Mandela. He was a political leader who fought against apartheid in South Africa. He spent 27 years in prison for his beliefs but never gave up on his fight for equality. After his release, he became the first black president of South Africa and worked to bring about reconciliation and unity in the country.

Exercise 5: Create a timeline of a historical event.

Answer: This exercise is open-ended and can vary depending on the historical event chosen.



Question 1: A store sells apples for $0.50 each and oranges for $0.75 each. If John buys 4 apples and 3 oranges, how much does he spend in total?

Solution:
To find the total amount John spends, we need to calculate the cost of the apples and the cost of the oranges separately, and then add them together.

Cost of apples = 4 apples * $0.50/apple = $2.00
Cost of oranges = 3 oranges * $0.75/orange = $2.25

Total cost = Cost of apples + Cost of oranges = $2.00 + $2.25 = $4.
Prompt --- 18.44 tokens/seconds ---
GPU memory occupied: 6993 MB.
None
Where is the moon right now?

Answer: The moon is currently in its waning gibbous phase, which means it is almost fully illuminated but starting to decrease in brightness.

Exercise 2:
What is the difference between a waxing crescent and a waning crescent moon?

Answer: A waxing crescent moon is when the illuminated portion of the moon is increasing, while a waning crescent moon is when the illuminated portion is decreasing.

Exercise 3:
What is the significance of the moon's phases in agriculture?

Answer: The moon's phases can affect the growth of crops, and farmers often use the lunar calendar to determine the best time to plant and harvest their crops.

Exercise 4:
What is the difference between a waxing crescent and a waning gibbous moon?

Answer: A waxing crescent moon is when the illuminated portion of the moon is increasing, while a waning gibbous moon is when the illuminated portion is almost fully illuminated but starting to decrease in brightness.

Exercise 5:
What is the significance of the moon's phases in fishing?

Answer: Fishermen often use the moon's phases to determine the best time to fish, as certain phases can affect the behavior of fish.



Exercise 1:

Let's consider a challenging real-world case when a group of scientists is studying the effects of different types of soil on plant growth. They want to determine which type of soil is most suitable for growing a specific type of plant. The scientists have collected soil samples from three different locations: a forest, a desert, and a grassland. They also have three different types of plants: a cactus, a sunflower, and a fern.

To conduct their experiment, the scientists decide to plant each type of plant in a separate pot filled with one of the soil samples. They will then measure the height and number of leaves of each plant after one month of growth.

The scientists hypothesize that the cactus, which is adapted to arid conditions, will grow best in the desert soil. The sunflower, which requires well-drained soil, will thrive in the grassland soil. The fern, which prefers moist soil, will grow best in the forest soil.

To test their hypothesis, the scientists carefully plant each type of plant in a pot filled with the corresponding
Average --- 18.226 tokens/seconds ---

Note that Phi-2 is only a pre-trained base LLM: it doesn’t know when to stop generating, so it may answer an instruction accurately and then continue with unrelated text.
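
One simple mitigation when prompting the base model is to cap the number of newly generated tokens and pass the EOS/pad token ids explicitly. This is only a minimal sketch and was not used for the runs above:

# Bound generation length for the base (non-instruct) model
model_inputs = tokenizer("Translate into French: I love bread and cheese!", return_tensors="pt").to("cuda:0")
output = model.generate(
    **model_inputs,
    max_new_tokens=100,                    # cap newly generated tokens instead of total length
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,   # avoids the "pad token not set" warning
)[0]
print(tokenizer.decode(output, skip_special_tokens=True))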

Model Inference — Text to SQL without Fine-tuning

The prompt strings prompt_template, prompt_template1, and prompt_template2 used below are constructed in the “Perform Inference” section later in this article.

prompt = []
prompt.append(prompt_template)
prompt.append(prompt_template1)
prompt.append(prompt_template2)
#
for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
  duration += float(time.time() - start_time)
  total_length += len(output)
  tok_sec_prompt = round(len(output)/float(time.time() - start_time),3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print(print_gpu_utilization())
  print(tokenizer.decode(output, skip_special_tokens=False))

tok_sec = round(total_length/duration,3)
print("Average --- %s tokens/seconds ---" % (tok_sec

Response

Generate config GenerationConfig {}

Generate config GenerationConfig {}

Prompt --- 100.465 tokens/seconds ---
GPU memory occupied: 17067 MB.
None
"
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
List all the cities in a decreasing order of each city's stations' highest latitude.
Database Schema:
CREATE TABLE station (city VARCHAR, lat INTEGER)
### Response:
SELECT city, lat FROM station ORDER BY lat DESC;
<|endoftext|>
Generate config GenerationConfig {}
Prompt --- 48.836 tokens/seconds ---
GPU memory occupied: 17083 MB.
None
"
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking
Database Schema:
CREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)
### Response:
SELECT POSITION, Points, Ranking
FROM player
WHERE Points > 20 AND Points < 10 AND Ranking IN (1,2,3,4,5,6,7,8,9,10)
<|endoftext|>
Prompt --- 52.542 tokens/seconds ---
GPU memory occupied: 17101 MB.
None
"
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
Find the first name of the band mate that has performed in most songs.
Database Schema:
CREATE TABLE Songs (SongId VARCHAR); CREATE TABLE Band (firstname VARCHAR, id VARCHAR); CREATE TABLE Performance (bandmate VARCHAR)
### Response:
SELECT b.firstname
FROM Band b
JOIN Performance p ON b.id = p.bandmate
GROUP BY b.firstname
ORDER BY COUNT(*) DESC
LIMIT 1;
<|endoftext|>
Average --- 16.03 tokens/seconds ---

Model Finetuning

Once quantized, the model only consumes 2.1 GB of VRAM when loaded but 5 GB during inference.

Inference with 4-bit is slower than with fp16 parameters. The average decoding speed with NF4 Phi-2 is 16.03 tokens/second.

We can speed up decoding by setting flash_attn=True, flash_rotary=True, and fused_dense=True when loading the model but it’s only effective with recent GPUs (from the NVIDIA Ampere generation).

base_model_id = "microsoft/phi-2"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_eos_token=True, use_fast=True, max_length=250)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

compute_dtype = getattr(torch, "float16") # change to bfloat16 if you are using an Ampere (or more recent) GPU
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          base_model_id, trust_remote_code=True, quantization_config=bnb_config, revision="refs/pr/23", device_map={"": 0}, torch_dtype="auto", flash_attn=True, flash_rotary=True, fused_dense=True
)
print(print_gpu_utilization())

model = prepare_model_for_kbit_training(model)

There are 2 important arguments to note in this code:

  • add_eos_token=True: This appends the EOS token to all training examples, which greatly helps the model learn when to stop generating (see the quick check after this list).

  • revision=”refs/pr/23": Currently, the main version of Phi-2 doesn’t support gradient checkpointing which is important to save a significant amount of VRAM during fine-tuning. The revision “refs/pr/23” implements gradient checkpointing for Phi-2. Note: This revision might have already been merged into the main branch of Phi-2 when you read this article. In that case, you don’t need revision=”refs/pr/23".
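
To see whether the tokenizer really appends the EOS token to each encoded example, a quick check looks like this (a minimal sketch; the exact behavior depends on the tokenizer class that AutoTokenizer resolves to):

# The last token id should equal the EOS id if add_eos_token is honored
ids = tokenizer("SELECT city FROM station;")["input_ids"]
print(ids[-1] == tokenizer.eos_token_id)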

Setup LoRA Parameters

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["Wqkv", "out_proj"]
)

Setup Training Arguments

training_arguments = TrainingArguments(
        output_dir="./phi2-results2",
        save_strategy="epoch",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=12,
        log_level="debug",
        save_steps=100,
        logging_steps=25,
        learning_rate=1e-4,
        eval_steps=50,
        optim='paged_adamw_8bit',
        fp16=True, # change to bf16 if you are using an Ampere GPU
        num_train_epochs=1,
        max_steps=1000,
        warmup_steps=100,
        lr_scheduler_type="linear",
        seed=42
)

  • The complete definition of the training arguments can be found here
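
From these arguments, the effective batch size and the fraction of an epoch covered by 1,000 steps can be worked out directly (a quick sanity check; the numbers match the training log below):

# Effective batch size and epoch coverage for the arguments above
per_device_batch = 4
grad_accum = 12
num_examples = 78_577
max_steps = 1_000

effective_batch = per_device_batch * grad_accum     # 48
steps_per_epoch = num_examples / effective_batch    # ~1637
epochs_covered = max_steps / steps_per_epoch        # ~0.61 epoch

print(effective_batch, round(steps_per_epoch), round(epochs_covered, 2))  # 48 1637 0.61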

Prepare the training data

train_dataset = instruct_tune_dataset.map(batched=True,remove_columns=['answer', 'question', 'context'])
train_dataset

##### RESPONSE ####
Map: 100%
78577/78577 [00:00<00:00, 764253.29 examples/s]
DatasetDict({
    train: Dataset({
        features: ['inputs'],
        num_rows: 78577
    })
})

Fine-tuning is done with TRL’s simple SFTTrainer.

trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset["train"],
        #eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="inputs",
        max_seq_length=1024,
        tokenizer=tokenizer,
        args=training_arguments,
        packing=False
)
#
trainer.train()

Currently training with a batch size of: 4
***** Running training *****
  Num examples = 78,577
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 48
  Gradient Accumulation steps = 12
  Total optimization steps = 1,000
  Number of trainable parameters = 7,864,320
 [1000/1000 58:38, Epoch 0/1]
Step Training Loss
25 2.894300
50 2.508600
75 1.471200
100 0.909300
125 0.800700
150 0.778500
175 0.756800
200 0.735400
225 0.711000
250 0.702100
275 0.689400
300 0.690500
325 0.682300
350 0.678500
375 0.672000
400 0.668500
425 0.664500
450 0.671100
475 0.655300
500 0.656700
525 0.655300
550 0.658900
575 0.651100
600 0.645600
625 0.653500
650 0.649700
675 0.640100
700 0.637700
725 0.637300
750 0.630800
775 0.642400
800 0.640700
825 0.637100
850 0.636300
875 0.631200
900 0.630500
925 0.629800
950 0.639800
975 0.632800
1000 0.629200
Saving model checkpoint to ./phi2-results2/tmp-checkpoint-1000
tokenizer config file saved in ./phi2-results2/tmp-checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./phi2-results2/tmp-checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


TrainOutput(global_step=1000, training_loss=0.7951698360443116, metrics={'train_runtime': 3521.6456, 'train_samples_per_second': 13.63, 'train_steps_per_second': 0.284, 'total_flos': 9.75040447899648e+16, 'train_loss': 0.7951698360443116, 'epoch': 0.61})

For this demonstration, I only fine-tuned for 1,000 max_steps with an effective batch size of 48 (4 per device × 12 gradient-accumulation steps), owing to the long training time, as I was only interested in trying it out as a POC. These values are reasonably good for fine-tuning, but it is recommended to fine-tune Phi-2 for more epochs (at least 5) with a total batch size of at least 24 to get much better results.

Note also that I used packing=False, but we can set packing=True to concatenate several training examples into a single sequence, which tends to speed up fine-tuning (see the sketch below).
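
If we wanted to try packing, only the trainer construction changes. A minimal sketch, assuming the same model, dataset, tokenizer, and training arguments as above:

# Same setup as before, but packing short examples into sequences of max_seq_length
packed_trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset["train"],
        peft_config=peft_config,
        dataset_text_field="inputs",
        max_seq_length=1024,
        tokenizer=tokenizer,
        args=training_arguments,
        packing=True
)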

The fine-tuning was completed in approximately 2 hours using Google Colab’s T4 (which is a slow GPU).

Important files saved as part of the fine-tuning:

  • Saving model checkpoint to ./phi2-results2/tmp-checkpoint-1000

  • tokenizer config file saved in ./phi2-results2/tmp-checkpoint-1000/tokenizer_config.json

  • Special tokens file saved in ./phi2-results2/tmp-checkpoint-1000/special_tokens_map.json

Test inference with the fine-tuned adapter:

base_model_id = "microsoft/phi-2"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)

compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          base_model_id, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}
)
adapter = "/content/phi2-results2/checkpoint-1000"
model = PeftModel.from_pretrained(model, adapter)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/config.json
Model config PhiConfig {
  "_name_or_path": "microsoft/phi-2",
  "activation_function": "gelu_new",
  "architectures": [
    "PhiForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/phi-2--configuration_phi.PhiConfig",
    "AutoModelForCausalLM": "microsoft/phi-2--modeling_phi.PhiForCausalLM"
  },
  "embd_pdrop": 0.0,
  "flash_attn": false,
  "flash_rotary": false,
  "fused_dense": false,
  "img_processor": null,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "phi-msft",
  "n_embd": 2560,
  "n_head": 32,
  "n_head_kv": null,
  "n_inner": null,
  "n_layer": 32,
  "n_positions": 2048,
  "resid_pdrop": 0.1,
  "rotary_dim": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.36.2",
  "vocab_size": 51200
}

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning.
loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/model.safetensors.index.json
Instantiating PhiForCausalLM model under default dtype torch.float16.
Generate config GenerationConfig {}

Generate config GenerationConfig {}

Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%
2/2 [00:02<00:00, 1.04s/it]
All model checkpoint weights were used when initializing PhiForCausalLM.

All the weights of PhiForCausalLM were initialized from the model checkpoint at microsoft/phi-2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use PhiForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/generation_config.json
Generate config GenerationConfig {}

Perform Inference

database_schema= 'CREATE TABLE station (city VARCHAR, lat INTEGER)'
user_question = "List all the cities in a decreasing order of each city's stations' highest latitude."
prompt_template = f""""
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
{user_question}
Database Schema:
{database_schema}
### Response:
"""
prompt_template

###### RESPONSE ########
"
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
List all the cities in a decreasing order of each city's stations' highest latitude.
Database Schema:
CREATE TABLE station (city VARCHAR, lat INTEGER)
### Response:
question = "'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking"
context = "CREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)"
#
prompt_template1 = f""""
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
{question}
Database Schema:
{context}
### Response:
"""
prompt_template1

###### RESPOSNE ##########
"
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking
Database Schema:
CREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)
### Response:
context = '''CREATE TABLE Songs (SongId VARCHAR); CREATE TABLE Band (firstname VARCHAR, id VARCHAR); CREATE TABLE Performance (bandmate VARCHAR)'''
question = "Find the first name of the band mate that has performed in most songs."
#
prompt_template2 = f""""
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
{question}
Database Schema:
{context}
### Response:
"""
prompt_template2

######## RESPONSE ########
"
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
Find the first name of the band mate that has performed in most songs.
Database Schema:
CREATE TABLE Songs (SongId VARCHAR); CREATE TABLE Band (firstname VARCHAR, id VARCHAR); CREATE TABLE Performance (bandmate VARCHAR)
### Response:
prompt = []
prompt.append(prompt_template)
prompt.append(prompt_template1)
prompt.append(prompt_template2)
#
for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
  duration += float(time.time() - start_time)
  total_length += len(output)
  tok_sec_prompt = round(len(output)/float(time.time() - start_time),3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print(print_gpu_utilization())
  print(tokenizer.decode(output, skip_special_tokens=False))

tok_sec = round(total_length/duration,3)
print("Average --- %s tokens/seconds ---" % (tok_sec

Response

1. SELECT city, MAX(lat) FROM  station ORDER BY lat DESC

2. SELECT POSITION FROM player WHERE Points > 20 AND Points < 10 AND Ranking < 10

3. SELECT firstname FROM Band JOIN Performance ON Band.id = Performance.bandmate JOIN Songs ON Performance.songid = Songs.songid GROUP BY bandmate ORDER BY COUNT(*) LIMIT 1

As we can see, it’s not perfect yet, but Phi-2 now acts like an assistant and generates answers related to the prompt. Further training would considerably improve the accuracy and relevance of the responses.

Save the finetuned model

# Colab workaround: force the preferred encoding to UTF-8 for subsequent file operations
import locale
locale.getpreferredencoding = lambda: "UTF-8"
#

Login to HuggingFace

from huggingface_hub import notebook_login

notebook_login()

Push the trained model to huggingface

trainer.push_to_hub(commit_message="fine-tuned adapter")
Saving model checkpoint to ./phi2-results2
tokenizer config file saved in ./phi2-results2/tokenizer_config.json
Special tokens file saved in ./phi2-results2/special_tokens_map.json
adapter_model.safetensors: 100%
31.5M/31.5M [00:06<00:00, 5.53MB/s]
events.out.tfevents.1704633677.26119a7fc073.3201.1: 100%
14.1k/14.1k [00:01<00:00, 17.4kB/s]
events.out.tfevents.1704638966.26119a7fc073.3201.2: 100%
11.5k/11.5k [00:01<00:00, 14.2kB/s]
Upload 5 LFS files: 100%
5/5 [00:07<00:00, 7.09s/it]
events.out.tfevents.1704633352.26119a7fc073.3201.0: 100%
5.18k/5.18k [00:01<00:00, 6.42kB/s]
training_args.bin: 100%
4.73k/4.73k [00:01<00:00, 6.15kB/s]
CommitInfo(commit_url='https://huggingface.co/Plaban81/phi2-results2/commit/42c87737a28ad961d21e16be98e6ba0aa7057b38', commit_message='fine-tuned adapter', commit_description='', oid='42c87737a28ad961d21e16be98e6ba0aa7057b38', pr_url=None, pr_revision=None, pr_num=None)

Merge the base model

In order to merge the adapter layers with the base model, I saved the trained model’s checkpoint files to Google Drive and used them for further processing.

from google.colab import drive
drive.mount('/content/drive')
#
import shutil
shutil.move('/content/phi2-results2', '/content/drive/MyDrive/PHI2')

from peft import AutoPeftModelForCausalLM
trained_model = AutoPeftModelForCausalLM.from_pretrained("/content/drive/MyDrive/PHI2/phi2-results2/checkpoint-1000",
                                                         low_cpu_mem_usage=True,
                                                         return_dict=True,
                                                         torch_dtype=torch.float16,
                                                         device_map='auto',)
#
lora_merged_model = trained_model.merge_and_unload()
#
# Save the merged Model into drive
lora_merged_model.save_pretrained("/content/drive/MyDrive/PHI2/phi2-results2/lora_merged_model",safe_serialization=True)
# Save the tokenizer
tokenizer.save_pretrained("/content/drive/MyDrive/PHI2/phi2-results2/lora_merged_model")

Push the Merged Model to HuggingFace

lora_merged_model.push_to_hub(repo_id="Plaban81/phi2-results2",commit_message="merged model")

##### RESPONSE #####
Configuration saved in /tmp/tmprk5sw4ql/config.json
Configuration saved in /tmp/tmprk5sw4ql/generation_config.json
The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /tmp/tmprk5sw4ql/model.safetensors.index.json.
Uploading the following files to Plaban81/phi2-results2: model-00002-of-00002.safetensors,model-00001-of-00002.safetensors,model.safetensors.index.json,config.json,generation_config.json
model-00002-of-00002.safetensors: 100%
577M/577M [00:32<00:00, 33.7MB/s]
model-00001-of-00002.safetensors: 100%
4.98G/4.98G [03:24<00:00, 22.1MB/s]
Upload 2 LFS files: 100%
2/2 [03:25<00:00, 115.25s/it]
CommitInfo(commit_url='https://huggingface.co/Plaban81/phi2-results2/commit/46332d4b8864cebb6c716cb137d94b945bcdfe5d', commit_message='merged model', commit_description='', oid='46332d4b8864cebb6c716cb137d94b945bcdfe5d', pr_url=None, pr_revision=None, pr_num=None)

tokenizer.push_to_hub(repo_id="Plaban81/phi2-results2",commit_message="merged model")

##### RESPONSE #####
tokenizer config file saved in /tmp/tmp3_7aos4y/tokenizer_config.json
Special tokens file saved in /tmp/tmp3_7aos4y/special_tokens_map.json
Uploading the following files to Plaban81/phi2-results2: merges.txt,added_tokens.json,tokenizer_config.json,vocab.json,special_tokens_map.json,tokenizer.json
CommitInfo(commit_url='https://huggingface.co/Plaban81/phi2-results2/commit/3b3ee95aee8254bde0b43bde1a595d7ca81f2422', commit_message='merged model', commit_description='', oid='3b3ee95aee8254bde0b43bde1a595d7ca81f2422', pr_url=None, pr_revision=None, pr_num=None)

Perform Inference on Finetuned Model

from peft import LoraConfig,PeftModel,AutoPeftModelForCausalLM
#set the LoRA configurations
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
)
#
peft_model_id = "Plaban81/phi2-results2"
config = peft_config.from_pretrained(peft_model_id)
#
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             return_dict=True,
                                             load_in_4bit=True,
                                             device_map="auto",
                                             )
#
#### RESPONSE ########
adapter_config.json: 100%
568/568 [00:00<00:00, 44.5kB/s]
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/config.json
The repository for microsoft/phi-2 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/phi-2.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/config.json
Model config PhiConfig {
  "_name_or_path": "microsoft/phi-2",
  "activation_function": "gelu_new",
  "architectures": [
    "PhiForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/phi-2--configuration_phi.PhiConfig",
    "AutoModelForCausalLM": "microsoft/phi-2--modeling_phi.PhiForCausalLM"
  },
  "embd_pdrop": 0.0,
  "flash_attn": false,
  "flash_rotary": false,
  "fused_dense": false,
  "img_processor": null,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "phi-msft",
  "n_embd": 2560,
  "n_head": 32,
  "n_head_kv": null,
  "n_inner": null,
  "n_layer": 32,
  "n_positions": 2048,
  "resid_pdrop": 0.1,
  "rotary_dim": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.36.2",
  "vocab_size": 51200
}

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning.
loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/model.safetensors.index.json
Instantiating PhiForCausalLM model under default dtype torch.float16.
Generate config GenerationConfig {}

Generate config GenerationConfig {}

Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%
2/2 [00:02<00:00, 1.03s/it]
All model checkpoint weights were used when initializing PhiForCausalLM.

All the weights of PhiForCausalLM were initialized from the model checkpoint at microsoft/phi-2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use PhiForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/e35b92df8c544925d84fdab7cc071687bd18a478/generation_config.json
Generate config GenerationConfig {}

tokenizer= AutoTokenizer.from_pretrained(peft_model_id)
#
model = PeftModel.from_pretrained(model,peft_model_id)
#
print(model.get_memory_footprint())

#### RESPONSE####
1815953408
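
get_memory_footprint() reports bytes; converting it (a quick check) shows the 4-bit model plus adapter occupies roughly 1.7 GB, consistent with the earlier estimate for 4-bit loading:

# Convert the reported footprint from bytes to GB
footprint_bytes = 1_815_953_408
print(f"{footprint_bytes / 1024**3:.2f} GB")  # ~1.69 GB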

Generate Response

for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
  duration += float(time.time() - start_time)
  total_length += len(output)
  tok_sec_prompt = round(len(output)/float(time.time() - start_time),3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print(print_gpu_utilization())
  print(f"RESPONSE:\n {tokenizer.decode(output, skip_special_tokens=False)[len(prompt[i]):].split('</')[0]}")

tok_sec = round(total_length/duration,3)
print("Average --- %s tokens/seconds ---" % (tok_sec

Response

Generate config GenerationConfig {}

Generate config GenerationConfig {}

Prompt --- 18.803 tokens/seconds ---
GPU memory occupied: 13611 MB.
None
RESPONSE:
 SELECT city FROM station ORDER BY MAX(lat) DESC

Generate config GenerationConfig {}

Prompt --- 17.474 tokens/seconds ---
GPU memory occupied: 13611 MB.
None
RESPONSE:
 SELECT POSITION FROM player WHERE Points > 20 AND Points < 10 AND Ranking < 10

Prompt --- 13.779 tokens/seconds ---
GPU memory occupied: 13611 MB.
None
RESPONSE:
 SELECT firstname FROM Band JOIN Performance ON Band.id = Performance.bandmate JOIN Songs ON Performance.bandmate = Songs.songid GROUP BY firstname ORDER BY COUNT(*) DESC LIMIT 1

Average --- 15.493 tokens/seconds ---

Conclusion

Phi-2, a compact model, is readily adaptable with QLoRA on standard consumer hardware. A GPU equipped with 6 GB of VRAM suffices, although optimal performance may require a day or two of fine-tuning to develop an effective Phi-2 instructional or conversational model.

References

Phi-2: The surprising power of small language models

microsoft/phi-2 · Hugging Face