Stanford Alpaca#
Introduction#
- Stanford Alpaca is an “instruction-following” LLaMA model.
- This repo aims to build and share an instruction-following LLaMA model. The repo contains:
- The 52K instruction-following data used for fine-tuning the model.
- The code for generating the data.
- The code for fine-tuning the model.
- The code for recovering Alpaca-7B weights from our released weight diff.
Overview#
- The current “Alpaca 7B model” is fine-tuned from a “7B LLaMA” model on 52K instruction-following data generated by the techniques in the Self-Instruct paper.
- Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.
- Alpaca is still under development, and there are many limitations that have to be addressed.
- Alpaca is not yet fine-tuned to be safe and harmless.
- Initial release contains the data generation procedure, dataset, and training recipe.
- Model weights can be released only if the creators of LLaMA give permission.
- A live demo is available to help readers better understand the capabilities and limits of Alpaca.
- Based on the following papers:
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
- Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)
- Data Release
- alpaca_data.json contains the 52K instruction-following examples used for fine-tuning the Alpaca model. This JSON file is a list of dictionaries; each dictionary contains the fields instruction, input, and output (the answer generated by text-davinci-003).
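A minimal sketch of loading and inspecting the data, assuming alpaca_data.json sits in the repo root:

import json

# Load the released instruction-tuning data (path assumed relative to the repo root).
with open("alpaca_data.json") as f:
    data = json.load(f)

print(len(data))  # roughly 52K records
example = data[0]
print(example["instruction"])  # the task description
print(example["input"])        # optional context; an empty string when not needed
print(example["output"])       # the answer generated by text-davinci-003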
High-level Activities of the Alpaca Project#
High-level activities carried out by the Stanford Alpaca team, and the resulting project output:
Data Generation: The team used OpenAI’s text-davinci-003 (a GPT-3.5-series model) to generate a dataset of 52,000 instruction-response pairs. They did this by providing the model with a variety of instructions and asking it to produce corresponding responses.
Fine-Tuning: They used this generated dataset to fine-tune Meta’s LLaMA model, making it better at following instructions.
Evaluation: The fine-tuned Alpaca model was then evaluated for its ability to follow instructions effectively, comparing its performance to more advanced models.
Output: The project resulted in a fine-tuned version of the LLaMA model, called Alpaca, which is smaller, more efficient, and capable of following instructions well.
Capabilities#
This model and its training pipeline cover the following tasks.
Data Generation#
Output#
- This process produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500).
- The dataset of 52K generated examples is much more diverse than the data released by Self-Instruct.
Process#
- Built on the data generation pipeline from Self-Instruct and made the following modifications (a sketch of the batched generation call follows this list):
- Used text-davinci-003 to generate the instruction data instead of davinci.
- Wrote a new prompt (prompt.txt) that explicitly gave the requirement of instruction generation to text-davinci-003.
- Adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- Simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
- Only generated a single instance for each instruction.
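A minimal sketch of the batched generation call, assuming the legacy openai Python SDK (pre-1.0 Completion API) and the repo’s prompt.txt; the decoding parameters shown are illustrative, not the team’s exact settings:

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# prompt.txt asks text-davinci-003 to emit a batch of 20 new instructions,
# seeded with a handful of in-context examples.
with open("prompt.txt") as f:
    prompt = f.read()

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=3072,   # room for roughly 20 instruction/input/output triples
    temperature=1.0,   # illustrative decoding settings
    top_p=1.0,
)
raw_batch = response["choices"][0]["text"]
# raw_batch is then parsed into individual instruction/input/output records.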
Fine-tuning#
The team fine-tuned LLaMA-7B and LLaMA-13B with standard Hugging Face training code, using the following hyperparameters:
| Hyperparameter | LLaMA-7B | LLaMA-13B |
|----------------|----------|-----------|
| Batch size     | 128      | 128       |
| Learning rate  | 2e-5     | 1e-5      |
| Epochs         | 3        | 5         |
| Max length     | 512      | 512       |
| Weight decay   | 0        | 0         |
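These settings map onto standard Hugging Face TrainingArguments. The sketch below, assuming 4 GPUs, shows how the effective batch size of 128 for LLaMA-7B is reached via per-device batch size x gradient accumulation x number of GPUs, mirroring the torchrun command later in this section:

from transformers import TrainingArguments

# Effective batch size = 4 (per device) x 8 (grad accumulation) x 4 (GPUs) = 128
training_args = TrainingArguments(
    output_dir="alpaca-7b-output",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
)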
Fine-tuning Dependencies and LLaMA Installation#
To reproduce the fine-tuned model, first install the requirements:
pip install -r requirements.txt
Command to Fine-tune#
- Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode.
- We were able to reproduce a model of similar quality to the one hosted in our demo with the following command, using Python 3.10.
- Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following instructions in the PR), and <your_output_dir> with where you want to store your outputs.
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True
The same script also works for OPT fine-tuning. Here’s an example for fine-tuning OPT-6.7B:
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path "facebook/opt-6.7b" \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
--tf32 True
Addressing OOM#
Naively, fine-tuning a 7B model requires about 7 x 4 x 4 = 112 GB of VRAM: 7B parameters x 4 bytes each (fp32) x 4 copies (weights, gradients, and the two Adam optimizer states).
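Spelled out as an illustrative back-of-the-envelope calculation:

# 7B parameters x 4 bytes (fp32) x 4 copies (weights, gradients, two Adam moments)
params = 7e9
bytes_per_value = 4
copies = 4
print(params * bytes_per_value * copies / 1e9)  # ~112 GB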
The commands given above enable parameter sharding, so no redundant model copy is stored on any GPU. If you’d like to further reduce the memory footprint, here are some options:
Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of longer runtime.
DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here’s an example that uses DeepSpeed stage-3 on 4 GPUs with both parameter and optimizer offload:
pip install deepspeed
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_offload_opt_param.json" \
--tf32 True
The DeepSpeed library also provides some helpful functions to estimate memory usage.
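For example, a ZeRO stage-3 memory estimate can be printed for a loaded model before launching training. A minimal sketch, assuming the estimator’s location in deepspeed.runtime.zero.stage3 (check your installed DeepSpeed version):

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the model on CPU just to count parameters (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Prints estimated per-GPU and CPU memory needs for ZeRO-3, with and without offload.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)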
- LoRA fine-tunes low-rank slices of the query, key, and value projection heads. This can reduce the total memory footprint from 112 GB to about 7 x 4 = 28 GB.
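The repo itself does not ship LoRA training code; the sketch below shows one way to set this up with the Hugging Face peft library (an assumption, not part of the original recipe), targeting the attention query/key/value projections:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Low-rank adapters on the q/k/v projections; rank, alpha, and dropout are
# illustrative defaults, not values used by the Alpaca team.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients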
Recovering Alpaca Weights#
To recover the Alpaca-7B weights from the released weight diff:
1. Convert Meta’s released weights into Hugging Face format, following this guide: https://huggingface.co/docs/transformers/main/model_doc/llama
2. Make sure you have cloned the released weight diff to your local machine. The weight diff is located at: https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
3. Run the recovery function with the correct paths, e.g.: python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>
Once step 3 completes, you should have a directory with the recovered weights, from which you can load the model as follows:
import transformers
alpaca_model = transformers.AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
alpaca_tokenizer = transformers.AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
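A quick generation check with the recovered weights might look like the following; the prompt shown is illustrative and does not claim to reproduce the exact Alpaca training template:

# Illustrative instruction prompt (format assumed, not taken from the repo).
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nList three uses of a paperclip.\n\n### Response:"
)
inputs = alpaca_tokenizer(prompt, return_tensors="pt")
output_ids = alpaca_model.generate(**inputs, max_new_tokens=128)
print(alpaca_tokenizer.decode(output_ids[0], skip_special_tokens=True))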