
Stanford Alpaca
#

Introduction
#

Stanford Alpaca GitHub Repo

  • Stanford Alpaca is an "instruction-following" LLaMA model.
  • The repo aims to build and share an instruction-following LLaMA model. It contains:
    • The 52K instruction-following dataset used for fine-tuning the model.
    • The code for generating the data.
    • The code for fine-tuning the model.
    • The code for recovering the Alpaca-7B weights from the released weight diff.

Overview
#

  • The current Alpaca 7B model is fine-tuned from a 7B LLaMA model on 52K instruction-following examples generated with the techniques described in the Self-Instruct paper.
  • The Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.
  • Alpaca is still under development, and many limitations remain to be addressed.
  • Alpaca is not yet fine-tuned to be safe and harmless.
  • The initial release contains the data generation procedure, the dataset, and the training recipe.
  • Model weights can be released only if the creators of LLaMA give permission.
  • A live demo is available to help readers better understand the capabilities and limits of Alpaca.
  • Based on the following papers:
    • LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
    • Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)
  • Data Release
    • alpaca_data.json contains the 52K instruction-following examples used for fine-tuning the Alpaca model. The JSON file is a list of dictionaries, and each dictionary contains the following fields: instruction, input, and output (the answer generated by text-davinci-003). A minimal loading sketch follows this list.
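
The snippet below is a minimal sketch (not part of the original repo) that loads alpaca_data.json and inspects one record; the field names follow the description above.

import json

# Load the 52K instruction-following examples released with the repo.
with open("alpaca_data.json") as f:
    records = json.load(f)

print(len(records))  # roughly 52,000 entries

# Each record has an "instruction", an (often empty) "input" giving extra context,
# and an "output" generated by text-davinci-003.
example = records[0]
print(example["instruction"])
print(example["input"])
print(example["output"])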

High-level Activities of the Alpaca Project
#

High-level activities carried out by the Stanford Alpaca team, and the project output:

  1. Data Generation: The team used OpenAI’s text-davinci-003 (a GPT-3.5-class model) to generate a dataset of 52,000 instruction-response pairs. They did this by providing the model with a variety of seed instructions and asking it to produce new instructions and corresponding responses.

  2. Fine-Tuning: They used this generated dataset to fine-tune Meta’s LLaMA model, making it better at following instructions.

  3. Evaluation: The fine-tuned Alpaca model was then evaluated for its ability to follow instructions effectively, comparing its performance to more advanced models.

Output: The project resulted in a fine-tuned version of the LLaMA model, called Alpaca, which is much smaller and more efficient than models like text-davinci-003, yet capable of following instructions well.

Capabilities
#

The model can handle a broad range of instruction-following tasks, behaving similarly to text-davinci-003 on the Self-Instruct evaluation suite.

Data Generation
#

Output
#

  • This process produced an instruction-following dataset with 52K examples at a much lower cost (less than $500).
  • The resulting 52K dataset is also much more diverse than the data released by Self-Instruct.

Process
#

  • Built on the data generation pipeline from Self-Instruct, with the following modifications:
    • Used text-davinci-003 instead of davinci to generate the instruction data.
    • Wrote a new prompt (prompt.txt) that explicitly states the requirements of instruction generation to text-davinci-003.
    • Adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation (see the sketch after this list).
    • Simplified the data generation pipeline by discarding the distinction between classification and non-classification instructions.
    • Generated only a single instance for each instruction.
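
The sketch below illustrates the batch-decoding idea using the legacy OpenAI completions SDK (openai < 1.0); the prompt handling and parsing are simplified assumptions, not the repo’s actual generation code.

import openai  # legacy SDK (< 1.0); assumes OPENAI_API_KEY is set in the environment

# Illustration only: ask text-davinci-003 for a batch of 20 new instructions in a
# single call, which is the aggressive batch decoding that lowered generation cost.
with open("prompt.txt") as f:  # the repo's instruction-generation prompt
    prompt = f.read()

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,       # the prompt asks for 20 numbered instructions
    max_tokens=3072,
    temperature=1.0,
    top_p=1.0,
)

# Simplified parsing assumption: one numbered instruction per line of the completion.
text = response["choices"][0]["text"]
instructions = [line.split(".", 1)[-1].strip() for line in text.splitlines() if line.strip()]
print(len(instructions), "candidate instructions")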

Fine-tuning
#

The team fine-tuned LLaMA-7B and LLaMA-13B using standard Hugging Face training code, with the following hyperparameters:

Hyperparameter    LLaMA-7B    LLaMA-13B
Batch size        128         128
Learning rate     2e-5        1e-5
Epochs            3           5
Max length        512         512
Weight decay      0           0
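
For readers less familiar with the Hugging Face trainer, the following is a minimal sketch (an assumption, not the repo’s actual train.py) of how the LLaMA-7B column above maps onto transformers.TrainingArguments; the per-device batch size and gradient accumulation mirror the 4-GPU command in the next section, giving an effective batch size of 4 x 8 x 4 = 128.

from transformers import TrainingArguments

# Hypothetical mapping of the LLaMA-7B hyperparameters onto TrainingArguments.
# The max length of 512 is enforced on the tokenizer side, not here.
training_args = TrainingArguments(
    output_dir="alpaca-7b-output",   # hypothetical output path
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.0,
    per_device_train_batch_size=4,   # 4 per GPU x 8 accumulation steps x 4 GPUs = 128
    gradient_accumulation_steps=8,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    save_strategy="steps",
    save_steps=2000,
    save_total_limit=1,
    logging_steps=1,
    bf16=True,
)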

Fine-tuning Dependencies and LLaMA Installation
#

To reproduce the fine-tuned model, first install the requirements:

pip install -r requirements.txt

Command to finetune
#

  • Below is a command that fine-tunes LLaMA-7B with the Alpaca dataset on a machine with 4 A100 80GB GPUs in FSDP full_shard mode.
  • The authors were able to reproduce a model of similar quality to the one hosted in their demo with the following command, using Python 3.10.
  • Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following instructions in the PR), and <your_output_dir> with where you want to store your outputs.
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

The same script also works for OPT fine-tuning. Here’s an example for fine-tuning OPT-6.7B:

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path "facebook/opt-6.7b" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True

Addressing OOM
#

  • Naively, fine-tuning a 7B model requires about 7 x 4 x 4 = 112 GB of VRAM: 7B parameters at 4 bytes each, with roughly four copies held for the weights, gradients, and Adam optimizer states.

  • Commands given above enable parameter sharding, so no redundant model copy is stored on any GPU. If you’d like to further reduce the memory footprint, here are some options:

  • Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of longer runtime.

  • DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here’s an example to use DeepSpeed stage-3 with 4 GPUs with both parameter and optimizer offload:

pip install deepspeed
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True

The DeepSpeed library also provides some helpful functions to estimate memory usage.
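
For example, the sketch below calls DeepSpeed’s ZeRO stage-3 live-model estimator; the checkpoint path is the same placeholder used in the commands above.

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Placeholder path: a LLaMA checkpoint converted to Hugging Face format as above.
model = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Prints estimated per-GPU and CPU memory needs for ZeRO stage-3, with and without
# parameter/optimizer offload, for 4 GPUs on a single node.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)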

  • LoRA fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112GB to about 7x4=28GB.
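
LoRA is not part of the original Alpaca training recipe, but a minimal sketch with the Hugging Face peft library (an illustrative assumption, not the repo’s code, targeting the attention projections mentioned above) looks roughly like this:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder path to a Hugging Face-format LLaMA checkpoint (see the conversion step below).
model = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Low-rank adapters on the query/key/value projections; only these small matrices are
# trained, which is why the memory footprint drops so sharply.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the 7B parameters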

Recovering Alpaca Weights
#

  1. Convert Meta’s released weights into Hugging Face format, following this guide: https://huggingface.co/docs/transformers/main/model_doc/llama
  2. Make sure you have cloned the released weight diff to your local machine. The weight diff is located at: https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
  3. Run weight_diff.py with the correct paths, e.g.:

python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>

Once step 3 completes, you should have a directory with the recovered weights, from which you can load the model as follows:
import transformers

# Load the recovered Alpaca-7B weights and tokenizer from the directory produced in step 3.
alpaca_model = transformers.AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
alpaca_tokenizer = transformers.AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
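
As a quick sanity check, you can then generate from the recovered model; the prompt template below follows the Alpaca instruction format from the repo, while the example instruction and decoding settings are illustrative assumptions.

# Alpaca prompt template for instructions without additional input.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
)
inputs = alpaca_tokenizer(prompt, return_tensors="pt")
output_ids = alpaca_model.generate(**inputs, max_new_tokens=256)
print(alpaca_tokenizer.decode(output_ids[0], skip_special_tokens=True))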
