Welcome to PandaLLMOps!
PandaLLMOps is an open-source Python framework for training and inference of large language models (LLMs).
Note
This tutorial is under active development.
Contents
Quick Start
We take the deployment and training of Panda-13B as an example.
Installation
Download our code from GitHub:
$ git clone https://github.com/dandelionsllm/pandallm
Install the requirements in a new environment:
$ conda create -n pandallm python=3.10
$ conda activate pandallm
(pandallm) $ pip install -r requirements.txt
(pandallm) $ mkdir pretrained_model
Quick Deployment
Download LLaMA-13B from Huggingface. Then download our delta model from Huggingface. Since the model files are too large for git clone, you may need to download them manually from here.
(pandallm) $ mkdir delta-models
(pandallm) $ cd delta-models
(pandallm) $ git clone https://huggingface.co/chitanda/llama-panda-13b-zh-wudao-chat-delta
Move the downloaded files to the corresponding directory.
(pandallm) $ cd ..
(pandallm) $ mv delta-models/ ./
Convert the delta model to a pretrained model. Replace ${PATH_TO_YOUR_MODEL} with the path to the LLaMA-13B model you downloaded; the converted Panda-13B model will be saved to ./pretrained_model/panda-13B.
(pandallm) $ python apply_delta.py --base_model ${PATH_TO_YOUR_MODEL} --target_model ./pretrained_model/panda-13B --delta_model ./delta-models/llama-panda-13b-zh-wudao-chat-delta/checkpoint-3000-delta
Run the following command to deploy the chatbot.
(pandallm) $ python run_chat.py --model_path ./pretrained_model/panda-13B --query "write a poem"
Quick Train
Before you train the model with the following commands, make sure you have finished the installation.
Prepare the training data. You can download the training data from here. Please put the data folders at ./dataset. Then run the following command to train the model:
(pandallm) $ PAD_TOKEN="</s>" deepspeed --include localhost:0,1,2,3,4,5,6,7 trainer_base_ds_mul.py -cp conf/llama/zh/ -cn llama_13b_zh_instruct_sft_combine_v1_0_ds
If you have fewer than \(8\) GPUs, you can change the --include parameter to match the GPUs you have, e.g. "--include localhost:0,1,2,3" if you have \(4\) GPUs on one server.
Train your LLM
PandaLLM enables efficient training of various LLMs by leveraging the DeepSpeed acceleration framework and the FairScale parallelization framework. You can train your LLM with a customized configuration using the following command:
(pandallm) $ python train.py --model llama-7b
When you execute the train.py script, it automatically generates a training configuration file at ./conf/tmp.yaml based on the configuration template located at ./conf/template.yaml. Subsequently, the script initiates the training process by executing ./trainer_torch_fsdp_wandb.py. If you prefer to train your model with a personalized configuration, you can execute the following command:
(pandallm) $ python train.py --conf_path ${PATH_TO_YOUR_CONF_FILE}
In the forthcoming sections, we provide a comprehensive overview of the workflow involved in training an LLM using the train.py script.
Preliminary about Hydra
In this project, we use Hydra with YAML files to configure all experiments, including both training and inference. Although we provide scripts to generate configuration files automatically, you may need a basic understanding of how we use Hydra to manage our experiments.
Beyond simple hyper-parameter configuration, the main Hydra feature we rely on is dynamic function calling, which enables decoupled module implementations, including the training and inference workflows, data processing, and model initialization.
Another way to implement this is module registration, as in Fairseq or OpenMMLab. However, registration requires loading all registered modules at startup, which leads to high latency as the project grows and makes fast iteration difficult.
Now, let’s take a look at an example of data loading. In general_util.training_utils, we use load_and_cache_examples to load the dataset. There you can find the following code snippet that initializes the dataset:
dataset = hydra.utils.call(cfg.read_tensor, file_path=file_path, tokenizer=tokenizer)
where cfg.read_tensor points to a field in the configuration as follows:
read_tensor:
  _target_: data.collators.zh_instruct.TextDatasetUnifyV3
  pair_file_list: data/files/c4/en/p25/partition_*.json
Here, the _target_ field refers to the path of the function you want to call at runtime, followed by its keyword arguments. _target_ can also point to a class (as in the example above), in which case the __init__ method of the class is called. Some parameters can also be passed directly to the hydra.utils.call method. This is what you should take care of by defining a common interface shared by all modules.
Benefiting from this feature, you can define any workload yourself, as long as it returns a Dataset object, without explicitly importing it in the main script.
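For example, a minimal custom dataset (the module path, class name, and fields below are hypothetical) only needs to follow the shared interface of accepting file_path and tokenizer and returning a Dataset:

# data/collators/my_corpus.py (hypothetical module path)
import glob
import json

from torch.utils.data import Dataset


class MyTextDataset(Dataset):
    """A minimal dataset that can be referenced through a Hydra `_target_` field."""

    def __init__(self, file_path: str, tokenizer, pair_file_list: str = None):
        # `file_path` and `tokenizer` are injected by `hydra.utils.call(...)`;
        # `pair_file_list` would come from the YAML config itself.
        self.tokenizer = tokenizer
        self.samples = []
        for name in sorted(glob.glob(pair_file_list or file_path)):
            with open(name, "r", encoding="utf-8") as f:
                self.samples.extend(json.loads(line) for line in f)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

The corresponding configuration entry would then set _target_: data.collators.my_corpus.MyTextDataset, and the dataset would be picked up without any change to the main script.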
Data preparation
The first step is to prepare the training datasets for PandaLLM. You can download all the necessary datasets, including instruction-tuning datasets (e.g., Wiki-zh) and pretraining datasets (e.g., Wudao), here. Our PandaLLM training framework offers an automatic dataloader for both instruction-tuning and pretraining datasets. The datasets should be in the .json format.
In the example below, an instruction-tuning example consists of the "input" and "target" fields. Notably, the instruction prompt "input" is masked during training and does not contribute to gradient backpropagation.
{"input": "Drink plenty of water, exercise regularly, and get enough sleep to stay healthy.",
"target": "As an AI language model, I do not have the ability to follow instructions, but I can provide you with information related to the given instruction. Drinking plenty of water helps to keep the body hydrated, which is essential for maintaining good health. Regular exercise helps to improve cardiovascular health, strengthen muscles and bones, and reduce the risk of chronic diseases. Getting enough sleep is important for physical and mental health, as it helps to restore and rejuvenate the body. Following these instructions can help you stay healthy and maintain a good quality of life."}
In the example below, a pretraining example consists of the "title" and "content" fields. During training, we concatenate the "title" and "content" together and feed the result as a whole into the LLM.
{"title": "Singapore lion dance troupe clinches gold at Genting championship, breaking Malaysia's 13-year winning streak",
"content": "Original Title: Singapore lion dance troupe clinches gold at Genting championship, breaking Malaysia's 13-year winning streak The winning team from Singapore Yiwei Athletic Association impressed judges with its flexibility in pile jumping and successfully presenting various expressions on the lion. SINGAPORE: A lion dance troupe from Singapore emerged champion at the Genting World Lion Dance Championships on Sunday (Aug 6), breaking a 13-year winning streak held by Malaysian teams. Singapore's Yiwei Athletic Association fielded two teams to compete at the three-day championship organised by Resorts World Genting in Malaysia. Its Team B secured the win with 9.73 points at the finals on Sunday afternoon, thanks to its flexibility in pile jumping, successfully navigating challenging movements on the tightrope, as well as being able to present the lion's expressions of joy, anger, surprise, and doubt, according to a China Press report. Meanwhile, the association's Team A came in third with 9.58 points. The Khuan Loke Dragon and Lion Dance Association from Selangor in Malaysia was second with 9.64 points. The triumph caps a string of wins by Yiwei over the past years. A team from the association won the first Prime Minister’s Cup International High Pole Lion Dance Championship in Kuala Lumpur in September last year, taking home the top prize of RM38,000 (US$8,300). The Genting championship, on its 14th run, attracted a total of 36 teams from around the world this year, including the United States, France and Australia. Malaysian troupes held the top spot at the past 13 competitions, reported China Press. The Muar Guansheng Temple Dragon and Lion Dance Troupe from Johor took 12 championships, while the Kedah Hongde Sports Association Dragon and Lion Dance Troupe won one. China Press also said that the winning team will receive US$15,000 in cash, trophies and medals. The first and second runners-up will receive US$8,000 and US$5,000 in cash, alongside trophies and medals."}
For compatibility purposes, please store all instruction-tuning datasets under the ./dataset/instruction_tuning directory, and pretraining datasets under the ./dataset/pretraining directory. If you wish to train LLMs with a custom dataset, you can specify its directory using the following command:
(pandallm) $ python train.py --instruction_tuning_data_dir ${DIR_TO_YOUR_INSTRUCT_DATA} --pretraining_data_dir ${DIR_TO_YOUR_PRETRAIN_DATA}
Please replace ${DIR_TO_YOUR_INSTRUCT_DATA} and ${DIR_TO_YOUR_PRETRAIN_DATA} with the respective directories of your custom instruction-tuning and pretraining datasets.
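As a sketch of how such files might be produced, assuming one JSON object per line (adjust this if your loader expects a single JSON array); the file name and raw records below are hypothetical:

import json

# Hypothetical raw records: only the "input"/"target" keys matter for instruction tuning.
records = [
    {"input": "Drink plenty of water.", "target": "Drinking water helps keep the body hydrated."},
]

# One JSON object per line, stored under the instruction-tuning directory.
with open("./dataset/instruction_tuning/my_sft_data.json", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")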
Additionally, you can further customize the dataloader by specifying the following arguments (an example command follows the list).
- --num_workers: This argument determines the number of worker processes used for data loading during training. Increasing the number of workers can accelerate data loading. The default value is \(2\).
- --prefetch_factor: This argument determines the number of batches to prefetch. Prefetching allows the dataloader to load and prepare the next batches in advance, reducing the waiting time during training. The default value is \(2\).
- --max_seq_length: This argument defines the maximum sequence length allowed for input texts during training. Any input sequence exceeding this length will be truncated or split into multiple parts. The default value is \(2048\).
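For instance, a possible invocation combining these flags (the values are illustrative) is:
(pandallm) $ python train.py --model llama-7b --num_workers 4 --prefetch_factor 4 --max_seq_length 1024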
Models
The PandaLLM framework supports various LLM architectures, and you can specify the model type using the --model argument as shown below:
(pandallm) $ python train.py --model ${MODEL_TYPE}
The supported LLM architectures include the LLaMA family (e.g., llama-7b, llama-13b, and llama-65b) and MPT.
You can finetune an LLM from a custom checkpoint by specifying the "--ckpt_path" argument. For example, to finetune a LLaMA-7B model using the latest checkpoint, execute the following command:
(pandallm) $ python train.py --model llama-7b --ckpt_path pretrain/llama-7b
This command will initiate the fine-tuning process for the llama-7b model, utilizing the specified ./pretrain/llama-7b checkpoint. Besides the LLaMA checkpoints, you can also download all the PandaLLM checkpoints from the official PandaLLM GitHub repository.
To fine-tune your custom LLM model, follow these steps:
1. Convert your LLM checkpoint into the Huggingface format and save it to ./pretrained-models/FOLDER_OF_YOUR_LLM.
2. Execute the following command:
(pandallm) $ python train.py --model llama-7b --ckpt_path ${FOLDER_OF_YOUR_LLM}
This command will initiate the fine-tuning process using the llama-7b model and the checkpoint from your specified directory (./pretrained-models/FOLDER_OF_YOUR_LLM).
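If your checkpoint is already loadable by the transformers library, step 1 can be as simple as the following sketch (the source path is a placeholder; adapt the loading step to wherever your weights come from):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the source checkpoint (assumed to be readable by transformers).
model = AutoModelForCausalLM.from_pretrained("/path/to/your/original/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("/path/to/your/original/checkpoint")

# Save both in the Huggingface format expected by train.py.
model.save_pretrained("./pretrained-models/FOLDER_OF_YOUR_LLM")
tokenizer.save_pretrained("./pretrained-models/FOLDER_OF_YOUR_LLM")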
Optimization
General settings
The PandaLLM framework provides several features for training, including automatic gradient accumulation, NVLAMB optimizer integration, and quantization-aware training based on BitsandBytes. To customize the training hyperparameters, you can specify the following arguments. Here is a description of each argument:
- --per_gpu_train_batch_size: The batch size per GPU during training. The default value is \(1\).
- --per_gpu_eval_batch_size: The batch size per GPU during evaluation. The default value is \(2\).
- --optimizer: The training optimizer. The default value is "AdamW".
- --learning_rate: The learning rate used during training. The default value is \(0.001\).
- --lr_scheduler: The learning rate scheduler, one of "linear", "cosine", "constant", "poly", and "warmup". The default value is "warmup" when the argument is not specified.
- --gradient_accumulation_steps: The number of gradient accumulation steps before performing a backward/update pass. The default value is \(64\).
- --weight_decay: The weight decay applied to all parameters of the model. The default value is \(0.00\).
- --adam_epsilon: The \(\varepsilon\) value for the Adam optimizer. The default value is \(10^{-6}\).
- --adam_betas: The \(\beta\) coefficients used for computing moving averages of gradients and squared gradients in the Adam optimizer. The default value is \((0.9, 0.99)\).
- --max_grad_norm: The maximum norm for gradient clipping. The default value is \(0.3\).
- --num_train_epochs: The total number of training epochs. The default value is \(1\).
- --max_steps: The maximum number of training steps. The default value is \(-1\), indicating no maximum limit.
- --warmup_proportion: The proportion of training steps used for linear learning rate warmup. The default value is \(0\).
- --warmup_steps: The number of warmup steps for learning rate warmup. The default value is \(50\).
- --bit_training: This boolean argument specifies the bit training mode for quantization-aware training. It determines the precision of weights and activations during training. The default value is False.
To finetune a Panda-7B model with a learning rate of \(0.002\) for \(2\) epochs, execute the following command:
(pandallm) $ python train.py --model llama-7b --ckpt_path chitanda/llama-panda-zh-7b-delta --learning_rate 2e-3 --num_train_epochs 2
Low-rank adaptation (LoRA)
PandaLLM supports LoRA finetuning for LLMs. For example, to initiate the training process for the LLaMA-65B model with LoRA, execute the following command:
(pandallm) $ python train.py --model llama-65b --use_lora --lora_r 64 --lora_alpha 16 --lora_dropout 0.05
You can customize the behavior of LoRA during the training of LLMs by specifying the following arguments (a conceptual sketch follows the list).
- --use_lora: This boolean argument enables LoRA (Low-Rank Adaptation) during the training process. When specified, LoRA will be incorporated into the training of LLMs.
- --lora_r: This argument sets the rank of the low-rank update matrices used by LoRA. The default value is set to \(64\).
- --lora_alpha: This argument controls the scaling of the LoRA update, i.e., the strength of the adaptation. The default value is set to \(16\).
- --lora_dropout: This argument specifies the dropout rate applied during LoRA adaptation. Dropout helps to regularize the training process and prevent overfitting. The default value is set to \(0.05\).
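Conceptually, these flags correspond to the standard LoRA hyper-parameters. For readers familiar with the PEFT library, a rough equivalent configuration is sketched below; this is only an illustration, and PandaLLM's internal wiring may differ:

from peft import LoraConfig

# Rough PEFT equivalent of --lora_r 64 --lora_alpha 16 --lora_dropout 0.05.
lora_config = LoraConfig(
    r=64,               # rank of the low-rank update matrices
    lora_alpha=16,      # scaling factor applied to the update
    lora_dropout=0.05,  # dropout on the LoRA branch
    task_type="CAUSAL_LM",
)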
Quantization-aware training
PandaLLM enables quantization-aware training based on the BitsandBytes framework. For example, to train a LLaMA-65B model using the BitsandBytes quantization scheme with \(4\)-bit precision, execute the following command:
(pandallm) $ python train.py --model llama-65b --use_quant
Inference
In this project, we support several approaches for inference:
- HuggingFace Transformers’ naive model parallel
- Deepspeed inference (tensor parallel)
- Tensor parallel (supported by the tensor-parallel pypi package)
HuggingFace Transformers’ Model Parallel
This feature can be enabled simply by specifying device_map when calling the from_pretrained method.
Note that this is not distributed evaluation, so you cannot launch multiple processes when using this feature.
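A minimal sketch of this approach, assuming a converted model at ./pretrained_model/panda-13B and the standard transformers API:

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" spreads the layers across all visible GPUs (naive model parallel).
model = AutoModelForCausalLM.from_pretrained(
    "./pretrained_model/panda-13B",
    device_map="auto",
    torch_dtype=torch.float16,
)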
Deepspeed Inference
Enabling this feature requires a different entry script, ds_inference.py. You can launch the process by:
deepspeed --include localhost:0,1,2,3 ds_inference.py -cp <config path> -cn <config file name>
Note that this will launch multiple processes at the same time; however, this is still not distributed evaluation, since all processes should load exactly the same data (tensor parallel). To this end, we have explicitly set ddp_eval=True in the entry script.
Besides, since tensor parallel requires heavy communication across processes, it is not recommended to use this feature across different nodes. For single-node inference, you may need to test whether it is faster than naive model parallel, which depends on the bandwidth of your machine.
Tensor Parallel
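As listed above, this approach relies on the tensor-parallel pypi package. A minimal sketch of its typical standalone usage is shown below; the model path and device list are placeholders, and the exact integration in this project may differ:

import tensor_parallel as tp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./pretrained_model/panda-13B")
# Shard the model's linear layers across the listed devices (tensor parallelism).
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])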
Deploy your LLM
Coming Soon …
Pipeline Parallelism
Preliminary
Pipeline parallelism (PP) is an advanced technique enabling efficient model-parallel training: different layers are placed on different GPUs, and the forward and backward passes are executed in a pipelined fashion. This reduces the memory usage of each GPU and greatly improves GPU utilization compared with naive model parallelism.
Compared with the extreme case of using ZeRO-2/3 with offload to train models, PP can greatly improve training efficiency and reduce memory usage. In our Panda project, we provide two example implementations for popular open-sourced models, LLaMA and MPT. Since PP requires specific modifications to the original model implementation, we cannot cover all models. So one important goal of this tutorial is to provide a template as well as some instructions, so that you can quickly adapt it to your own use case and focus on the implementation of your model.
Through the former sections, we believe you already have a general understanding of our training pipeline and how Hydra-based dynamic configuration works. In order to adapt PP to your own case, you can generally follow these steps: 1. Create your own data processing pipeline. 2. Create the specific pipeline-parallel wrapping for any model not already included. 3. Run it.
In the following sections, we will introduce the modifications to the main training loop compared with the former one, and tackle possible problems you may encounter while implementing your own model.
Core Code Snippets
Model Implementation
First of all, all current pipeline parallelism implementations require you to use nn.Sequential to re-organize the model, and the inputs/outputs of each layer should be tuples. This is required for asynchronous forward and backward passes. The easiest way to do this is to add a simple wrapper that inherits from the Transformer layer and overrides the forward function to unpack the inputs and pack the outputs. For example, the code snippet of the LLaMA layer is as follows:
class ParallelTransformerLayerPipe(LlamaDecoderLayer):
    def __init__(self, config: LlamaConfig, activation_checkpointing: bool = False):
        super().__init__(config)
        self.activation_checkpointing = activation_checkpointing

    def forward(self, args):
        if self.activation_checkpointing:
            return self._ckpt_forward(args)

        hidden_states, attention_mask, position_ids = args
        outputs = LlamaDecoderLayer.forward(self,
                                            hidden_states,
                                            attention_mask,
                                            position_ids,
                                            )
        return outputs[0], attention_mask, position_ids

    def _ckpt_forward(self, args):
        hidden_states, attention_mask, position_ids = args

        def create_custom_forward(module):
            def custom_forward(*inputs):
                return LlamaDecoderLayer.forward(module, *inputs)
            return custom_forward

        # deepspeed checkpoint auto use outputs[0] if len(outputs) == 1
        outputs = deepspeed.checkpointing.checkpoint(
            create_custom_forward(self),
            hidden_states,
            attention_mask,
            position_ids,
            None,
        )

        return outputs, attention_mask, position_ids
Similarly, you can also implement the wrappers for nn.Embedding and LayerNorm, so that the final list of layers looks like the following:
def get_layers_from_config(model_config, activation_checkpointing: bool = False):
    """
    `tie_word_embeddings` in LLaMA is set to `false`.
    """
    layers = [
        LayerSpec(EmbeddingPipe, model_config.vocab_size, model_config.hidden_size),
        *[LayerSpec(ParallelTransformerLayerPipe, model_config, activation_checkpointing)
          for _ in range(model_config.num_hidden_layers)],
        LayerSpec(LayerNormPipe, model_config.hidden_size, model_config.rms_norm_eps),
        LayerSpec(LMLayerPipe, model_config.hidden_size, model_config.vocab_size, bias=False),
    ]
    return layers
where LayerSpec is a special class provided by DeepSpeed for post-initialization, which we will introduce in the next section.
For the loss function, you can either define a class inheriting from nn.Module and add it to the nn.Sequential or the layer list directly, or define a callable function. The difference between these two approaches is that the input to the former is still the tuple output of the last layer of the model; in this case, you should pass labels through from the first layer to the last layer. For the latter, the input to the loss function is a tuple of (outputs, labels), where outputs comes from the last layer of the model and labels comes directly from the data loader. We provide example cases for the two approaches:
# nn.Module based approach
class LossLayer(torch.nn.Module):
    def forward(self, args):
        logits, labels = args
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        loss_fct = CrossEntropyLoss()
        loss = loss_fct(shift_logits.reshape(-1, shift_logits.size(-1)), shift_labels.reshape(-1))
        return loss


# Function based approach
def loss_fn(outputs, labels):
    logits = outputs
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.reshape(-1, shift_logits.size(-1)), shift_labels.reshape(-1))
    return loss
No matter which method you use, the return value of the collator should be like the following:
return (
    (input_ids, attention_mask, other_inputs),  # The inputs to the first layer
    labels,  # The labels, and will be passed to the loss function.
)
It is indeed a tuple of tuples (a minimal example collator is sketched after the PipelineModule snippet below). For the second case, you should specify the loss function in the DeepSpeed PipelineModule like:
model_pipe = PipelineModule(layers=layers,
                            num_stages=cfg.num_stages,
                            loss_fn=pp_loss_fn,  # Specify the callable loss function here.
                            partition_method=getattr(cfg, "partition_method", "parameters"),
                            activation_checkpoint_interval=getattr(cfg, "activation_checkpoint_interval", 0)
                            )
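For completeness, a minimal collator that produces this tuple-over-tuple format might look like the sketch below; the field names, padding behaviour, and class name are assumptions, so adapt them to your own data:

import torch


class SimplePipelineCollator:
    """Collate pre-tokenized samples into the ((inputs...), labels) format described above."""

    def __init__(self, pad_token_id: int, max_seq_length: int = 2048):
        self.pad_token_id = pad_token_id
        self.max_seq_length = max_seq_length

    def __call__(self, batch):
        # Each sample is assumed to already contain "input_ids" and "labels" token id lists.
        max_len = min(max(len(x["input_ids"]) for x in batch), self.max_seq_length)
        input_ids, attention_mask, labels = [], [], []
        for x in batch:
            ids = x["input_ids"][:max_len]
            pad = [self.pad_token_id] * (max_len - len(ids))
            input_ids.append(ids + pad)
            attention_mask.append([1] * len(ids) + [0] * len(pad))
            labels.append(x["labels"][:max_len] + [-100] * len(pad))  # -100 is ignored by CrossEntropyLoss
        return (
            (torch.tensor(input_ids), torch.tensor(attention_mask)),
            torch.tensor(labels),
        )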
Model initialization
There are two main approaches to initialize the model and load pre-trained weights. The first is to initialize the model using the from_pretrained function; in this case, you may refer to models.llama_ds_mp_wrap.get_model for details. The drawback of this method is that it loads the whole model on every worker, which can exhaust CPU memory when the model is large.
The other method is to first initialize the sharded model with DeepSpeed's LayerSpec class, so that the actual initialization happens after the pipeline-parallel partition. Then each rank only needs to load the pre-trained weights for its own partition:
model_or_config = transformers.AutoConfig.from_pretrained(cfg.model_name_or_path)
layers = models.llama_ds_mp_wrap.get_layers_from_config(model_or_config)
model_pipe = PipelineModule(layers=layers,
                            num_stages=cfg.num_stages,
                            loss_fn=models.llama_ds_mp_wrap.loss_fn,
                            activation_checkpoint_interval=getattr(cfg, "activation_checkpoint_interval", 0)
                            )
...
model.load_checkpoint(cfg.model_name_or_path, load_module_only=True, load_optimizer_states=False, load_lr_scheduler_states=False)
Note that the pre-trained weights should be converted from the HF format using convert2ckpt.py.
Hybrid Training of Pipeline Parallelism (PP) and Distributed Data Parallel (DP)
When dist.world_size > num_stages, hybrid training is automatically enabled. The number of pipeline-parallel (PP) stages is num_stages, while the data-parallel (DP) degree is dist.world_size // num_stages.
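As a quick sanity check of this arithmetic, consider a hypothetical run on 16 GPUs with num_stages=4:

world_size = 16   # dist.world_size: total number of GPUs (hypothetical)
num_stages = 4    # pipeline stages
dp_degree = world_size // num_stages
print(dp_degree)  # 4 data-parallel replicas, each spanning a 4-stage pipeline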
No Weight Tying of Word Embedding
Different from traditional pre-trained language models, LLaMA does not need weight tying. So do not use TiedLayerSpec to wrap the embed_tokens and lm_head modules.
The implementation of MPT includes weight tying, and you can refer to it for details.
Distributed Sampler Setting
When hybrid training of PP and DP is enabled, the DistributedSampler should be carefully set for each rank with respect to its state (PP stage and DP group).
The core code snippet is as follows:
dp_degree = dist.get_world_size() // cfg.num_stages
if dp_degree > 1:
    dp_id = model.grid.get_data_parallel_id()
    sub_train_sampler = DistributedSampler(sub_train_dataset, num_replicas=dp_degree, rank=dp_id)
else:
    sub_train_sampler = RandomSampler(sub_train_dataset)
Data Fetch Design of DeepSpeed and CPU Memory Reduction
In DeepSpeed's design, within a specific PP group, only the first and the last rank, i.e., stage=0 or stage=num_stages - 1, fetch minibatches from the dataloader; the other ranks never fetch data.
Based on this, for the ranks where the dataloader will never be used, we can use placeholders to avoid allocating memory for the real dataset. This is especially useful when training large models. For example, when training LLaMA-65B with offload_optimizer=True and num_stages=8, the CPU memory usage is already nearly 800GB, which will cause CPU memory OOM when you use a large dataset.
The code of the dataset placeholder is as follows:
def load_empty_dataset_and_collator(cfg: DictConfig):
    from data.test import TestDataset
    from data.flan import FlanCollatorOverCollator

    dataset = TestDataset(None, None, getattr(cfg, "total_dataset_len", -1))
    collator = FlanCollatorOverCollator(collator=None,
                                        tokenizer=cfg.model_name_or_path,
                                        max_seq_length=128,
                                        decoder_only=True,
                                        return_standard_inputs=True,
                                        )

    # Keep consistent with `load_and_cache_examples`.
    if getattr(cfg, "dist_load_data_barrier", True):
        dist.barrier()

    if dist.is_initialized():
        dist.barrier()

    return dataset, collator


if model.is_first_stage() or model.is_last_stage():
    sub_train_dataset = load_and_cache_examples(cfg, tokenizer, _split="train", _file=_file)

    if dp_degree > 1:
        dp_id = model.grid.get_data_parallel_id()
        sub_train_sampler = DistributedSampler(sub_train_dataset, num_replicas=dp_degree, rank=dp_id)
    else:
        sub_train_sampler = RandomSampler(sub_train_dataset)

    sub_train_collator = hydra.utils.instantiate(cfg.collator) if "collator" in cfg and cfg.collator else None

    sub_train_dataloader = DataLoader(dataset=sub_train_dataset,
                                      sampler=sub_train_sampler,
                                      batch_size=cfg.train_batch_size,
                                      collate_fn=sub_train_collator,
                                      num_workers=cfg.num_workers,
                                      pin_memory=True,
                                      prefetch_factor=cfg.prefetch_factor,
                                      drop_last=True,
                                      )
else:
    sub_train_dataset, sub_train_collator = load_empty_dataset_and_collator(cfg)
    sub_train_sampler = None

    sub_train_dataloader = DataLoader(dataset=sub_train_dataset,
                                      batch_size=cfg.train_batch_size,
                                      collate_fn=sub_train_collator,
                                      drop_last=True,
                                      shuffle=False)
where TestDataset is an empty dataset and the collator can be any collator that matches the expected input format.
Known Problems and Possible Solutions
BF16 Support
Bfloat16 can be used by setting the following in deepspeed config:
data_types:
  grad_accum_dtype: "fp32"
However, bfloat16 cannot be used together with optimizer offload. Note that pipeline parallelism is designed not to support optimizer offload (see issue [#3866](https://github.com/microsoft/DeepSpeed/issues/3866)). Nevertheless, optimizer offload can still be enabled under fp16 training.
Flash Attention
I could not enable Flash Attention with either the original implementation or torch.nn.functional.scaled_dot_product_attention in PyTorch 2.0. See the issues [here](https://github.com/HuangLK/llama-deepspeed/issues/36) and [here](https://github.com/microsoft/DeepSpeed/issues/3868).
Torch Compile
Torch compile is not yet supported in the template, perhaps because my usage of it is incorrect. Corrections are welcome.
Reference & Acknowledgement
[llama-deepspeed](https://github.com/HuangLK/llama-deepspeed/tree/main)
[ChatGLM-Finetuning](https://github.com/liucongg/ChatGLM-Finetuning)
[DeepSpeed Pipeline Parallelism Tutorial](https://www.deepspeed.ai/tutorials/pipeline/)