# LLM Ecosystem Introduction: From Model Fine-Tuning to Application Implementation
# Model fine-tuning
Pre-trained LLMs typically possess broad knowledge, but fine-tuning is essential for them to excel in specific tasks. Here are some commonly used LLM fine-tuning tools:
# Axolotl
Axolotl is a tool designed to simplify the fine-tuning of various AI models, supporting multiple configurations and architectures.
Main Features:
- Train various Huggingface models, such as llama, pythia, falcon, mpt
- Supports fullfinetune, lora, qlora, relora, and gptq
- Customize configuration using simple yaml files or CLI rewrite functions
- Load different dataset formats, use custom formats, or built-in tokenized datasets
- Integrated with xformers, flash attention, rope scaling, and multipacking
- Can work with a single GPU or multiple GPUs through FSDP or Deepspeed.
- Easily run locally or in the cloud using Docker
- Record the results and optional checkpoints to wandb or mlflow
Quick Start (requirements: Python >=3.10 and PyTorch >=2.1.1):
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
Usage:
# preprocess datasets - optional but recommended
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml
# finetune lora
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
# inference
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
--lora_model_dir="./outputs/lora-out"
# gradio
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
--lora_model_dir="./outputs/lora-out" --gradio
# remote yaml files - the yaml config can be hosted on a public URL
# Note: the yaml config must directly link to the **raw** yaml
accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml
For more detailed information, please visit the Axolotl project homepage.
# Llama-Factory
Llama-Factory (LLaMA-Factory) is an open-source framework for efficiently fine-tuning a wide range of large language models, not only Llama. It is built on top of the PyTorch and Hugging Face ecosystem and provides efficient training and evaluation tools.
Main Features:
- Multiple models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
- Integration Methods: (Incremental) Pre-training, (Multimodal) Instruction Supervised Fine-tuning, Reward Model Training, PPO Training, DPO Training, KTO Training, ORPO Training, etc.
- Multiple Precisions: 16-bit full parameter fine-tuning, frozen fine-tuning, LoRA fine-tuning, and 2/3/4/5/6/8-bit QLoRA fine-tuning based on AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
- Advanced Algorithms: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ, PiSSA, and Agent fine-tuning.
- Practical Tips: FlashAttention-2, Unsloth, RoPE scaling, NEFTune, and rsLoRA.
- Experiment Monitoring: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
- Fast Inference: OpenAI-style API, browser interface, and command-line interface powered by vLLM.
Performance Metrics
Compared with the official P-Tuning fine-tuning of ChatGLM, LLaMA Factory's LoRA fine-tuning offers up to a 3.7x training speedup and a higher Rouge score on the advertising copy generation task. Combined with 4-bit quantization, LLaMA Factory's QLoRA fine-tuning further reduces GPU memory consumption.
Variable Definition
- Training Speed: Number of samples processed per second during the training phase. (Batch size=4, truncation length=1024)
- Rouge Score: Rouge-2 score on the validation set of the advertising copy generation task. (Batch size=4, truncation length=1024)
- GPU Memory: Peak GPU memory for 4-bit quantization training. (Batch size=1, truncation length=1024)
- We use pre_seq_len=128 in the P-Tuning of ChatGLM and lora_rank=32 in the LoRA fine-tuning of LLaMA Factory.
Quick Start
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
Optional additional dependencies: torch, torch-npu, metrics, deepspeed, bitsandbytes, hqq, eetq, gptq, awq, aqlm, vllm, galore, badam, qwen, modelscope, quality
Tip
When encountering package conflicts, you can use pip install --no-deps -e . to resolve them.
Windows User Guide
If you want to enable quantized LoRA (QLoRA) on the Windows platform, you need to install a precompiled bitsandbytes library, which supports CUDA 11.1 to 12.2. Please choose the appropriate release version according to your CUDA version.
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
If you want to enable FlashAttention-2 on the Windows platform, you need to install a precompiled flash-attn library, which supports CUDA 12.1 to 12.2. Please download and install the version matching your environment from the flash-attention releases.
Ascend NPU User Guide
When installing LLaMA Factory on Ascend NPU devices, specify the additional dependencies with pip install -e ".[torch-npu,metrics]". You also need to install the Ascend CANN Toolkit and Kernels. Please refer to the installation tutorial or use the following commands:
# Please replace URL with the URL corresponding to the CANN version and device model
# Install CANN Toolkit
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install
# Install CANN Kernels
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install
# Set environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
| Dependency | Minimum | Recommended |
|---|---|---|
| CANN | 8.0.RC1 | 8.0.RC1 |
| torch | 2.1.0 | 2.1.0 |
| torch-npu | 2.1.0 | 2.1.0.post3 |
| deepspeed | 0.13.2 | 0.13.2 |
Please use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify the computation device.
If inference cannot proceed normally, try setting do_sample: false.
The following three commands perform LoRA fine-tuning, inference, and merging on the Llama3-8B-Instruct model.
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
For more detailed information, please visit the Llama-Factory project homepage.
# Firefly
Firefly is an open-source large model training project that supports pre-training, instruction fine-tuning, and DPO for mainstream large models, including but not limited to Qwen2, Yi-1.5, Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. This project supports full parameter training, LoRA, QLoRA efficient training, and supports pre-training, SFT, DPO. If your training resources are limited, we strongly recommend using QLoRA for instruction fine-tuning, as we have validated the effectiveness of this method on the Open LLM Leaderboard and achieved very good results.
Main Features:
- 📗 Supports pre-training, instruction fine-tuning, DPO, full parameter training, LoRA, QLoRA efficient training. Train different models through configuration files, allowing beginners to quickly get started with model training.
- 📗 Supports using Unsloth to accelerate training and save GPU memory.
- 📗 Supports most mainstream open-source large models, such as Llama3, Gemma, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, aligning with the templates of each official chat model during training.
- 📗 Organize and open-source instruction fine-tuning datasets: firefly-train-1.1M, moss-003-sft-data, ultrachat, WizardLM_evol_instruct_V2_143k, school_math_0.25M.
- 📗 Open-sources the weights of the Firefly series of instruction-tuned models.
- 📗 Validated the effectiveness of the QLoRA training process on the Open LLM Leaderboard.
The project README contains detailed usage instructions covering installation, training, fine-tuning, and evaluation. For more information, please visit the Firefly project homepage.
# XTuner
XTuner is an efficient, flexible, and versatile lightweight large model fine-tuning tool library.
Main Features:
- Efficient
- Supports pre-training and lightweight fine-tuning of large language models (LLMs) and multimodal image-text models (VLMs). XTuner can fine-tune a 7B model with 8GB of GPU memory and also supports multi-node fine-tuning of larger models (70B+).
- Automatically dispatches high-performance operators (such as FlashAttention and Triton kernels) to increase training throughput.
- Compatible with DeepSpeed 🚀, easily apply various ZeRO training optimization strategies.
- Flexible
- Supports pre-training and fine-tuning of the multimodal text-image model LLaVA. The model LLaVA-InternLM2-20B trained with XTuner performs excellently.
- Versatile
- Supports incremental pre-training, instruction fine-tuning, and Agent fine-tuning.
- Provides numerous predefined dialogue templates, supporting conversation with open-source or trained models.
- Trained models can be seamlessly integrated with the deployment toolkit LMDeploy and the large-scale evaluation toolkits OpenCompass and VLMEvalKit.
Quick Start
Installation: create a conda environment, then install XTuner from PyPI (optionally with the DeepSpeed extra) or from source:
conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env
pip install -U xtuner
pip install -U 'xtuner[deepspeed]'
git clone https://github.com/InternLM/xtuner.git
cd xtuner
pip install -e '.[all]'
Fine-tuning
XTuner supports fine-tuning large language models. For dataset preprocessing guidelines, please refer to the documentation.
- Step 0, prepare the configuration file. XTuner provides multiple out-of-the-box configuration files, which users can view with the following command:
xtuner list-cfg
Or, if the provided configuration files do not meet your requirements, export one and modify it accordingly:
xtuner copy-cfg ${CONFIG_NAME} ${SAVE_PATH}
vi ${SAVE_PATH}/${CONFIG_NAME}_copy.py
- Step 1, start fine-tuning.
xtuner train ${CONFIG_NAME_OR_PATH}
For example, we can use the QLoRA algorithm to fine-tune InternLM2.5-Chat-7B on the oasst1 dataset:
# Single card
xtuner train internlm2_5_chat_7b_qlora_oasst1_e3 --deepspeed deepspeed_zero2
# Multi-card
(DIST) NPROC_PER_NODE=${GPU_NUM} xtuner train internlm2_5_chat_7b_qlora_oasst1_e3 --deepspeed deepspeed_zero2
(SLURM) srun ${SRUN_ARGS} xtuner train internlm2_5_chat_7b_qlora_oasst1_e3 --launcher slurm --deepspeed deepspeed_zero2
--deepspeed indicates using DeepSpeed 🚀 to optimize the training process. XTuner has multiple built-in strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. To disable this feature, simply remove this parameter.
- For more examples, please refer to the documentation.
- Step 2, convert the saved PTH model (if using DeepSpeed, it will be a folder) to a HuggingFace model:
xtuner convert pth_to_hf ${CONFIG_NAME_OR_PATH} ${PTH} ${SAVE_PATH}
For more detailed information, please visit the XTuner project homepage.
# Model quantization
LLMs are usually large and computationally demanding. Model quantization techniques can compress model size, improve runtime efficiency, and make deployment easier:
# AutoGPTQ
AutoGPTQ is a large language model quantization toolkit based on the GPTQ algorithm, simple to use and with a user-friendly interface.
Quick Installation
- For CUDA 11.7:
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/
- For CUDA 11.8:
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
- For RoCm 5.4.2:
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm542/
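As a hedged sketch following AutoGPTQ's basic-usage pattern (the model name, calibration text, and output directory below are placeholders):
```python
# Minimal AutoGPTQ quantization sketch (model name, calibration text, and paths are placeholders)
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "facebook/opt-125m"          # any causal LM on the Hugging Face Hub
quantized_dir = "opt-125m-4bit-gptq"      # where to save the quantized weights

tokenizer = AutoTokenizer.from_pretrained(pretrained)
# A real calibration set should contain many representative text samples
examples = [tokenizer("AutoGPTQ is a quantization toolkit based on the GPTQ algorithm.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)
model.quantize(examples)                  # run GPTQ calibration and quantization
model.save_quantized(quantized_dir)
tokenizer.save_pretrained(quantized_dir)
```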
For more detailed information, please visit the project homepage of AutoGPTQ.
# AutoAWQ
AutoAWQ is another automated model quantization tool that supports multiple quantization precisions and offers flexible configuration options, allowing adjustments based on different hardware platforms and performance requirements.
AutoAWQ is an easy-to-use 4-bit quantization model package. Compared to FP16, AutoAWQ can increase model speed by 3 times and reduce memory requirements by 3 times. AutoAWQ implements the activation-aware weight quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved based on the original work AWQ from MIT.
Installation Method:
Before installing, ensure that CUDA >= 12.1 is installed (Note: The following is just the quickest installation method)
pip install autoawq
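As a hedged sketch of the typical quantize-then-save flow (the model path, output directory, and quantization settings below are illustrative assumptions):
```python
# Minimal AutoAWQ quantization sketch (model path and settings are illustrative)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder model
quant_path = "mistral-7b-instruct-awq"              # output directory

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibration + 4-bit quantization
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```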
For more details and examples, please visit the project homepage of AutoAWQ.
# Neural Compressor
Neural Compressor is a model compression toolkit developed by Intel, supporting popular model compression techniques on all major deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet).
Installation Method:
pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
For more detailed information and examples, please visit the project homepage of Neural Compressor.
# Model deployment
Deploying a trained LLM to a production environment is crucial. Here are some commonly used LLM deployment tools:
# vLLM
vLLM is a fast and easy-to-use LLM inference service library.
Main Features:
- Fast
- SOTA service throughput
- Efficiently manage attention key-value memory using PagedAttention
- Continuous batching of incoming requests
- Use CUDA/HIP graphs for acceleration
- Quantization: Supports GPTQ, AWQ, SqueezeLLM, FP8 KV cache
- Optimized CUDA kernel
- Flexible
- Seamless integration with popular Hugging Face models
- Provide high-throughput services using various decoding algorithms (including parallel sampling, beam search, etc.)
- Provide tensor parallel support for distributed inference
- Stream output
- OpenAI-compatible API server
- Supports NVIDIA GPUs, AMD GPUs, and Intel CPUs and GPUs
- (Experimental) Support prefix caching
- (Experimental) Multi-LoRA support
- Seamless Support
- Transformer-based models, such as Llama
- MoE-based model, such as Mixtral
- Multimodal models, such as LLaVA
Quick Installation:
pip install vllm
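As a minimal sketch of offline batch inference with vLLM's Python API (the model name below is a placeholder; any supported Hugging Face causal LM would work):
```python
# Minimal vLLM offline inference sketch (model name is a placeholder)
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Large language models are",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")        # downloads the model from the Hugging Face Hub
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```
The OpenAI-compatible server mentioned in the feature list can be started separately (for example with python -m vllm.entrypoints.openai.api_server --model <model>) and then queried with the standard openai client.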
For more detailed information, please refer to the vLLM official documentation.
# SGLang
SGLang is a structured generation language designed specifically for large language models (LLMs). By co-designing the front-end language and the runtime system, it makes your interactions with LLMs faster and more controllable.
Main Features:
- Flexible front-end language: Easily write LLM applications through chainable generation calls, advanced prompts, control flow, multiple modes, concurrency, and external interaction.
- High-performance backend runtime: Features RadixAttention capability, which can accelerate complex LLM programs by reusing KV cache across multiple calls. It can also function as a standalone inference engine, implementing all common techniques (such as continuous batching and tensor parallelism).
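As a hedged sketch of the front-end language and runtime described above (it assumes an SGLang server is already running locally; the endpoint, the model behind it, and the generation parameters are illustrative):
```python
# SGLang front-end sketch: a simple question-answering program
# (assumes an SGLang server is already running locally, e.g. on port 30000)
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Point the front-end at the running runtime
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = qa.run(question="What is RadixAttention?")
print(state["answer"])
```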
For more detailed information, please visit the SGLang project homepage.
# SkyPilot
SkyPilot is a flexible cloud LLM deployment tool launched by UC Berkeley RISELab, supporting multiple cloud platforms and hardware accelerators. It can automatically select the optimal deployment plan and provide cost optimization features.
Main Features:
- Multi-cloud support: Supports various cloud platforms such as AWS, GCP, Azure, allowing users to choose the appropriate deployment environment.
- Easy to scale: queue and run many jobs with automatic management
- Easy Access to Object Storage: Easily access object storage (S3, GCS, R2)
For more detailed information, please visit the SkyPilot project homepage.
# TensorRT-LLM
TensorRT-LLM is a high-performance LLM inference engine launched by NVIDIA, capable of fully utilizing GPU accelerated computation and optimized for the Transformer model architecture, significantly improving inference speed.
TensorRT-LLM provides users with an easy-to-use Python API for defining large language models (LLMs) and building TensorRT engines, which incorporate state-of-the-art optimization techniques for efficient inference execution on NVIDIA® graphics processors. TensorRT-LLM also includes components for creating Python and C++ runtimes that execute these TensorRT engines.
For more details, please visit the TensorRT-LLM project homepage.
# OpenVINO
OpenVINO™ is an open-source toolkit for optimizing and deploying artificial intelligence inference.
Main Features:
- Inference Optimization: Enhance the performance of deep learning in computer vision, automatic speech recognition, generative AI, natural language processing using large and small language models, and many other common tasks.
- Flexible Model Support: Models trained with popular frameworks such as TensorFlow, PyTorch, ONNX, Keras, and PaddlePaddle. Convert and deploy models without the need for the original framework.
- Broad Platform Compatibility: Reduce resource requirements and efficiently deploy across a range of platforms from edge to cloud. OpenVINO™ supports inference on CPUs (x86, ARM), GPUs (integrated and discrete GPUs supporting OpenCL), and AI accelerators (Intel NPU).
- Community and Ecosystem: Join an active community contributing to improving deep learning performance in various fields.
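As a hedged sketch of the "convert and deploy without the original framework" workflow above, one common route for LLMs is the separate Optimum Intel integration (an assumption beyond the core toolkit; the model name is a placeholder):
```python
# Running a Hugging Face causal LM through OpenVINO via Optimum Intel
# (requires `pip install optimum[openvino]`; the model name is a placeholder)
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch model to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("OpenVINO makes inference", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```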
For more information, please visit the project homepage of OpenVINO.
# TGI
Text Generation Inference (TGI) is a toolkit for deploying and serving large language models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and others.
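Once a TGI server is running and serving one of these models, it can be queried from Python; the sketch below uses the huggingface_hub InferenceClient, and the local endpoint URL is an assumption:
```python
# Querying a running TGI endpoint from Python
# (assumes a TGI server is already serving a model at http://localhost:8080)
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
result = client.text_generation(
    "What is Text Generation Inference?",
    max_new_tokens=64,
)
print(result)
```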
TGI has implemented many features; detailed information can be found on the project homepage of TGI.
# Local run
Thanks to model compression and optimization techniques, we can also run LLM on personal devices:
# MLX
MLX is a machine learning framework designed for Apple silicon. It uses Metal to accelerate computation and provides easy-to-use APIs, making it straightforward for developers to run LLMs on Apple devices and integrate them into their applications.
Main Features:
- Similar APIs: MLX's Python API is very similar to NumPy. MLX also has fully featured C++, C, and Swift APIs that closely mirror the Python API. Higher-level packages such as mlx.nn and mlx.optimizers have APIs very close to PyTorch, simplifying the construction of more complex models.
- Composable Function Transformations: MLX supports composable function transformations for automatic differentiation, automatic vectorization, and computation graph optimization.
- Lazy Evaluation: In MLX, computations only materialize arrays when needed.
- Dynamic Graph Construction: In MLX, the computational graph is dynamically constructed. Changing the shape of function parameters does not slow down the compilation speed, and debugging is simple and intuitive.
- Multi-device: Operations can run on any supported device (currently CPU and GPU).
- Unified Memory: The unified memory model is a significant difference between MLX and other frameworks. Arrays in MLX reside in shared memory. Operations on MLX arrays can be performed on any supported device type without the need to transfer data.
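A small sketch of the NumPy-like API, lazy evaluation, and composable transformations listed above (assuming MLX is installed via pip install mlx on an Apple silicon machine):
```python
# MLX basics: NumPy-like arrays, lazy evaluation, and automatic differentiation
# (requires `pip install mlx` on Apple silicon)
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = a * b + 1.0        # builds a lazy computation graph; nothing is computed yet
mx.eval(c)             # materializes the result
print(c)

def loss(w):
    return mx.sum((w * a - b) ** 2)

grad_fn = mx.grad(loss)                     # composable function transformation
print(grad_fn(mx.array([0.5, 0.5, 0.5])))   # gradient of loss w.r.t. w
```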
MLX is designed by machine learning researchers for machine learning researchers. The framework aims to be user-friendly while still training and deploying models efficiently, and its design is conceptually simple, with the goal of making it easy for researchers to extend MLX and quickly explore new ideas. For more details, please visit the project homepage of MLX.
# Llama.cpp
Llama.cpp is an LLM inference engine implemented in C/C++ that runs efficiently on CPUs and supports multiple operating systems and hardware platforms, allowing developers to run LLMs on resource-constrained devices.
Main Features:
- CPU Inference: Optimized for CPU platforms, allowing LLM to run on devices without a GPU.
- Cross-platform support: Supports multiple operating systems such as Linux, macOS, Windows, making it convenient for users to use on different platforms.
- Lightweight Deployment: The compiled binary files are small, making it convenient for users to deploy and use.
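llama.cpp itself is driven from the command line, but to keep the examples in this article in Python, the sketch below uses the separate community llama-cpp-python bindings (an assumption beyond the C++ project itself; the GGUF model path is a placeholder):
```python
# Running a GGUF model on CPU via the llama-cpp-python bindings
# (a separate community package: `pip install llama-cpp-python`; the model path is a placeholder)
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)
output = llm(
    "Q: What is llama.cpp best suited for? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```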
For more detailed information, please visit the project homepage of Llama.cpp.
# Ollama
As introduced in the article "Ollama: From Beginner to Advanced", Ollama is a tool for building large language model applications. It provides a simple, easy-to-use command-line interface and server, allowing you to easily download, run, and manage various open-source LLMs. Unlike traditional LLM setups that require complex configuration and powerful hardware, Ollama lets you experience the power of LLMs as conveniently as using a mobile app.
Main Features:
- Simple and Easy to Use: Ollama provides a simple and easy-to-use command line tool for users to download, run, and manage LLM.
- Multiple models: Ollama supports various open-source LLMs, including Qwen2, Llama3, Mistral, etc.
- Compatible with OpenAI Interface: Ollama supports the OpenAI API interface, making it easy to switch existing applications to Ollama.
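Because of that OpenAI-compatible interface, the standard openai Python client can talk to a local Ollama server; in the sketch below the default port and the llama3 model name are assumptions:
```python
# Talking to a local Ollama server through its OpenAI-compatible API
# (assumes `ollama serve` is running and the llama3 model has been pulled)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client but not checked by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain what Ollama does in one sentence."}],
)
print(response.choices[0].message.content)
```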
For more details, please visit the project homepage of Ollama.
# Agent and RAG framework
Combining LLM with external data and tools can build more powerful applications. Here are some commonly used Agent and RAG frameworks:
# LlamaIndex
LlamaIndex (GPT Index) is a data framework for LLM applications. Building applications with LlamaIndex typically requires using LlamaIndex core and a selected set of integrations (or plugins). There are two ways to build applications with LlamaIndex in Python:
- Starter: llama-index (https://pypi.org/project/llama-index/). A Python starter package that includes the core LlamaIndex and a selection of integrations.
- Customized: llama-index-core (https://pypi.org/project/llama-index-core/). Install the core LlamaIndex and add the LlamaIndex integration packages your application needs from LlamaHub. There are currently over 300 LlamaIndex integration packages that work seamlessly with the core, allowing you to build with your preferred LLM, embedding, and vector store providers.
The LlamaIndex Python library is namespaced so that import statements containing core indicate that the core package is being used, while those without core indicate that an integration package is being used.
# typical pattern
from llama_index.core.xxx import ClassABC # core submodule xxx
from llama_index.xxx.yyy import (
SubclassABC,
) # integration yyy for submodule xxx
# concrete example
from llama_index.core.llms import LLM
from llama_index.llms.openai import OpenAI
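Building on the starter package, a minimal retrieval-augmented query might look like the sketch below (it assumes documents in a local ./data directory and an OpenAI API key in the environment, since the default LLM and embedding integrations use OpenAI):
```python
# Minimal LlamaIndex RAG sketch using the starter package
# (assumes a ./data directory with documents and an OPENAI_API_KEY in the environment)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # load local files
index = VectorStoreIndex.from_documents(documents)      # embed and index them
query_engine = index.as_query_engine()

response = query_engine.query("What do these documents say about fine-tuning?")
print(response)
```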
# CrewAI
CrewAI is a framework for building AI Agents that can integrate LLM with other tools and APIs to accomplish more complex tasks, such as automating web operations, generating code, and more.
Main Features:
- Role-Based Agent Design: You can customize agents using specific roles, goals, and tools.
- Delegation between Autonomous Agents: Agents can autonomously delegate tasks to other agents and query information from each other, thereby improving problem-solving efficiency.
- Flexible task management: Customizable tools can be used to define tasks and dynamically assign tasks to agents.
- Process-Driven: The system is process-centered, currently supporting sequential task execution and hierarchical processes. In the future, it will also support more complex processes, such as negotiation and autonomous processes.
- Save output as file: Allows saving the output of a single task as a file for later use.
- Parse output to Pydantic or Json: It is possible to parse the output of a single task into a Pydantic model or Json format for easy subsequent processing and analysis.
- Support for Open-Source Models: You can run your agent team with OpenAI models or open-source models. For more information on configuring agents and model connections, including how to connect to a locally running model, see Connecting crewAI to Large Language Models.
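As a hedged sketch of the role-based design described in this list (the role, goal, and task text are made up, and a configured LLM, such as an OpenAI key in the environment, is assumed):
```python
# Minimal CrewAI sketch: one agent, one task, sequential execution
# (role/goal/task text are illustrative; an LLM such as OpenAI must be configured)
from crewai import Agent, Task, Crew

researcher = Agent(
    role="LLM ecosystem researcher",
    goal="Summarize tooling options for fine-tuning open-source LLMs",
    backstory="You track open-source LLM tooling and write concise summaries.",
)

summary_task = Task(
    description="List three popular fine-tuning frameworks and one key feature of each.",
    expected_output="A short bulleted list.",
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[summary_task])
result = crew.kickoff()
print(result)
```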
For more detailed information, please visit the project homepage of CrewAI.
# OpenDevin
OpenDevin is an autonomous software engineer platform powered by artificial intelligence and LLMs.
OpenDevin agents collaborate with human developers to write code, fix bugs, and release features.
For more information, please visit the project homepage of OpenDevin.
# Model evaluation
In order to select a suitable LLM and evaluate its performance, we need to conduct model evaluation:
# LMSys
LMSys Org is an open research organization founded by students and faculty from the University of California, Berkeley, in collaboration with the University of California, San Diego, and Carnegie Mellon University.
Its goal is to make large models accessible to everyone through the joint development of open models, datasets, systems, and evaluation tools. The organization trains large language models and makes their applications widely available, while also developing distributed systems to accelerate LLM training and inference.
Currently, the LMSYS Chatbot Arena is one of the most widely recognized large model leaderboards, acknowledged by many companies and research institutions.
Leaderboard address: https://arena.lmsys.org/
# OpenCompass
OpenCompass is an LLM evaluation platform that supports various models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) on over 100 datasets.
# Open LLM Leaderboard
Open LLM Leaderboard is a continuously updated LLM ranking list that ranks different models based on multiple evaluation metrics, making it convenient for developers to understand the latest model performance and development trends.
Leaderboard address: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
# Summary
The LLM ecosystem is thriving, covering all aspects from model training to application implementation. With continuous technological advancements, it is believed that LLM will play a more important role in more fields, bringing us a more intelligent application experience.