1. LLM Inference Learning Resources#
- A detailed guide to LLM inference optimization techniques (inference acceleration methods)
- GitHub: vLLM, a high-performance inference framework
- NVIDIA/FasterTransformer
- Analyzing Transformer models: parameter count, FLOPs, intermediate activations, and KV cache
LLM Deployment Frameworks#
- [54.7k] vllm
- [16.8k] sglang
- [150k] ollama
- [84.5k] llama.cpp
- [21.9k] mlx
- [14.8k] ktransformers
- [0] lmstudio
2. LLM Inference Optimization#
LLM inference is judged on latency, throughput, and cost. Optimizations fall into three groups: changing the model itself, single-node optimization, and distributed optimization.
- Changing the model.
    - Quantization
    - Attention structure (MHA, MQA, MLA, sparse attention, linear attention)
    - FFN structure (MoE)
    - Other components (SiLU, RMSNorm)
    - Speculative decoding.
- Single-node optimization. LLM inference is IO (memory-bandwidth) bound.
    - Operator fusion: QKV fusion, bias fusion (a QKV-fusion sketch follows this list).
    - High-performance operators: FlashAttention, high-performance matrix multiplication (GEMM). Requires working at the kernel level.
    - Memory management: continuous batching, PagedAttention.
- Distributed optimization.
    - Model parallelism: tensor parallelism, pipeline parallelism, expert parallelism.
    - Data parallelism: ZeRO-3.
    - Hardware specialization: separating prefill and decode (generation).
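A minimal sketch of the QKV-fusion idea from the operator-fusion bullet above, assuming PyTorch and toy dimensions (all names and sizes here are illustrative, not tied to any particular framework): concatenating the three projection weights turns three small GEMMs and kernel launches into one larger GEMM.

```python
import torch

hidden, n_heads, head_dim = 1024, 16, 64           # illustrative sizes
x = torch.randn(8, hidden)                          # [batch, hidden]
w_q = torch.randn(hidden, n_heads * head_dim)
w_k = torch.randn(hidden, n_heads * head_dim)
w_v = torch.randn(hidden, n_heads * head_dim)

# Unfused: three separate projections, three kernel launches.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused: concatenate the weights once at load time, run a single GEMM, split the result.
w_qkv = torch.cat([w_q, w_k, w_v], dim=1)           # [hidden, 3 * n_heads * head_dim]
q2, k2, v2 = (x @ w_qkv).split(n_heads * head_dim, dim=1)

assert torch.allclose(q, q2, atol=1e-5) and torch.allclose(v, v2, atol=1e-5)
```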
2.1. Modifying Model Parameters#
2.1.1. Quantization#
Papers#
- [2025.01] QRazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring
- [2024.08] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- [2023.10] LLM-FP4: 4-Bit Floating-Point Quantized Transformers
- [2023.09] Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs AutoRound
- [2023.06] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration AWQ
- [2022.10] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers GPTQ
- [2022.08] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tools#
- [1.8k] vllm-project/llm-compressor
- [730] ModelCloud/GPTQModel
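As a baseline for the post-training quantization papers above, here is a minimal sketch of symmetric per-channel int8 round-to-nearest weight quantization; GPTQ, AWQ, and AutoRound all improve on this baseline (error compensation, activation-aware scaling, learned rounding). Function names and shapes are illustrative assumptions.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization (round-to-nearest baseline).

    w: [out_features, in_features] float32 weight matrix.
    Returns int8 weights plus one float scale per output channel.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)            # [out, 1]
    scale = np.maximum(max_abs / 127.0, 1e-8)                 # map max |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```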
2.1.2. Attention Structure#
2.1.3. Parallel Decoding#
- [2025.03] EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. 5.6x speedup.
- [2024.01] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. Survey.
- [2024.01] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Drafts at the feature (hidden-state) level.
- [2024.01] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Adds several decoding heads, proposes multiple candidate tokens via top-k, and uses tree attention to decide which to accept.
- [2023.10] Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy. Maintains multiple n-grams in a 2D window.
- [2022.11] Fast Inference from Transformers via Speculative Decoding. A small draft model proposes tokens and the large model verifies whether to accept them; total compute is unchanged, but the verification is parallelized (sketched after this list).
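The draft-then-verify loop shared by the papers above, as a minimal greedy sketch (the original speculative-decoding paper uses a stochastic accept/reject rule over full output distributions; `draft_model` and `target_model` here are assumed toy interfaces).

```python
def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One draft-and-verify step (greedy variant, for illustration only).

    Assumed interfaces:
      draft_model(tokens)  -> argmax next-token id for the given prefix
      target_model(tokens) -> list of argmax next-token ids, one per position,
                              from a single parallel forward pass
    """
    # 1) Draft: the small model proposes k tokens autoregressively (cheap).
    drafted = list(prefix)
    for _ in range(k):
        drafted.append(draft_model(drafted))

    # 2) Verify: one parallel forward pass of the target model over the drafted
    #    sequence; total compute matches k sequential target steps, but it is
    #    a single batched call instead of k dependent ones.
    target_preds = target_model(drafted[:-1])   # target_preds[i] predicts drafted[i + 1]

    # 3) Accept drafted tokens up to the first disagreement, then emit the
    #    target model's own token there, so every step yields at least one token.
    accepted = list(prefix)
    for i in range(len(prefix), len(drafted)):
        if drafted[i] == target_preds[i - 1]:
            accepted.append(drafted[i])
        else:
            accepted.append(target_preds[i - 1])
            break
    return accepted
```

EAGLE and Medusa keep essentially this verify step but change how drafts are produced: feature-level prediction on top of the target model, or multiple decoding heads whose candidate branches are verified with tree attention.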
2.2. Single-Node Optimization#
2.2.1. Attention#
- [2025.01] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- [2023.09] Efficient Memory Management for Large Language Model Serving with PagedAttention. PagedAttention: pages the KV cache like virtual memory; reports up to 22x higher throughput than naive batching and about 4x over FasterTransformer (a toy block-allocator sketch follows this list).
- [2023.08] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Chunked prefills.
- [2022.05] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. FlashAttention, roughly 2.4x speedup.
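A minimal sketch of the paging idea behind PagedAttention: KV-cache memory is carved into fixed-size blocks, each sequence keeps a block table mapping logical positions to physical blocks, and blocks are allocated on demand rather than reserved for the maximum sequence length. The class name, block size, and out-of-memory behavior are illustrative assumptions, not vLLM's actual API.

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.lengths = {}                            # seq_id -> number of cached tokens

    def append_token(self, seq_id: int):
        """Reserve cache space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())     # allocate one new physical block
        self.lengths[seq_id] = length + 1

    def free_sequence(self, seq_id: int):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                                  # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])                         # e.g. [7, 6, 5]
```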
2.2.2. FFN#
- [2023.06] MoE: An Efficient Mixture of Experts for Large Language Models
- GEMM: matrix-multiplication operator optimization
- DeepGEMM: FP8 GEMM optimization
- DeepEP: expert parallelism (an MoE top-k routing sketch follows this list)
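A minimal sketch of the top-k routing at the core of an MoE FFN layer, the step that expert parallelism (e.g. DeepEP-style dispatch) then distributes across devices. The gating scheme, dimensions, and names are illustrative assumptions, assuming PyTorch.

```python
import torch

def moe_forward(x, gate_w, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:        [num_tokens, hidden]
    gate_w:   [hidden, num_experts] router weights
    experts:  list of callables, each mapping [n, hidden] -> [n, hidden]
    """
    logits = x @ gate_w                                     # [tokens, experts]
    weights, idx = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        tokens, slot = (idx == e).nonzero(as_tuple=True)    # which tokens picked expert e
        if tokens.numel() == 0:
            continue
        out[tokens] += weights[tokens, slot].unsqueeze(-1) * expert(x[tokens])
    return out

hidden, num_experts = 64, 4
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
x = torch.randn(10, hidden)
gate_w = torch.randn(hidden, num_experts)
print(moe_forward(x, gate_w, experts).shape)                # torch.Size([10, 64])
```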
2.3. Distributed Optimization#
- 3FS: distributed file system
- [2022.12] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [2022.11] Efficiently Scaling Transformer Inference
- [2022.07] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Sequence parallelism for LLM inference
- Sequence parallelism: DeepSpeed-FPDT
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Prefill/decode (PD) disaggregation.
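To make the model-parallelism entry from the overview concrete in the distributed setting, here is a single-process numpy sketch of Megatron-style tensor parallelism for an MLP block: the first weight is split by columns, the second by the matching rows, each "device" computes its partial result independently, and one final sum plays the role of the all-reduce. Shapes and names are illustrative; the activation between the two matmuls is omitted for brevity.

```python
import numpy as np

hidden, ffn, world = 8, 32, 4
x = np.random.randn(2, hidden)           # [tokens, hidden], replicated on every "device"
w1 = np.random.randn(hidden, ffn)        # up projection
w2 = np.random.randn(ffn, hidden)        # down projection

# Reference: unsharded MLP.
ref = (x @ w1) @ w2

# Column-parallel w1: each device holds a slice of the output columns.
w1_shards = np.split(w1, world, axis=1)  # each [hidden, ffn / world]
# Row-parallel w2: each device holds the matching slice of the input rows.
w2_shards = np.split(w2, world, axis=0)  # each [ffn / world, hidden]

# Each device computes a partial result with no communication...
partials = [(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]
# ...and a single all-reduce (here: a plain sum) recovers the full output.
out = np.sum(partials, axis=0)

assert np.allclose(ref, out)
```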