1. LLM Inference Learning Resources
- LLM inference optimization techniques explained: inference acceleration methods
- GitHub: vLLM, a high-performance inference framework
- NVIDIA/FasterTransformer
- Analyzing Transformer models: parameter count, compute (FLOPs), intermediate activations, and KV cache
2. LLM Inference Optimization
LLM inference is driven by three concerns: latency, throughput, and cost. Optimizations fall into three buckets: model-level modifications, single-node optimization, and distributed optimization.
- Model-level modifications.
    - Quantization
    - Attention structure (MHA, MQA, MLA, sparse attention, linear attention)
    - FFN structure (MoE)
    - Other components (SiLU, RMSNorm)
    - Speculative decoding
- Single-node optimization. LLM inference is memory-bandwidth (IO) bound.
    - Operator fusion: QKV fusion, bias fusion (see the sketch after this list).
    - High-performance kernels: FlashAttention, optimized GEMM. This requires going down to the kernel level.
    - Memory management: continuous batching, PagedAttention.
- Distributed optimization.
    - Model parallelism: tensor parallelism, pipeline parallelism, expert parallelism
    - Data parallelism: ZeRO-3
    - Hardware specialization: disaggregating prefill and decode (generation) onto separate hardware.
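To make the operator-fusion item above concrete, here is a minimal PyTorch sketch of QKV fusion (class and parameter names are my own, not taken from any particular framework): the three projection matrices are stacked into one weight, so a single GEMM reads the hidden states from HBM once instead of three times.

```python
import torch
import torch.nn as nn

class FusedQKVProjection(nn.Module):
    """Illustrative QKV fusion: one GEMM instead of three separate projections."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # W_q, W_k, W_v stacked along the output dimension.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=True)

    def forward(self, x: torch.Tensor):
        # x: [batch, seq_len, hidden_size]
        qkv = self.qkv_proj(x)          # single fused GEMM (and fused bias add)
        q, k, v = qkv.chunk(3, dim=-1)  # split back into q, k, v
        return q, k, v

# usage sketch
proj = FusedQKVProjection(hidden_size=512)
q, k, v = proj(torch.randn(2, 16, 512))
```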
2.1. Model-Level Modifications
2.1.1. Quantization
- [2025.01] QRazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring
- [2023.10] LLM-FP4: 4-Bit Floating-Point Quantized Transformers
- [2023.09] Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs AutoRound
- [2023.06] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration AWQ
- [2022.10] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers GPTQ
- [2022.08] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
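For reference, the simplest baseline these papers improve on is round-to-nearest, per-channel symmetric int8 weight quantization. A minimal sketch (function names are illustrative, not taken from any of the libraries above):

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric round-to-nearest int8 quantization with one scale per output channel.

    weight: [out_features, in_features] float weight matrix.
    Returns the int8 weights and the per-channel scales needed to dequantize.
    """
    # absmax of each output row -> scale that maps the row into [-127, 127]
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

def dequantize_int8(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return q.float() * scales

w = torch.randn(4096, 11008)
q, s = quantize_per_channel_int8(w)
print((dequantize_int8(q, s) - w).abs().mean())  # mean quantization error
```

AWQ and GPTQ keep this basic quantize/dequantize shape but choose scales and rounding more carefully (activation-aware scaling in AWQ, Hessian-based error compensation in GPTQ).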
2.1.2. Attention Structure
2.1.3. Parallel Decoding
- [2024.01] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- [2024.01] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Adds several extra decoding heads, decodes multiple candidate tokens via top-k, and uses tree attention to decide which candidates to accept.
- [2023.10] Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy. Maintains multiple n-grams in a 2D window.
- [2022.11] Fast Inference from Transformers via Speculative Decoding. A small draft model proposes tokens and the large model verifies whether to accept them; total compute does not shrink, but verification is parallelized (see the sketch below).
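A sketch of the draft-and-verify loop behind speculative decoding. For clarity it uses greedy acceptance rather than the rejection-sampling rule from the paper, and it assumes both models are callables mapping token ids of shape [1, T] to logits of shape [1, T, V]:

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix: torch.Tensor, k: int = 4):
    """One draft-and-verify step (greedy variant, for illustration only)."""
    # 1. Draft: the small model proposes k tokens autoregressively.
    draft_seq = prefix
    for _ in range(k):
        next_tok = draft_model(draft_seq)[:, -1, :].argmax(-1, keepdim=True)
        draft_seq = torch.cat([draft_seq, next_tok], dim=-1)

    # 2. Verify: a single parallel forward pass of the large model over all drafts.
    target_pred = target_model(draft_seq).argmax(-1)      # [1, T + k]

    # 3. Accept the longest prefix where the draft matches the target's own choice.
    n_prefix = prefix.shape[1]
    accepted = prefix
    for i in range(k):
        proposed = draft_seq[:, n_prefix + i]
        expected = target_pred[:, n_prefix + i - 1]       # target's token for this position
        if not torch.equal(proposed, expected):
            # replace the first mismatch with the target's token and stop
            return torch.cat([accepted, expected.unsqueeze(-1)], dim=-1)
        accepted = torch.cat([accepted, proposed.unsqueeze(-1)], dim=-1)
    return accepted

# usage sketch with dummy "models" returning random logits over a toy vocab of 100
dummy = lambda ids: torch.randn(1, ids.shape[1], 100)
prefix = torch.randint(0, 100, (1, 5))
print(speculative_step(dummy, dummy, prefix).shape)
```

The key point is step 2: all k drafted tokens are scored by the large model in one forward pass, so wall-clock latency drops even though total compute stays the same.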
2.2. Single-Node Optimization
2.2.1. Attention
- [2023.09] Efficient Memory Management for Large Language Model Serving with PagedAttention. PagedAttention manages the KV cache with virtual-memory-style paging; the note reports up to 22x higher throughput than naive batching and about 4x over FasterTransformer.
- [2023.08] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Chunked prefills.
- [2022.05] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. FlashAttention, with a reported speedup of roughly 2.4x.
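A toy sketch of the bookkeeping idea behind PagedAttention: the KV cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping logical block indices to physical block ids, so memory grows on demand instead of reserving max_seq_len per request. The real implementation lives in vLLM's CUDA kernels; the class below is only an illustration, and all names are my own.

```python
class PagedKVCacheAllocator:
    """Toy block allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block table

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's KV entry is written."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block, offset = divmod(position, self.block_size)
        if logical_block == len(table):               # sequence needs a new block
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        return table[logical_block], offset

    def free(self, seq_id: int) -> None:
        """Release all blocks of a finished sequence back to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

# usage sketch: 40 tokens occupy 3 blocks of 16, allocated lazily
alloc = PagedKVCacheAllocator(num_blocks=4, block_size=16)
for pos in range(40):
    block, off = alloc.append_token(seq_id=0, position=pos)
alloc.free(0)
```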
2.2.2. FFN
- [2023.06] MoE: An Efficient Mixture of Experts for Large Language Models
- GEMM: matrix-multiplication kernel optimization
- DeepGEMM: FP8 GEMM kernel optimization
- DeepEP: expert parallelism
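A minimal sketch of top-k expert routing, the core idea of an MoE FFN layer (illustrative only, not DeepSeek/DeepEP code; sizes and names are arbitrary). Expert parallelism then places the experts of such a layer on different devices and moves tokens between them with all-to-all communication, which is the traffic pattern DeepEP targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed mixture-of-experts FFN (single-device sketch)."""

    def __init__(self, hidden: int, ffn: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, hidden]
        gate = F.softmax(self.router(x), dim=-1)               # routing probabilities
        weights, expert_idx = gate.topk(self.top_k, dim=-1)    # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e_id, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e_id             # tokens routed to this expert
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# usage sketch
moe = TopKMoE(hidden=64, ffn=256)
y = moe(torch.randn(10, 64))
```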