1. LLM Inference Learning Resources#
- A detailed guide to LLM inference optimization techniques (inference acceleration methods)
- GitHub: vLLM, a high-performance inference framework
- NVIDIA/FasterTransformer
- Analyzing Transformer models: parameter count, FLOPs, intermediate activations, and KV cache
LLM Deployment Frameworks#
- [54.7k] vllm
- [16.8k] sglang
- [150k] ollama
- [84.5k] llama.cpp
- [21.9k] mlx
- [14.8k] ktransformers
- [0] lmstudio
2. LLM Inference Optimization#
LLM inference is judged on latency, throughput, and cost. Optimizations fall into three groups: changing the model itself, single-node optimization, and distributed optimization.
- Changing the model.
    - Quantization
    - Attention structure (MHA, MQA, MLA, sparse attention, linear attention)
    - FFN structure (MoE)
    - Other components (SiLU, RMSNorm)
    - Speculative decoding.
- Single-node optimization. LLM inference is IO (memory-bandwidth) bound.
    - Operator fusion: QKV fusion, bias fusion (a QKV-fusion sketch follows this list).
    - High-performance operators: FlashAttention, high-performance matrix multiplication (GEMM). Requires working at the kernel level.
    - Memory management: continuous batching, PagedAttention.
- Distributed optimization.
    - Model parallelism: tensor parallelism, pipeline parallelism, expert parallelism.
    - Data parallelism: ZeRO-3.
    - Hardware specialization: separating prefill and decode (generation).
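A minimal sketch of the QKV-fusion idea from the operator-fusion bullet above, assuming PyTorch and toy dimensions (all names and sizes here are illustrative, not tied to any particular framework): concatenating the three projection weights turns three small GEMMs and kernel launches into one larger GEMM.

```python
import torch

hidden, n_heads, head_dim = 1024, 16, 64           # illustrative sizes
x = torch.randn(8, hidden)                          # [batch, hidden]
w_q = torch.randn(hidden, n_heads * head_dim)
w_k = torch.randn(hidden, n_heads * head_dim)
w_v = torch.randn(hidden, n_heads * head_dim)

# Unfused: three separate projections, three kernel launches.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused: concatenate the weights once at load time, run a single GEMM, split the result.
w_qkv = torch.cat([w_q, w_k, w_v], dim=1)           # [hidden, 3 * n_heads * head_dim]
q2, k2, v2 = (x @ w_qkv).split(n_heads * head_dim, dim=1)

assert torch.allclose(q, q2, atol=1e-5) and torch.allclose(v, v2, atol=1e-5)
```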
2.1. Modifying Model Parameters#
2.1.1. Quantization#
Papers#
- [2025.01] QRazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring
- [2024.08] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- [2023.10] LLM-FP4: 4-Bit Floating-Point Quantized Transformers
- [2023.09] Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs AutoRound
- [2023.06] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration AWQ
- [2022.10] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers GPTQ
- [2022.08] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tools#
- [1.8k] vllm-project/llm-compressor
- [730] ModelCloud/GPTQModel
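As a baseline for the post-training quantization papers above, here is a minimal sketch of symmetric per-channel int8 round-to-nearest weight quantization; GPTQ, AWQ, and AutoRound all improve on this baseline (error compensation, activation-aware scaling, learned rounding). Function names and shapes are illustrative assumptions.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization (round-to-nearest baseline).

    w: [out_features, in_features] float32 weight matrix.
    Returns int8 weights plus one float scale per output channel.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)            # [out, 1]
    scale = np.maximum(max_abs / 127.0, 1e-8)                 # map max |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```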
2.1.2. Attention Structure#
2.1.3. Parallel Decoding#
- [2025.03] EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. 5.6x speedup.
- [2024.01] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. Survey.
- [2024.01] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Drafts at the feature (hidden-state) level.
- [2024.01] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Adds several decoding heads, proposes multiple candidate tokens via top-k, and uses tree attention to decide which to accept.
- [2023.10] Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy. Maintains multiple n-grams in a 2D window.
- [2022.11] Fast Inference from Transformers via Speculative Decoding. A small draft model proposes tokens and the large model verifies whether to accept them; total compute is unchanged, but the verification is parallelized (sketched after this list).
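The draft-then-verify loop shared by the papers above, as a minimal greedy sketch (the original speculative-decoding paper uses a stochastic accept/reject rule over full output distributions; `draft_model` and `target_model` here are assumed toy interfaces).

```python
def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One draft-and-verify step (greedy variant, for illustration only).

    Assumed interfaces:
      draft_model(tokens)  -> argmax next-token id for the given prefix
      target_model(tokens) -> list of argmax next-token ids, one per position,
                              from a single parallel forward pass
    """
    # 1) Draft: the small model proposes k tokens autoregressively (cheap).
    drafted = list(prefix)
    for _ in range(k):
        drafted.append(draft_model(drafted))

    # 2) Verify: one parallel forward pass of the target model over the drafted
    #    sequence; total compute matches k sequential target steps, but it is
    #    a single batched call instead of k dependent ones.
    target_preds = target_model(drafted[:-1])   # target_preds[i] predicts drafted[i + 1]

    # 3) Accept drafted tokens up to the first disagreement, then emit the
    #    target model's own token there, so every step yields at least one token.
    accepted = list(prefix)
    for i in range(len(prefix), len(drafted)):
        if drafted[i] == target_preds[i - 1]:
            accepted.append(drafted[i])
        else:
            accepted.append(target_preds[i - 1])
            break
    return accepted
```

EAGLE and Medusa keep essentially this verify step but change how drafts are produced: feature-level prediction on top of the target model, or multiple decoding heads whose candidate branches are verified with tree attention.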
2.2. Single-Node Optimization#
2.2.1. Attention#
- [2025.01] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- [2023.09] Efficient Memory Management for Large Language Model Serving with PagedAttention. PagedAttention: pages the KV cache like virtual memory; reports up to 22x higher throughput than naive batching and about 4x over FasterTransformer (a toy block-allocator sketch follows this list).
- [2023.08] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Chunked prefills.
- [2022.05] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. FlashAttention, roughly 2.4x speedup.
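A minimal sketch of the paging idea behind PagedAttention: KV-cache memory is carved into fixed-size blocks, each sequence keeps a block table mapping logical positions to physical blocks, and blocks are allocated on demand rather than reserved for the maximum sequence length. The class name, block size, and out-of-memory behavior are illustrative assumptions, not vLLM's actual API.

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.lengths = {}                            # seq_id -> number of cached tokens

    def append_token(self, seq_id: int):
        """Reserve cache space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())     # allocate one new physical block
        self.lengths[seq_id] = length + 1

    def free_sequence(self, seq_id: int):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                                  # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])                         # e.g. [7, 6, 5]
```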
2.2.2. FFN#
- [2023.06] MoE: An Efficient Mixture of Experts for Large Language Models
- GEMM: matrix-multiplication operator optimization
- DeepGEMM: FP8 GEMM optimization
- DeepEP: expert parallelism (an MoE top-k routing sketch follows this list)
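A minimal sketch of the top-k routing at the core of an MoE FFN layer, the step that expert parallelism (e.g. DeepEP-style dispatch) then distributes across devices. The gating scheme, dimensions, and names are illustrative assumptions, assuming PyTorch.

```python
import torch

def moe_forward(x, gate_w, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:        [num_tokens, hidden]
    gate_w:   [hidden, num_experts] router weights
    experts:  list of callables, each mapping [n, hidden] -> [n, hidden]
    """
    logits = x @ gate_w                                     # [tokens, experts]
    weights, idx = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        tokens, slot = (idx == e).nonzero(as_tuple=True)    # which tokens picked expert e
        if tokens.numel() == 0:
            continue
        out[tokens] += weights[tokens, slot].unsqueeze(-1) * expert(x[tokens])
    return out

hidden, num_experts = 64, 4
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
x = torch.randn(10, hidden)
gate_w = torch.randn(hidden, num_experts)
print(moe_forward(x, gate_w, experts).shape)                # torch.Size([10, 64])
```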
2.3. Distributed Optimization#
- 3FS: distributed file system
- [2022.12] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [2022.11] Efficiently Scaling Transformer Inference
- [2022.07] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Sequence parallelism for LLM inference
- Sequence parallelism: DeepSpeed-FPDT
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Prefill/decode (PD) disaggregation.
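To make the model-parallelism entry from the overview concrete in the distributed setting, here is a single-process numpy sketch of Megatron-style tensor parallelism for an MLP block: the first weight is split by columns, the second by the matching rows, each "device" computes its partial result independently, and one final sum plays the role of the all-reduce. Shapes and names are illustrative; the activation between the two matmuls is omitted for brevity.

```python
import numpy as np

hidden, ffn, world = 8, 32, 4
x = np.random.randn(2, hidden)           # [tokens, hidden], replicated on every "device"
w1 = np.random.randn(hidden, ffn)        # up projection
w2 = np.random.randn(ffn, hidden)        # down projection

# Reference: unsharded MLP.
ref = (x @ w1) @ w2

# Column-parallel w1: each device holds a slice of the output columns.
w1_shards = np.split(w1, world, axis=1)  # each [hidden, ffn / world]
# Row-parallel w2: each device holds the matching slice of the input rows.
w2_shards = np.split(w2, world, axis=0)  # each [ffn / world, hidden]

# Each device computes a partial result with no communication...
partials = [(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]
# ...and a single all-reduce (here: a plain sum) recovers the full output.
out = np.sum(partials, axis=0)

assert np.allclose(ref, out)
```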