LLM Notes
Data

Theory#

  • [2026.01] From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

Data Cleaning#

  • [2025.04] Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models Four-dimensional quality assessment (PRRC); Meta-rater trains several small proxy models to score data along multiple dimensions, then selects the data with the highest overall quality.
  • [2024.02] Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation Scores QA pairs with a reward model and ranks them; PCA for dimensionality reduction, k-means for clustering.
  • [2023.12] What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning Deita: complexity, quality, and diversity. Uses GPT to score instructions and QA pairs for complexity and quality, and embedding similarity (emb_sim) to measure diversity.
    • Paper walkthrough: how to automatically select SFT data
    • High-quality data selection and fine-tuning methods for LLMs
  • [2023.08] InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models InsTag: an LLM assigns open-set semantic tags to each instruction; tag count serves as a complexity proxy and tag coverage as a diversity proxy.
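The cleaning recipes above share a pattern: score each example with a judge or reward model, then enforce diversity in embedding space before keeping the top of the ranking. A minimal numpy sketch of that Deita-style greedy filter (the function name, the pre-combined score, and the similarity threshold are illustrative assumptions, not taken from any of the papers):

```python
import numpy as np

def select_sft_data(scores, embeddings, k, sim_threshold=0.9):
    """Score-first, diversity-filtered selection (Deita-style sketch).

    scores:     (N,) combined quality/complexity score per example
                (assumed already computed, e.g. by a GPT judge or RM).
    embeddings: (N, d) sentence embeddings of the examples.

    Walk examples from highest to lowest score; keep one only if its
    max cosine similarity to the already-selected set is below the
    threshold. Stop once k examples are kept.
    """
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = np.argsort(-scores)  # best score first
    selected = []
    for i in order:
        if len(selected) >= k:
            break
        if not selected or np.max(emb[selected] @ emb[i]) < sim_threshold:
            selected.append(i)
    return selected
```

With `sim_threshold=1.1` the filter is disabled and this degenerates to plain top-k by score, which makes it easy to ablate the diversity step.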

Active Learning#

  • [2024.05] ActiveLLM: Large Language Model-based Active Learning for Textual Few-Shot Scenarios

  • ALiPy: Active Learning in Python

    • https://parnec.nuaa.edu.cn/huangsj/alipy/
  • baifanxxx/awesome-active-learning
  • modAL-python/modAL
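The pool-based active-learning libraries listed above (e.g. modAL, ALiPy) commonly build on uncertainty sampling: query the unlabeled examples whose predicted class distribution has the highest entropy. A minimal sketch of that query step, independent of any library (names are illustrative):

```python
import numpy as np

def uncertainty_query(probs, batch_size):
    """Entropy-based uncertainty sampling over an unlabeled pool.

    probs: (N, C) predicted class probabilities for the pool.
    Returns indices of the batch_size highest-entropy examples,
    i.e. the ones the current model is least sure about.
    """
    eps = 1e-12  # guard against log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(-entropy)[:batch_size]
```

In a full loop, the returned indices are sent to an annotator, the new labels are added to the training set, the model is retrained, and the query repeats until the labeling budget runs out.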