# LLM Evaluation
## Common Benchmarks
- Chatbot Arena: crowd-sourced pairwise battles aggregated into a leaderboard (see the Elo sketch after this list)
  - https://lmarena.ai/
  - https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
- Reasoning & knowledge
  - Humanity's Last Exam: https://agi.safe.ai/
  - Visual reasoning (MMMU): https://mmmu-benchmark.github.io/
  - Science (GPQA): https://github.com/idavidrein/gpqa
  - Math (AIME): https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
- Code
  - Code generation (LiveCodeBench): https://livecodebench.github.io/
  - Code editing (Aider leaderboard): https://aider.chat/docs/leaderboards/
  - Agentic coding (SWE-bench): https://www.swebench.com/
- Factuality
  - SimpleQA: https://openai.com/index/introducing-simpleqa/ (harness: https://github.com/openai/simple-evals/)
- Image understanding (Vibe-Eval): https://github.com/reka-ai/reka-vibe-eval
- Long context
  - Multi-turn consistency: https://arxiv.org/html/2409.12640v2
- Multilingual
  - Global-MMLU: https://huggingface.co/datasets/CohereForAI/Global-MMLU (loading sketch after this list)
- Open LLM Leaderboard (open_llm_leaderboard)
- SuperCLUE overall leaderboard [link]
- Text-to-Video Generation on MSR-VTT [link]
- Video Generation on UCF-101 [link]
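
Chatbot Arena ranks models from crowd-sourced pairwise votes. As a rough illustration of how such votes become a leaderboard, here is a minimal online-Elo sketch; note that LMArena's published methodology has reportedly moved to Bradley-Terry MLE fitting, and the `battles` tuples below are a hypothetical schema, not the real vote logs.

```python
from collections import defaultdict

K = 4.0          # update step size; a small K smooths noisy crowd votes
BASE = 1000.0    # initial rating assigned to every model

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-model probability that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_ratings(battles):
    """battles: iterable of (model_a, model_b, winner) tuples, where
    winner is 'a', 'b', or 'tie'. Returns {model: rating}."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in battles:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

if __name__ == "__main__":
    battles = [
        ("model-x", "model-y", "a"),
        ("model-x", "model-z", "tie"),
        ("model-y", "model-z", "b"),
    ]
    for model, rating in sorted(elo_ratings(battles).items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rating:.1f}")
```

Because the updates are order-dependent, production leaderboards typically fit all battles jointly (Bradley-Terry) or average Elo over shuffled replays.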
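
Global-MMLU ships as an ordinary Hugging Face dataset, so an eval run usually starts with `datasets.load_dataset`. A minimal loading-and-prompting sketch follows; it assumes the language-code config names ("en", "zh", ...) and the `question` / `option_a`..`option_d` / `answer` columns described on the dataset card, so verify both against the Hub before relying on them.

```python
from datasets import load_dataset

# Assumptions (check the dataset card): configs are language codes such as
# "en" or "zh", the eval split is "test", and each row carries `question`,
# `option_a`..`option_d`, and an `answer` letter.
ds = load_dataset("CohereForAI/Global-MMLU", "en", split="test")

def format_prompt(row) -> str:
    # Plain zero-shot multiple-choice prompt; a real harness would add
    # few-shot examples and robust answer extraction.
    options = "\n".join(
        f"{letter}. {row[f'option_{letter.lower()}']}" for letter in "ABCD"
    )
    return f"{row['question']}\n{options}\nAnswer:"

example = ds[0]
print(format_prompt(example))
print("gold:", example["answer"])
```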
## Role-Playing
- [2025.02] CoSER: Coordinating LLM-Based Persona Simulation of Established Roles (~18K characters, ~30K conversations)
- [2024.08] MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents (multimodal role-play evaluation: 85 characters, 11K images, and 14K single- or multi-turn dialogues)
- [2024.01] Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment (4K characters, 36K dialogues)
- [2024.01] CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation (Chinese role-play evaluation: 77 characters, 1,785 multi-turn dialogues)
- [2023.12] RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models (Chinese-English bilingual role evaluation: 300 characters, 6,000 questions)
- [2023.10] RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
- SuperCLUE-Role: redefining the benchmark for Chinese role-play LLM evaluation
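
Benchmarks in this family must score free-form, in-character replies, and they typically do so either with a trained reward model (e.g. CharacterEval's CharacterRM) or by asking a strong LLM judge to grade along rubric dimensions. Below is a minimal sketch of the judge pattern; the rubric, prompt, and `query_judge` callable are illustrative placeholders, not any benchmark's official protocol.

```python
import json
from typing import Callable, Dict

# Illustrative rubric; each benchmark defines its own dimensions.
RUBRIC = ("character_consistency", "conversational_ability", "knowledge_accuracy")

JUDGE_TEMPLATE = """You are grading a role-play reply.

Character profile:
{profile}

Dialogue context:
{context}

Candidate reply:
{reply}

Rate each dimension from 1 (poor) to 5 (excellent) and answer with JSON only:
{{"character_consistency": ..., "conversational_ability": ..., "knowledge_accuracy": ...}}"""

def judge_reply(profile: str, context: str, reply: str,
                query_judge: Callable[[str], str]) -> Dict[str, int]:
    """query_judge is a placeholder for whatever judge-LLM client you use:
    it takes a prompt string and returns the model's raw text completion."""
    prompt = JUDGE_TEMPLATE.format(profile=profile, context=context, reply=reply)
    raw = query_judge(prompt)
    scores = json.loads(raw)  # a production harness parses defensively
    return {dim: int(scores[dim]) for dim in RUBRIC}
```

In practice one would average several judge samples per reply and validate the judge against human annotations, which is the gap the trained reward-model variants try to close.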