# LLM Evaluation
## Common Benchmarks
- Chatbot Arena: crowd-sourced pairwise battles aggregated into a leaderboard (see the Elo sketch after this list)
  - https://lmarena.ai/
  - https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
- Reasoning & knowledge
  - Humanity's Last Exam: https://agi.safe.ai/
  - Visual reasoning (MMMU): https://mmmu-benchmark.github.io/
  - Science (GPQA): https://github.com/idavidrein/gpqa
  - Math (AIME): https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
- Code
  - Code generation (LiveCodeBench): https://livecodebench.github.io/
  - Code editing (Aider leaderboard): https://aider.chat/docs/leaderboards/
  - Agentic coding (SWE-bench): https://www.swebench.com/
- Factuality
  - SimpleQA: https://openai.com/index/introducing-simpleqa/ (harness: https://github.com/openai/simple-evals/)
- Image understanding (Vibe-Eval): https://github.com/reka-ai/reka-vibe-eval
- Long context
  - Multi-turn consistency: https://arxiv.org/html/2409.12640v2
- Multilingual
  - Global-MMLU: https://huggingface.co/datasets/CohereForAI/Global-MMLU (loading sketch after this list)
- Open LLM Leaderboard (open_llm_leaderboard)
- SuperCLUE overall leaderboard [link]
- Text-to-Video Generation on MSR-VTT [link]
- Video Generation on UCF-101 [link]
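
Chatbot Arena ranks models from crowd-sourced pairwise votes. As a rough illustration of how such votes become a leaderboard, here is a minimal online-Elo sketch; note that LMArena's published methodology has reportedly moved to Bradley-Terry MLE fitting, and the `battles` tuples below are a hypothetical schema, not the real vote logs.

```python
from collections import defaultdict

K = 4.0          # update step size; a small K smooths noisy crowd votes
BASE = 1000.0    # initial rating assigned to every model

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-model probability that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_ratings(battles):
    """battles: iterable of (model_a, model_b, winner) tuples, where
    winner is 'a', 'b', or 'tie'. Returns {model: rating}."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in battles:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

if __name__ == "__main__":
    battles = [
        ("model-x", "model-y", "a"),
        ("model-x", "model-z", "tie"),
        ("model-y", "model-z", "b"),
    ]
    for model, rating in sorted(elo_ratings(battles).items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rating:.1f}")
```

Because the updates are order-dependent, production leaderboards typically fit all battles jointly (Bradley-Terry) or average Elo over shuffled replays.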
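
Global-MMLU ships as an ordinary Hugging Face dataset, so an eval run usually starts with `datasets.load_dataset`. A minimal loading-and-prompting sketch follows; it assumes the language-code config names ("en", "zh", ...) and the `question` / `option_a`..`option_d` / `answer` columns described on the dataset card, so verify both against the Hub before relying on them.

```python
from datasets import load_dataset

# Assumptions (check the dataset card): configs are language codes such as
# "en" or "zh", the eval split is "test", and each row carries `question`,
# `option_a`..`option_d`, and an `answer` letter.
ds = load_dataset("CohereForAI/Global-MMLU", "en", split="test")

def format_prompt(row) -> str:
    # Plain zero-shot multiple-choice prompt; a real harness would add
    # few-shot examples and robust answer extraction.
    options = "\n".join(
        f"{letter}. {row[f'option_{letter.lower()}']}" for letter in "ABCD"
    )
    return f"{row['question']}\n{options}\nAnswer:"

example = ds[0]
print(format_prompt(example))
print("gold:", example["answer"])
```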
## Role-Playing
- [2025.02] CoSER: Coordinating LLM-Based Persona Simulation of Established Roles (~18K characters, ~30K conversations)
- [2024.08] MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents (multimodal role-play evaluation: 85 characters, 11K images, and 14K single- or multi-turn dialogues)
- [2024.01] Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment (4K characters, 36K dialogues)
- [2024.01] CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation (Chinese role-play evaluation: 77 characters, 1,785 multi-turn dialogues)
- [2023.12] RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models (Chinese-English bilingual role evaluation: 300 characters, 6,000 questions)
- [2023.10] RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
- SuperCLUE-Role: redefining the benchmark for Chinese role-play LLM evaluation
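
Benchmarks in this family must score free-form, in-character replies, and they typically do so either with a trained reward model (e.g. CharacterEval's CharacterRM) or by asking a strong LLM judge to grade along rubric dimensions. Below is a minimal sketch of the judge pattern; the rubric, prompt, and `query_judge` callable are illustrative placeholders, not any benchmark's official protocol.

```python
import json
from typing import Callable, Dict

# Illustrative rubric; each benchmark defines its own dimensions.
RUBRIC = ("character_consistency", "conversational_ability", "knowledge_accuracy")

JUDGE_TEMPLATE = """You are grading a role-play reply.

Character profile:
{profile}

Dialogue context:
{context}

Candidate reply:
{reply}

Rate each dimension from 1 (poor) to 5 (excellent) and answer with JSON only:
{{"character_consistency": ..., "conversational_ability": ..., "knowledge_accuracy": ...}}"""

def judge_reply(profile: str, context: str, reply: str,
                query_judge: Callable[[str], str]) -> Dict[str, int]:
    """query_judge is a placeholder for whatever judge-LLM client you use:
    it takes a prompt string and returns the model's raw text completion."""
    prompt = JUDGE_TEMPLATE.format(profile=profile, context=context, reply=reply)
    raw = query_judge(prompt)
    scores = json.loads(raw)  # a production harness parses defensively
    return {dim: int(scores[dim]) for dim in RUBRIC}
```

In practice one would average several judge samples per reply and validate the judge against human annotations, which is the gap the trained reward-model variants try to close.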