Large Model Evaluation

LMArena

LMArena is an open-source platform developed by the SkyLab team at UC Berkeley that focuses on large language model evaluation. Users can interact with more than 70 AI models through anonymous battles or direct side-by-side comparison and vote on the results, and the platform generates real-time leaderboards based on the Elo rating system. The platform has collected more than 2.8 million community votes, providing a transparent, neutral reference on model performance for researchers, developers, and general users.
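
For readers unfamiliar with Elo-style leaderboards, the sketch below illustrates the general rating update such systems use; the K-factor, starting rating, and tie handling are illustrative assumptions, not LMArena's actual parameters.

```python
# Minimal Elo update sketch. The K-factor, initial rating, and tie handling
# are illustrative assumptions, not LMArena's actual settings.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins one anonymous battle.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```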

April 15, 2026 · 409 views · 0 favorites

arize.com

The Arize AI platform focuses on observability for AI and machine learning, helping teams monitor, debug, and optimize AI models and large language models in production. It provides real-time monitoring, performance tracking, and LLM evaluation, and supports a wide range of model types and mainstream providers across industries including finance, e-commerce, and autonomous driving.

April 15, 2026 · 370 views · 0 favorites

Open LLM Leaderboard

Open LLM Leaderboard is a standardized evaluation platform on Hugging Face for tracking, ranking, and comparing the performance of open-source large language models and chatbots. It serves researchers, developers, and community users by providing transparent, reproducible evaluation results through unified benchmarks (e.g. MMLU, HellaSwag). The platform supports model submission, public access to the data, and community discussion; although it was officially retired in March 2025, its historical data and evaluation methodology remain a useful reference.

April 15, 2026 · 319 views · 0 favorites

MMLU

The MMLU benchmark page on the Papers with Code platform tracks the latest model rankings for Massive Multitask Language Understanding in real time. The page shows the accuracy of models such as GPT and LLaMA across 57 subject tasks and links to papers with code, making it a core tool for researchers and developers following the state of the art in AI language understanding.
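
As a rough illustration of the kind of score such leaderboards report, the sketch below computes per-subject and macro-averaged multiple-choice accuracy; the record layout and subject names are made up for the example and are not the Papers with Code data schema.

```python
from collections import defaultdict

# Hypothetical records: (subject, predicted_choice, gold_choice).
# The field layout is illustrative; real MMLU data comes as question/choices/answer rows.
records = [
    ("college_physics", "B", "B"),
    ("college_physics", "C", "A"),
    ("world_history", "D", "D"),
]

per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
for subject, pred, gold in records:
    per_subject[subject][0] += int(pred == gold)
    per_subject[subject][1] += 1

subject_acc = {s: correct / total for s, (correct, total) in per_subject.items()}
macro_acc = sum(subject_acc.values()) / len(subject_acc)  # averaged over subjects
print(subject_acc, macro_acc)
```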

April 15, 2026 · 487 views · 0 favorites

Anyscale

Anyscale is an AI platform created by the developers of the Ray framework, focused on running and scaling machine learning and AI workloads. It provides fully managed services spanning data processing, model training, and production inference, helping developers and enterprise teams scale seamlessly from a laptop to distributed computation across thousands of nodes. The platform integrates a cloud IDE, performance optimization, and cost-governance tools for large-scale AI deployments across industries including finance, technology, and media.
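
Since Anyscale is built on Ray, a minimal Ray snippet gives a feel for the programming model it scales; this example runs on a local Ray runtime and the workload is a placeholder.

```python
import ray

ray.init()  # starts a local Ray runtime; on Anyscale this would attach to a managed cluster

@ray.remote
def score(batch_id: int) -> int:
    # Placeholder workload; a real job might run preprocessing or model inference here.
    return batch_id * batch_id

# Fan the tasks out across available workers and collect the results.
futures = [score.remote(i) for i in range(8)]
print(ray.get(futures))  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```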

April 15, 2026 · 335 views · 0 favorites

AI Ping

AI Ping is a platform focused on performance evaluation of large-model API services, providing real-time, objective data on key metrics such as first-token latency, end-to-end latency, and throughput. The platform covers mainstream model providers and models in China and abroad, and supports chart comparison, data visualization, and historical tracking to help developers, enterprise teams, and researchers make decisions on model selection, performance monitoring, and cost optimization.
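
For context on what these metrics measure, the sketch below times first-token latency, total latency, and a rough throughput figure against an OpenAI-compatible streaming endpoint; the base URL, API key, and model name are placeholders, and this is not AI Ping's own measurement methodology.

```python
import time
from openai import OpenAI  # any OpenAI-compatible SDK/endpoint works for this kind of probe

# Placeholder endpoint, key, and model name; substitute the provider being measured.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # moment the first content chunk arrives
        chunks += 1  # chunk count is only a rough proxy for generated tokens

elapsed = time.perf_counter() - start
print("first-token latency (s):", first_token_at - start)
print("total latency (s):", elapsed)
print("throughput (chunks/s):", chunks / elapsed)
```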

April 15, 2026 · 511 views · 0 favorites

AGI-Eval Evaluation Community

AGI-Eval is a large-model evaluation community jointly launched by Shanghai Jiao Tong University, Tongji University, East China Normal University, and DataWhale. The platform provides authoritative model capability leaderboards, rich evaluation datasets, human-AI collaborative competitions, and a Data Studio, aiming to measure the overall performance of AI models along dimensions such as comprehension, reasoning, and knowledge through a scientific, transparent evaluation system, and to provide evaluation support for researchers and developers.

April 15, 2026 · 444 views · 0 favorites

OpenCompass Sinan - Leaderboard

OpenCompass LLM Leaderboard is an open-source evaluation platform for large language models, providing benchmarks over 100+ datasets that cover dimensions such as knowledge, logic, math, and code. The leaderboard is updated in real time to show the overall performance rankings of open-source and commercial models such as GPT-4, Claude, and Qwen, giving researchers and developers an objective reference for model selection.

April 15, 2026 · 420 views · 0 favorites

PinchBench

PinchBench is an evaluation platform for the agent capabilities of large models, developed by the Kilo AI team, focusing on how well models execute real tasks under the OpenClaw framework. The platform quantitatively ranks mainstream models along three dimensions: success rate, speed, and cost; provides openly available, real-time data; and helps developers with model selection, making it a useful reference in agent development.
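
PinchBench's exact scoring formula is not described here; purely as an illustration of how success rate, speed, and cost can be combined into one ranking, the sketch below uses min-max normalization and arbitrary weights.

```python
# Illustrative composite ranking over three dimensions; the weights, normalization,
# and model entries are assumptions for the example, not PinchBench's formula.
models = {
    "model-a": {"success_rate": 0.82, "seconds_per_task": 40.0, "usd_per_task": 0.12},
    "model-b": {"success_rate": 0.74, "seconds_per_task": 22.0, "usd_per_task": 0.05},
}

def normalize(values, higher_is_better):
    """Min-max scale a list of numbers to [0, 1], flipping when lower is better."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    scaled = [(v - lo) / span for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

names = list(models)
succ = normalize([models[m]["success_rate"] for m in names], higher_is_better=True)
speed = normalize([models[m]["seconds_per_task"] for m in names], higher_is_better=False)
cost = normalize([models[m]["usd_per_task"] for m in names], higher_is_better=False)

# Arbitrary 50/25/25 weighting of success rate, speed, and cost.
scores = {m: 0.5 * s + 0.25 * v + 0.25 * c for m, s, v, c in zip(names, succ, speed, cost)}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```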

April 15, 2026 · 327 views · 0 favorites

Prompt Llama

Prompt Llama is an online tool focused on text-to-image prompt generation and model performance testing. It lets users create high-quality prompts and compare how different AI image models (e.g. AlbedoBase XL, AuraFlow) render the same prompt. The platform suits artists, designers, developers, and researchers for creative inspiration, model evaluation, and prompt optimization. The website is based in London and offers an intuitive interface and contact information.

April 15, 2026 · 380 views · 0 favorites