
AI Resource Categories

Large Language Model Evaluation

11 resources in total


LMArena

LMArena is an open-source platform developed by the SkyLab team at UC Berkeley that focuses on large-scale language model evaluation. Users can interact with more than 70 AI models and vote through anonymous matchmaking or direct comparison, and the platform generates real-time leaderboards based on the Elo rating system. The platform has collected more than 2.8 million community votes, providing a transparent and neutral model performance reference for researchers, developers and general users.

April 15, 2026 · 407 views, 0 favorites
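The Elo-style ranking described above can be sketched as follows. This is a minimal illustration of how pairwise vote outcomes update ratings; the K-factor, initial ratings, and model names are illustrative assumptions, not LMArena's actual parameters.

```python
# Minimal sketch of Elo-style leaderboard updates from pairwise votes.
# K-factor and starting ratings are illustrative assumptions.

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    """Apply one vote: the winner gains what the loser gives up."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update(ratings, "model_a", "model_b")
# model_a's rating is now above model_b's; the total rating mass is unchanged
```

Arena-style platforms aggregate millions of such votes, so individual noisy judgments average out into a stable ranking.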

Open LLM Leaderboard

Open LLM Leaderboard is a standardized evaluation platform on Hugging Face for tracking, ranking, and comparing the performance of open-source large language models and chatbots. It serves researchers, developers, and community users by providing transparent, reproducible evaluation results through unified benchmarks (e.g. MMLU, HellaSwag). The platform supports model submission, public access to data, and community discussion; although it was officially retired in March 2025, its historical data and evaluation methodology remain informative.

April 15, 2026 · 319 views, 0 favorites

MMLU

The MMLU benchmark page on the Papers with Code platform tracks the latest model performance rankings for massive multitask language understanding in real time. The page displays the accuracy of models such as GPT and LLaMA on 57 disciplinary tasks, provides links to papers with code, and is a core tool for researchers and developers tracking cutting-edge advances in AI language understanding.

April 15, 2026 · 485 views, 0 favorites
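An MMLU-style overall score is commonly reported as the macro-average of per-task accuracies, weighting each of the 57 tasks equally. The sketch below illustrates that aggregation; the task names and accuracy values are hypothetical, and actual leaderboards may aggregate differently (e.g. micro-averaging over all questions).

```python
# Sketch of an MMLU-style overall score: the macro-average of per-task
# accuracies. Task names and values are hypothetical, for illustration only.

def macro_average(task_accuracies):
    """Mean accuracy across tasks, weighting every task equally."""
    return sum(task_accuracies.values()) / len(task_accuracies)

# Hypothetical per-task accuracies (fraction of questions answered correctly)
task_acc = {
    "abstract_algebra": 0.42,
    "anatomy": 0.61,
    "astronomy": 0.70,
}
overall = macro_average(task_acc)  # the single number shown on a leaderboard
```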

Anyscale

Anyscale is an AI platform created by the developers of the Ray framework, focused on running and scaling machine learning and artificial intelligence workloads. It provides fully managed services spanning data processing, model training, and production inference, helping developers and enterprise teams scale seamlessly from a laptop to distributed computing on thousands of nodes. The platform integrates a cloud IDE, performance optimization, and cost-governance tools for large-scale AI deployments across industries including finance, technology, and media.

April 15, 2026 · 332 views, 0 favorites

AGI-Eval Evaluation Community

AGI-Eval is a large-model evaluation community jointly launched by Shanghai Jiao Tong University, Tongji University, East China Normal University, and DataWhale. The platform provides authoritative model-capability leaderboards, rich evaluation datasets, human-AI collaborative competitions, and a Data Studio, aiming to measure the comprehensive performance of AI models along dimensions such as comprehension, reasoning, and knowledge through a scientific, transparent evaluation system, and to provide evaluation support for researchers and developers.

April 15, 2026 · 442 views, 0 favorites

OpenCompass Sinan Leaderboard

OpenCompass LLM Leaderboard is an open-source evaluation platform for large language models, providing benchmarks on over 100 datasets covering dimensions such as knowledge, logic, math, and code. The leaderboard is updated in real time to show the comprehensive performance rankings of open-source and commercial models such as GPT-4, Claude, and Qwen, giving researchers and developers an objective reference for model selection.

April 15, 2026 · 417 views, 0 favorites

Prompt Llama

Prompt Llama is an online tool focused on text-to-image (AI art) prompt generation and model performance testing. It lets users create high-quality prompts and compare the output of different AI image models (e.g. AlbedoBase XL, AuraFlow) on the same prompt. The platform suits artists, designers, developers, and researchers for creative inspiration, model evaluation, and prompt optimization. The website is based in London and offers an intuitive interface and contact information.

April 15, 2026 · 377 views, 0 favorites

Ai-Ceping

Ai-Ceping is a large language model evaluation platform initiated by Professor Wang Haofen of Tongji University and guided by professors from several universities, dedicated to providing authoritative, fair, and transparent evaluation data collection and analysis services.

April 15, 2026 · 392 views, 0 favorites

C-Eval Leaderboard

The leaderboard mainly presents comprehensive capability rankings of large language models (LLMs) on multi-level, multi-discipline Chinese-language tasks.

April 15, 2026 · 442 views, 0 favorites