Verified Performance Data

Alibaba Cloud

The Qwen model family, from Qwen 1.5 to QwQ-32B. Open-weight multilingual models from Alibaba Cloud's AI research team. Benchmark performance tracked across knowledge, reasoning, coding, math, multimodal tasks, and instruction following.

Models tracked: 4
Weight policy: open
Best knowledge score: 86.1 (MMLU)
Best coding score: 86.6 (HumanEval)

About Alibaba Cloud (Qwen)

Alibaba Cloud’s Qwen team develops the Qwen series of large language models, one of the most widely adopted open-weight model families globally. Based in Hangzhou, China, the team has iterated rapidly from Qwen 1.5 through Qwen 2.5 and the reasoning-focused QwQ-32B, with each generation bringing significant improvements in multilingual understanding, mathematical reasoning, and coding.

The Qwen 2.5 family includes models ranging from 0.5B to 72B parameters, all released under open licenses. QwQ-32B (2025) introduced dedicated reasoning capabilities, rivaling much larger closed-source models on mathematical and scientific benchmarks while remaining open-weight.

Qwen Model Timeline

The evolution of the Qwen model family. Verified scores from official Qwen team announcements.

March 2025

QwQ 32B

A dedicated reasoning model that uses reinforcement learning to train extended chain-of-thought. QwQ-32B scores 65.2% on GPQA Diamond despite having only 32B parameters, demonstrating that smaller reasoning-optimized models can compete with much larger general-purpose ones.

GPQA Diamond: 65.2
September 2024

Qwen 2.5 72B

The flagship of the Qwen 2.5 family. Achieves 86.1% on MMLU, 49.0% on GPQA, 86.6% on HumanEval, and 84.1% on IFEval (strict prompt). One of the strongest open-weight models across all benchmarks.

MMLU: 86.1 · GPQA: 49.0 · HumanEval: 86.6 · IFEval: 84.1
June 2024

Qwen 2 72B

A major architecture upgrade from Qwen 1.5, introducing improved tokenization, longer context windows, and significantly better multilingual understanding. Released as open weights across multiple size variants.

February 2024

Qwen 1.5 72B

Alibaba’s first widely competitive open-weight model. Qwen 1.5 introduced the family’s foundation of strong multilingual capabilities and demonstrated that open models from China could compete with Western counterparts.

Benchmark Performance

Qwen 2.5 72B scores across verified benchmark categories.

Key Strengths

Open-Weight Champion

The entire Qwen 2.5 family is released under open licenses, from 0.5B to 72B parameters, making frontier-class AI accessible to researchers and developers worldwide.

Strong All-Rounder

Qwen 2.5 72B posts competitive scores in every category tracked: 86.1% MMLU, 49.0% GPQA, 86.6% HumanEval, and 84.1% IFEval, making it one of the most balanced open models available.

Reasoning Specialization

QwQ-32B demonstrates that a small, reasoning-focused model can reach 65.2% on GPQA Diamond, competitive with models many times its size, through dedicated reinforcement learning training.
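As a minimal sketch, the verified scores quoted on this page can be kept as plain data and queried, for example to see which model leads on a given benchmark. The model names and numbers below are copied from the timeline above; the `SCORES` structure and `best_on` helper are illustrative, not an official API.

```python
# Verified scores from this page, as plain data. QwQ-32B's entry is its
# GPQA Diamond score; Qwen 2.5 72B's MMLU score is from the base model.
SCORES = {
    "Qwen 2.5 72B": {"MMLU": 86.1, "GPQA": 49.0, "HumanEval": 86.6, "IFEval": 84.1},
    "QwQ-32B": {"GPQA": 65.2},
}

def best_on(benchmark: str) -> tuple[str, float]:
    """Return (model, score) for the highest score on the given benchmark."""
    candidates = {
        model: scores[benchmark]
        for model, scores in SCORES.items()
        if benchmark in scores
    }
    model = max(candidates, key=candidates.get)
    return model, candidates[model]

print(best_on("GPQA"))       # QwQ-32B leads here despite being the smaller model
print(best_on("HumanEval"))
```

Structuring the scores this way makes the page's central claim easy to check mechanically: on GPQA, the 32B reasoning model outscores the 72B generalist.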

About This Data

Benchmark scores are sourced from official Qwen team blog posts and announcements. Qwen 2.5 72B scores use the Instruct variant for GPQA, HumanEval, and IFEval; the MMLU score (86.1%) is from the base model.

Explore More Providers

Compare Alibaba’s Qwen models against other frontier AI systems.