日期: 2026-04-04 模型: Qwen3.5-27B 评测工具: AISBench 推理引擎: vLLM (Ascend NPU) v0.18.RC1
注:生产环境配置最佳实践及落地指南,参考 https://ai.atomgit.com/Ascend-SACT/Qwen3.5-27B-NPU-Accuracy/blob/main/Qwen3.5-27B-%E7%94%9F%E4%BA%A7%E7%8E%AF%E5%A2%83%E6%8E%A8%E7%90%86%E9%85%8D%E7%BD%AE%E6%8C%87%E5%8D%97.md?init=initTree
| 评测编号 | 配置类型 | GPQA Diamond 准确率 | 说明 |
|---|---|---|---|
| 评测 1 | 推理引导+max_out=16384 (8卡) | 84.85% | Reason step by step, T=0.7 |
| 评测 2 | 推理引导+max_out=16384 (4卡) | 84.34% | Reason step by step, T=0.7, TP=4 |
| 官方参考 | - | 85.5% | - |
cd /mnt && \
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" && \
export HCCL_OP_EXPANSION_MODE="AIV" && \
export HCCL_BUFFSIZE=1024 && \
export OMP_NUM_THREADS=10 && \
export OMP_PROC_BIND=false && \
vllm serve /mnt/Qwen3.5-27B \
--served-model-name "qwen3.5" \
--host 0.0.0.0 \
--port 8010 \
--data-parallel-size 1 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--max-num-batched-tokens 16384 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.94 \
--trust-remote-code \
--async-scheduling \
--allowed-local-media-path / \
--mm-processor-cache-gb 0 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}'ais_bench \
--models qwen3_5_27b_chat \
--datasets gpqa_gen_0_shot_str \
-m all \
-w /workspace/aisbench_eval_gpqa_no_thinkingfrom ais_bench.benchmark.models import VLLMCustomAPIChat
models = [
dict(
attr="service",
type=VLLMCustomAPIChat,
abbr='qwen3-5-27b-chat',
path="",
model="qwen3.5",
request_rate = 0,
retry = 2,
host_ip = "127.0.0.1",
host_port = 8010,
max_out_len = 4096,
batch_size=1,
trust_remote_code=True,
generation_kwargs = dict(
temperature = 0.7,
top_p = 0.8,
top_k = 20,
presence_penalty = 1.5,
repetition_penalty = 1.0,
seed = None,
chat_template_kwargs = {"enable_thinking": False},
),
)
]| 项目 | 值 |
|---|---|
| 工作目录 | /workspace/aisbench_eval_gpqa_eval6/20260404_162916 |
| 数据集 | GPQA_diamond |
| 数据集版本 | 1a21c4 |
| 评测指标 | accuracy |
| 准确率 | 84.85% |
| 失败请求数 | 0 |
| 运行时长 | 约 10 小时 |
关键优化:
"Reason step by step, then state your final answer in the format..."| 项目 | 值 |
|---|---|
| 工作目录 | /workspace/aisbench_eval_gpqa_eval7/20260404_230210 |
| 数据集 | GPQA_diamond |
| 数据集版本 | 1a21c4 |
| 评测指标 | accuracy |
| 准确率 | 84.34% |
| 失败请求数 | 0 |
| 运行时长 | 约 4 小时 |
关键配置:
| 配置项 | 评测5 | 评测6 | 变化 |
|---|---|---|---|
| max_out_len | 8192 | 16384 | +100% |
| temperature | 0.7 | 0.7 | 不变 |
| top_p | 0.8 | 0.8 | 不变 |
| 后处理器 | extract_answer_multi_format | extract_answer_multi_format | 不变 |
| Prompt模板 | "Format your response..." | "Reason step by step..." | 核心改进 |
评测1的Prompt:
What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Format your response as follows: "The correct answer is (insert answer here)"评测2的Prompt:
What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Reason step by step, then state your final answer in the format: "The correct answer is (X)"改进效果:
| 配置 | max_out_len | 平均输出长度 | 截断样本数 |
|---|---|---|---|
| 评测5 | 8192 | ~2000字符 | 5个 |
| 评测6 | 16384 | ~11000字符 | 0个 |
改进效果:
| 错误类型 | 评测5错误数 | 评测6错误数 | 改善 |
|---|---|---|---|
| 格式问题(截断) | 5 | 0 | +5正确 |
| 快速猜测错误 | 40 | ~15 | +25正确 |
| 推理后仍错 | 6 | ~6 | 无改善 |
| 总计 | 49 | ~30 | +19正确 |
核心发现:
| 改进项 | 准确率提升 | 根因 |
|---|---|---|
| Prompt增强 | +7.5% | 强制推理,减少快速猜测 |
| max_out_len扩大 | +2.1% | 解决5个截断问题 |
| 总计 | +9.6% | 从75.25%提升至84.85% |
Thinking模式下的评测结果(2.02%和10.10%)远低于官方参考值(85.5%)。问题原因:
extract_non_reasoning_content 后处理器无法正确从thinking内容中提取最终答案--reasoning-parser qwen3,但解析仍不完整非thinking模式(70.71%)显著优于thinking模式,原因:
generation_kwargs = dict(
temperature = 0.7,
top_p = 0.8,
top_k = 20,
presence_penalty = 1.5,
repetition_penalty = 1.0,
)
max_out_len = 16384Prompt模板:
What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Reason step by step, then state your final answer in the format: "The correct answer is (X)"根据 /mnt/Qwen3.5-27B/README.md 中的官方文档:
vllm serve Qwen/Qwen3.5-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3temperature = 1.0
top_p = 0.95
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repetition_penalty = 1.0temperature = 0.7
top_p = 0.8
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repetition_penalty = 1.0