Qwen3.5-27B-NPU-Accuracy:用户可借助该项目获取 Qwen3.5-27B 模型在 NPU 环境下的精度表现及优化方案，核心提供基于 vLLM 引擎的推理配置、GPQA 数据集评测结果，通过 Prompt 增强和参数调优实现高精度推理。【此简介由AI生成】 - AtomGit AI社区

Qwen3.5-27B 精度评测报告

日期: 2026-04-04 模型: Qwen3.5-27B 评测工具: AISBench 推理引擎: vLLM (Ascend NPU) v0.18.RC1

注：生产环境配置最佳实践及落地指南，参考 https://ai.atomgit.com/Ascend-SACT/Qwen3.5-27B-NPU-Accuracy/blob/main/Qwen3.5-27B-%E7%94%9F%E4%BA%A7%E7%8E%AF%E5%A2%83%E6%8E%A8%E7%90%86%E9%85%8D%E7%BD%AE%E6%8C%87%E5%8D%97.md?init=initTree

一、评测结果汇总

评测编号	配置类型	GPQA Diamond 准确率	说明
评测 1	推理引导+max_out=16384 (8卡)	84.85%	Reason step by step, T=0.7
评测 2	推理引导+max_out=16384 (4卡)	84.34%	Reason step by step, T=0.7, TP=4
官方参考	-	85.5%	-

二、vLLM 服务启动命令

2.1 官方最佳配置（8卡，非thinking模式）

cd /mnt && \
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" && \
export HCCL_OP_EXPANSION_MODE="AIV" && \
export HCCL_BUFFSIZE=1024 && \
export OMP_NUM_THREADS=10 && \
export OMP_PROC_BIND=false && \

vllm serve /mnt/Qwen3.5-27B \
    --served-model-name "qwen3.5" \
    --host 0.0.0.0 \
    --port 8010 \
    --data-parallel-size 1 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.94 \
    --trust-remote-code \
    --async-scheduling \
    --allowed-local-media-path / \
    --mm-processor-cache-gb 0 \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}'

三、aisbench 评测命令

3.1 评测命令

ais_bench \
    --models qwen3_5_27b_chat \
    --datasets gpqa_gen_0_shot_str \
    -m all \
    -w /workspace/aisbench_eval_gpqa_no_thinking

四、模型配置文件

4.1 的模型配置（非thinking模式）

from ais_bench.benchmark.models import VLLMCustomAPIChat

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='qwen3-5-27b-chat',
        path="",
        model="qwen3.5",
        request_rate = 0,
        retry = 2,
        host_ip = "127.0.0.1",
        host_port = 8010,
        max_out_len = 4096,
        batch_size=1,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0.7,
            top_p = 0.8,
            top_k = 20,
            presence_penalty = 1.5,
            repetition_penalty = 1.0,
            seed = None,
            chat_template_kwargs = {"enable_thinking": False},
        ),
    )
]

五、详细评测结果

5.1 推理引导+max_out_len=16384

项目	值
工作目录	`/workspace/aisbench_eval_gpqa_eval6/20260404_162916`
数据集	GPQA_diamond
数据集版本	1a21c4
评测指标	accuracy
准确率	84.85%
失败请求数	0
运行时长	约 10 小时

关键优化：

Prompt增强: "Reason step by step, then state your final answer in the format..."
max_out_len: 8192 → 16384

5.2：推理引导+max_out=16384（4卡TP）

项目	值
工作目录	`/workspace/aisbench_eval_gpqa_eval7/20260404_230210`
数据集	GPQA_diamond
数据集版本	1a21c4
评测指标	accuracy
准确率	84.34%
失败请求数	0
运行时长	约 4 小时

关键配置：

tensor-parallel-size: 8 → 4
其他配置与评测6相同

5.3 两次评测的的深度改进分析

5.3.1 评测配置对比

配置项	评测5	评测6	变化
max_out_len	8192	16384	+100%
temperature	0.7	0.7	不变
top_p	0.8	0.8	不变
后处理器	extract_answer_multi_format	extract_answer_multi_format	不变
Prompt模板	"Format your response..."	"Reason step by step..."	核心改进

5.3.2 Prompt增强的深度分析

评测1的Prompt:

What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Format your response as follows: "The correct answer is (insert answer here)"

评测2的Prompt:

What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Reason step by step, then state your final answer in the format: "The correct answer is (X)"

改进效果:

强制推理过程: "Reason step by step"要求模型先进行推理再给出答案
减少快速猜测: 模型不再直接给出答案，而是先分析问题
提高答案质量: 通过推理过程，模型更有可能选择正确答案

5.3.3 max_out_len扩大的深度分析

配置	max_out_len	平均输出长度	截断样本数
评测5	8192	~2000字符	5个
评测6	16384	~11000字符	0个

改进效果:

解决输出截断: 5个原本因截断导致无法提取答案的样本得到正确处理
支持更长推理: 模型可以生成更详细的推理过程
平均输出增长5倍: 从~~2000字符增长到~~11000字符

5.3.4 准确率提升的根因分析

错误类型	评测5错误数	评测6错误数	改善
格式问题（截断）	5	0	+5正确
快速猜测错误	40	~15	+25正确
推理后仍错	6	~6	无改善
总计	49	~30	+19正确

核心发现:

评测6的长推理样本准确率高达90%
评测6的快速猜测样本准确率仅为68.8%
"Reason step by step" prompt使更多样本进入"长推理"模式

5.3.5 评测6相比评测5的效果总结

改进项	准确率提升	根因
Prompt增强	+7.5%	强制推理，减少快速猜测
max_out_len扩大	+2.1%	解决5个截断问题
总计	+9.6%	从75.25%提升至84.85%

六、结果分析

6.1 thinking模式问题

Thinking模式下的评测结果（2.02%和10.10%）远低于官方参考值（85.5%）。问题原因：

输出解析问题: extract_non_reasoning_content 后处理器无法正确从thinking内容中提取最终答案
reasoning-parser参数: 虽然评测2启用了 --reasoning-parser qwen3，但解析仍不完整
max_out_len限制: thinking模式需要更长的输出空间

6.2 非thinking模式优势

非thinking模式（70.71%）显著优于thinking模式，原因：

输出直接为答案，无需解析
采样参数更适合问答场景（temperature=0.7, top_p=0.8）
与官方参考值更接近（85.5% vs 70.71%，差距约15%）

七、结论与建议

7.1 结论

评测1准确率达84.85%，与官方参考值85.5%仅差0.65%
Prompt增强是关键：强制模型"Reason step by step"显著提升准确率（+9.6%）
增强后处理器 extract_answer_multi_format 是重要优化，提取率提升约20%
T=0.7 优于 T=1.0，确定性采样在高难度专业题上更有优势
thinking模式解析问题仍未解决，建议继续使用非thinking模式

7.2 推荐配置

generation_kwargs = dict(
    temperature = 0.7,
    top_p = 0.8,
    top_k = 20,
    presence_penalty = 1.5,
    repetition_penalty = 1.0,
)
max_out_len = 16384

Prompt模板:

What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Reason step by step, then state your final answer in the format: "The correct answer is (X)"

八、官方ModelScope参考配置

根据 /mnt/Qwen3.5-27B/README.md 中的官方文档：

8.1 标准部署（8卡）

vllm serve Qwen/Qwen3.5-27B \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --reasoning-parser qwen3

8.2 官方推荐的thinking模式采样参数

temperature = 1.0
top_p = 0.95
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repetition_penalty = 1.0

8.3 官方推荐的instruct（非thinking）模式采样参数

temperature = 0.7
top_p = 0.8
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repetition_penalty = 1.0

Qwen3.5-27B 精度评测报告

日期: 2026-04-04 模型: Qwen3.5-27B 评测工具: AISBench 推理引擎: vLLM (Ascend NPU) v0.18.RC1

一、评测结果汇总

评测编号	配置类型	GPQA Diamond 准确率	说明
评测 1	推理引导+max_out=16384 (8卡)	84.85%	Reason step by step, T=0.7
评测 2	推理引导+max_out=16384 (4卡)	84.34%	Reason step by step, T=0.7, TP=4
官方参考	-	85.5%	-

二、vLLM 服务启动命令

2.1 官方最佳配置（8卡，非thinking模式）

cd /mnt && \
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" && \
export HCCL_OP_EXPANSION_MODE="AIV" && \
export HCCL_BUFFSIZE=1024 && \
export OMP_NUM_THREADS=10 && \
export OMP_PROC_BIND=false && \

vllm serve /mnt/Qwen3.5-27B \
    --served-model-name "qwen3.5" \
    --host 0.0.0.0 \
    --port 8010 \
    --data-parallel-size 1 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.94 \
    --trust-remote-code \
    --async-scheduling \
    --allowed-local-media-path / \
    --mm-processor-cache-gb 0 \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}'

三、aisbench 评测命令

3.1 评测命令

ais_bench \
    --models qwen3_5_27b_chat \
    --datasets gpqa_gen_0_shot_str \
    -m all \
    -w /workspace/aisbench_eval_gpqa_no_thinking

四、模型配置文件

4.1 的模型配置（非thinking模式）

from ais_bench.benchmark.models import VLLMCustomAPIChat

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='qwen3-5-27b-chat',
        path="",
        model="qwen3.5",
        request_rate = 0,
        retry = 2,
        host_ip = "127.0.0.1",
        host_port = 8010,
        max_out_len = 4096,
        batch_size=1,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0.7,
            top_p = 0.8,
            top_k = 20,
            presence_penalty = 1.5,
            repetition_penalty = 1.0,
            seed = None,
            chat_template_kwargs = {"enable_thinking": False},
        ),
    )
]

五、详细评测结果

5.1 推理引导+max_out_len=16384

项目	值
工作目录	`/workspace/aisbench_eval_gpqa_eval6/20260404_162916`
数据集	GPQA_diamond
数据集版本	1a21c4
评测指标	accuracy
准确率	84.85%
失败请求数	0
运行时长	约 10 小时

关键优化：

Prompt增强: "Reason step by step, then state your final answer in the format..."
max_out_len: 8192 → 16384

5.2：推理引导+max_out=16384（4卡TP）

项目	值
工作目录	`/workspace/aisbench_eval_gpqa_eval7/20260404_230210`
数据集	GPQA_diamond
数据集版本	1a21c4
评测指标	accuracy
准确率	84.34%
失败请求数	0
运行时长	约 4 小时

关键配置：

tensor-parallel-size: 8 → 4
其他配置与评测6相同

5.3 两次评测的的深度改进分析

5.3.1 评测配置对比

配置项	评测5	评测6	变化
max_out_len	8192	16384	+100%
temperature	0.7	0.7	不变
top_p	0.8	0.8	不变
后处理器	extract_answer_multi_format	extract_answer_multi_format	不变
Prompt模板	"Format your response..."	"Reason step by step..."	核心改进

5.3.2 Prompt增强的深度分析

评测1的Prompt:

What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Format your response as follows: "The correct answer is (insert answer here)"

评测2的Prompt:

What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Reason step by step, then state your final answer in the format: "The correct answer is (X)"

改进效果:

强制推理过程: "Reason step by step"要求模型先进行推理再给出答案
减少快速猜测: 模型不再直接给出答案，而是先分析问题
提高答案质量: 通过推理过程，模型更有可能选择正确答案

5.3.3 max_out_len扩大的深度分析

配置	max_out_len	平均输出长度	截断样本数
评测5	8192	~2000字符	5个
评测6	16384	~11000字符	0个

改进效果:

解决输出截断: 5个原本因截断导致无法提取答案的样本得到正确处理
支持更长推理: 模型可以生成更详细的推理过程
平均输出增长5倍: 从~~2000字符增长到~~11000字符

5.3.4 准确率提升的根因分析

错误类型	评测5错误数	评测6错误数	改善
格式问题（截断）	5	0	+5正确
快速猜测错误	40	~15	+25正确
推理后仍错	6	~6	无改善
总计	49	~30	+19正确

核心发现:

评测6的长推理样本准确率高达90%
评测6的快速猜测样本准确率仅为68.8%
"Reason step by step" prompt使更多样本进入"长推理"模式

5.3.5 评测6相比评测5的效果总结

改进项	准确率提升	根因
Prompt增强	+7.5%	强制推理，减少快速猜测
max_out_len扩大	+2.1%	解决5个截断问题
总计	+9.6%	从75.25%提升至84.85%

六、结果分析

6.1 thinking模式问题

Thinking模式下的评测结果（2.02%和10.10%）远低于官方参考值（85.5%）。问题原因：

输出解析问题: extract_non_reasoning_content 后处理器无法正确从thinking内容中提取最终答案
reasoning-parser参数: 虽然评测2启用了 --reasoning-parser qwen3，但解析仍不完整
max_out_len限制: thinking模式需要更长的输出空间

6.2 非thinking模式优势

非thinking模式（70.71%）显著优于thinking模式，原因：

输出直接为答案，无需解析
采样参数更适合问答场景（temperature=0.7, top_p=0.8）
与官方参考值更接近（85.5% vs 70.71%，差距约15%）

七、结论与建议

7.1 结论

评测1准确率达84.85%，与官方参考值85.5%仅差0.65%
Prompt增强是关键：强制模型"Reason step by step"显著提升准确率（+9.6%）
增强后处理器 extract_answer_multi_format 是重要优化，提取率提升约20%
T=0.7 优于 T=1.0，确定性采样在高难度专业题上更有优势
thinking模式解析问题仍未解决，建议继续使用非thinking模式

7.2 推荐配置

generation_kwargs = dict(
    temperature = 0.7,
    top_p = 0.8,
    top_k = 20,
    presence_penalty = 1.5,
    repetition_penalty = 1.0,
)
max_out_len = 16384

Prompt模板:

What is the correct answer to this question: {question}
Choices:
(A){A}
(B){B}
(C){C}
(D){D}
Reason step by step, then state your final answer in the format: "The correct answer is (X)"

八、官方ModelScope参考配置

根据 /mnt/Qwen3.5-27B/README.md 中的官方文档：

8.1 标准部署（8卡）

vllm serve Qwen/Qwen3.5-27B \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --reasoning-parser qwen3

8.2 官方推荐的thinking模式采样参数

temperature = 1.0
top_p = 0.95
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repetition_penalty = 1.0

8.3 官方推荐的instruct（非thinking）模式采样参数

temperature = 0.7
top_p = 0.8
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repetition_penalty = 1.0