DeepSeek-V4-Flash-w8a8-mtp

1. 基本信息

项目	信息
原始模型名	DeepSeek-V4-Flash
原始模型链接	deepseek-ai/DeepSeek-V4-Flash
精度测试机型	Atlas 800T A2 1台
精度测试平台	docker vllm-ascend
版本	vllm-ascend:v0.13.0rc3
链接	quay.m.daocloud.io/ascend/vllm-ascend:v0.13.0rc3

2 量化脚本：

现已集成一键量化

msmodelslim quant \
 --model_path ${model_path} \
 --save_path ${save_path} \
 --model_type DeepSeek-V4-Flash \
 --quant_type w8a8 \
 --trust_remote_code True

3 精度测试结果

模型名	量化格式	数据集	测试精度 %	官方精度 %	备注
DeepSeek-V4-Flash-w8a8-mtp	w8a8	gpqa	71.21	71.2	V4-Flash 无思考
DeepSeek-V4-Flash-w8a8-mtp	w8a8	mmlupro	82.85	83.0	V4-Flash 无思考
DeepSeek-V4-Flash-w8a8-mtp	w8a8	mmlupro	85.86	86.2	V4-Flash 最大

使用ais_bench，其中Non-Think模式max_out_len = 65536，Max模式max_out_len = 131072。精度存在波动，建议多次测试。

4 思考模式开启方法

4.1 Curl指令：

无思考：不加思考参数

高："chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}

最大："chat_template_kwargs": {"thinking": true, "reasoning_effort": "max"}

4.2 Ais_bench 基准测试：

无思考：不加思考参数

高：

generation_kwargs=dict(
            ....
            chat_template_kwargs = {"thinking": True, "reasoning_effort": "high"}
        )

麦克斯：

generation_kwargs=dict(
            ....
            chat_template_kwargs = {"thinking": True, "reasoning_effort": "max"}
        )

DeepSeek-V4-Flash-w8a8-mtp

1. 基本信息

项目	信息
原始模型名	DeepSeek-V4-Flash
原始模型链接	deepseek-ai/DeepSeek-V4-Flash
精度测试机型	Atlas 800T A2 1台
精度测试平台	docker vllm-ascend
版本	vllm-ascend:v0.13.0rc3
链接	quay.m.daocloud.io/ascend/vllm-ascend:v0.13.0rc3

2 量化脚本：

现已集成一键量化

msmodelslim quant \
 --model_path ${model_path} \
 --save_path ${save_path} \
 --model_type DeepSeek-V4-Flash \
 --quant_type w8a8 \
 --trust_remote_code True

3 精度测试结果

模型名	量化格式	数据集	测试精度 %	官方精度 %	备注
DeepSeek-V4-Flash-w8a8-mtp	w8a8	gpqa	71.21	71.2	V4-Flash 无思考
DeepSeek-V4-Flash-w8a8-mtp	w8a8	mmlupro	82.85	83.0	V4-Flash 无思考
DeepSeek-V4-Flash-w8a8-mtp	w8a8	mmlupro	85.86	86.2	V4-Flash 最大

使用ais_bench，其中Non-Think模式max_out_len = 65536，Max模式max_out_len = 131072。精度存在波动，建议多次测试。

4 思考模式开启方法

4.1 Curl指令：

无思考：不加思考参数

高："chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}

最大："chat_template_kwargs": {"thinking": true, "reasoning_effort": "max"}

4.2 Ais_bench 基准测试：

无思考：不加思考参数

高：

generation_kwargs=dict(
            ....
            chat_template_kwargs = {"thinking": True, "reasoning_effort": "high"}
        )

麦克斯：

generation_kwargs=dict(
            ....
            chat_template_kwargs = {"thinking": True, "reasoning_effort": "max"}
        )