Ascend-SACT/Intern-S2-Preview_vllm18

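0. Overview

Intern-S2-Preview was released in May 2026, and the PR that adds inference support for it was merged in vllm v0.21.1rc0.

The current stable release of vllm_ascend is v0.18.0. A stable vllm release is also the better choice when pairing with verl for reinforcement learning.

This document describes how to enable Intern-S2-Preview inference deployment on vllm & vllm_ascend v0.18.0.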

Table 1. Version compatibility

Component     Version       Environment setup guide
CANN          cann-8.5.1    CANN quick installation guide
torch         2.9.0+cpu     -
torch_npu     2.9.0.post1   -
transformers  5.8.1         -
vllm          0.18.0        -
vllm-ascend   0.18.0        -

Hardware

8 × NPU (910B A2)

Installation directory

The steps below use /workspace as the root directory for source installation.

1. vllm & vllm_ascend

1.1. Pull the official vllm & vllm_ascend image

docker pull quay.io/ascend/vllm-ascend:v0.18.0

1.2. Start the container

Start the image from the command line, or load it through your service platform.
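
A minimal sketch of the command-line route, assuming a typical Ascend host where the NPUs appear as /dev/davinci0 through /dev/davinci7 and the driver lives under /usr/local/Ascend. The container name, the /workspace mount, and the 8000 port mapping are illustrative assumptions, not requirements of this guide. The helper only prints the docker run command, so it can be reviewed and adjusted before it is executed.

```shell
IMAGE="quay.io/ascend/vllm-ascend:v0.18.0"

# Print (do not run) a docker run command that exposes NPUs 0..count-1.
# The function body runs in a subshell, so its variables do not leak.
build_docker_run() (
    count="$1"
    cmd="docker run -it --rm --name vllm-ascend --shm-size=1g"
    i=0
    while [ "$i" -lt "$count" ]; do
        cmd="$cmd --device /dev/davinci$i"
        i=$((i + 1))
    done
    # Management devices and driver files (assumed default Ascend driver layout).
    cmd="$cmd --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc"
    cmd="$cmd -v /usr/local/dcmi:/usr/local/dcmi"
    cmd="$cmd -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi"
    cmd="$cmd -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/"
    cmd="$cmd -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info"
    cmd="$cmd -v /etc/ascend_install.info:/etc/ascend_install.info"
    # Assumed: share /workspace with the host and publish the serving port.
    cmd="$cmd -v /workspace:/workspace -p 8000:8000"
    printf '%s\n' "$cmd $IMAGE bash"
)

build_docker_run 8
```

Printing the command first makes it easy to drop devices or mounts that do not exist on a given host.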

2. Update the image

Add Intern-S2-Preview support to vllm & vllm_ascend v0.18.0.

Reference: https://github.com/vllm-project/vllm/commit/faa4b76afae3660efc8a5d48a8c7c2d37e0ffb05

2.1. Patch vllm

Run the following commands:

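# Work in the vllm source tree shipped with the image
cd /vllm-workspace/vllm
# Point origin at the GitCode mirror, then fetch and cherry-pick the referenced commit
git remote set-url origin https://gitcode.com/GitHub_Trending/vl/vllm.git
git fetch origin faa4b76afae3660efc8a5d48a8c7c2d37e0ffb05
git cherry-pick faa4b76afae3660efc8a5d48a8c7c2d37e0ffb05
# Remove llm_base_proposer.py after the cherry-pick
git rm vllm/v1/spec_decode/llm_base_proposer.py
# Register the model class in the config map and in the speculative-decoding proposer
git apply <<'EOF'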
diff --git a/vllm/model_executor/models/config.py b/vllm/model_executor/models/config.py
index 881963dbc..19c3c530f 100644
--- a/vllm/model_executor/models/config.py
+++ b/vllm/model_executor/models/config.py
@@ -673,6 +673,7 @@ MODELS_CONFIG_MAP: dict[str, type[VerifyAndUpdateConfig]] = {
     "Qwen3VLForSequenceClassification": Qwen3VLForSequenceClassificationConfig,
     "Qwen3_5ForConditionalGeneration": Qwen3_5ForConditionalGenerationConfig,
     "Qwen3_5MoeForConditionalGeneration": Qwen3_5ForConditionalGenerationConfig,
+    "InternS2PreviewForConditionalGeneration": Qwen3_5ForConditionalGenerationConfig,
     "VoyageQwen3BidirectionalEmbedModel": VoyageQwen3BidirectionalEmbedModelConfig,
     "XLMRobertaModel": JinaRobertaModelConfig,
 }
diff --git a/vllm/v1/spec_decode/eagle.py b/vllm/v1/spec_decode/eagle.py
index 445bb403b..15447043b 100644
--- a/vllm/v1/spec_decode/eagle.py
+++ b/vllm/v1/spec_decode/eagle.py
@@ -1275,6 +1275,7 @@ class SpecDecodeBaseProposer:
                 "GlmOcrForConditionalGeneration",
                 "Qwen3_5ForConditionalGeneration",
                 "Qwen3_5MoeForConditionalGeneration",
+                "InternS2PreviewForConditionalGeneration",
             ]:
                 self.model.config.image_token_index = target_model.config.image_token_id
             elif self.get_model_name(target_model) == "PixtralForConditionalGeneration":
EOF

2.2. Patch vllm_ascend

Run the following commands:

cd /vllm-workspace/vllm-ascend/
git apply <<'EOF'
diff --git a/vllm_ascend/spec_decode/eagle_proposer.py b/vllm_ascend/spec_decode/eagle_proposer.py
index b4338345..4639342f 100644
--- a/vllm_ascend/spec_decode/eagle_proposer.py
+++ b/vllm_ascend/spec_decode/eagle_proposer.py
@@ -219,6 +219,7 @@ class SpecDecodeBaseProposer(EagleProposer):
                 "Qwen3VLMoeForConditionalGeneration",
                 "Qwen3_5ForConditionalGeneration",
                 "Qwen3_5MoeForConditionalGeneration",
+                "InternS2PreviewForConditionalGeneration",
             ]:
                 self.model.config.image_token_index = model.config.image_token_id
             elif self.get_model_name(model) == "PixtralForConditionalGeneration":

EOF
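
After both patches, a quick check can confirm that the model class name landed in each file. The sketch below is a hypothetical helper, not part of vllm or vllm_ascend; it only greps for the string added by the diffs in sections 2.1 and 2.2.

```shell
# Return 0 if the model class name appears in the given file, non-zero otherwise.
has_interns2_entry() {
    grep -q '"InternS2PreviewForConditionalGeneration"' "$1" 2>/dev/null
}

# Print one status line per file passed as an argument.
report_patch_status() {
    for f in "$@"; do
        if has_interns2_entry "$f"; then
            echo "patched: $f"
        else
            echo "MISSING: $f"
        fi
    done
}

# Example (run inside the container):
# report_patch_status \
#     /vllm-workspace/vllm/vllm/model_executor/models/config.py \
#     /vllm-workspace/vllm/vllm/v1/spec_decode/eagle.py \
#     /vllm-workspace/vllm-ascend/vllm_ascend/spec_decode/eagle_proposer.py
```

A MISSING line for any of the three files suggests that the corresponding git apply step did not take effect.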

2.3. Update transformers

pip install transformers==5.8.1
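
To confirm that the environment matches Table 1 after the upgrade, installed versions can be compared with the expected ones. This is a small sketch with a hypothetical helper name, check_version; the commented usage lines assume python3 is on the PATH inside the container.

```shell
# Compare an installed version string with the expected one from Table 1.
# Prints a status line and returns 1 on mismatch.
check_version() {
    name="$1"; expected="$2"; actual="$3"
    if [ "$actual" = "$expected" ]; then
        echo "ok: $name $actual"
    else
        echo "mismatch: $name expected $expected, found $actual"
        return 1
    fi
}

# Example (run inside the container):
# check_version transformers 5.8.1 "$(python3 -c 'import transformers; print(transformers.__version__)')"
# check_version torch 2.9.0+cpu "$(python3 -c 'import torch; print(torch.__version__)')"
```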

3. Deploy inference

3.1. Download the weights

mkdir -p /workspace/models
cd /workspace/models
modelscope download --model Shanghai_AI_Laboratory/Intern-S2-Preview --local_dir ./Intern-S2-Preview

3.2. Start vllm

Run the following command from /workspace/models:

vllm serve ./Intern-S2-Preview \
--host 0.0.0.0 \
--port 8000 \
--served-model-name interns2_preview \
--trust-remote-code \
--data-parallel-size 4 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--enable-prefix-caching \
--enable-chunked-prefill \
--async-scheduling \
--max-num-seqs 64 \
--compilation-config '{"cudagraph_capture_sizes":[1,4,8,12,16,24,32,48,56,64], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"method":"mtp","num_speculative_tokens":4}' \
--additional-config '{"enable_cpu_binding":true}' \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
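
The command above sets --data-parallel-size 4 and --tensor-parallel-size 2, which together occupy 4 × 2 = 8 NPUs, matching the 8-NPU machine listed under Hardware. The sketch below checks that arithmetic for other layouts. It assumes one NPU per rank and no pipeline parallelism, and check_parallel_layout is a hypothetical helper name.

```shell
# Return 0 when data-parallel size x tensor-parallel size equals the NPU count.
check_parallel_layout() {
    dp="$1"; tp="$2"; npus="$3"
    needed=$((dp * tp))
    if [ "$needed" -eq "$npus" ]; then
        echo "ok: DP=$dp x TP=$tp uses $needed of $npus NPUs"
    else
        echo "mismatch: DP=$dp x TP=$tp needs $needed NPUs, but $npus are available"
        return 1
    fi
}

# Layout used in this guide: 4 data-parallel replicas, each spanning 2 NPUs.
check_parallel_layout 4 2 8
```

On a machine with a different NPU count, adjust the two sizes so that their product matches the number of devices exposed to the container.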

4. Verification
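
Loading the weights can take a while, so requests sent right after vllm serve starts may be refused. A minimal sketch that polls the server's /health endpoint before any requests are sent follows; the attempt count and delay are arbitrary defaults, and wait_for_server is a hypothetical helper name.

```shell
# Poll <base_url>/health until it answers with success or the attempts run out.
# Usage: wait_for_server BASE_URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_server() {
    base_url="$1"; attempts="${2:-60}"; delay="${3:-10}"
    n=1
    while [ "$n" -le "$attempts" ]; do
        if curl -sf -o /dev/null --max-time 5 "$base_url/health"; then
            echo "server is ready"
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    echo "server did not become ready after $attempts attempts" >&2
    return 1
}

# Example:
# wait_for_server http://127.0.0.1:8000 && echo "send requests now"
```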

4.1. Text

Run the following command:

curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "interns2_preview",
    "messages": [
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]}
    ],
    "temperature": 0.6,
    "max_tokens": 256
}'

4.2. Image

Run the following command:

curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "interns2_preview",
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "Describe the image."}
        ]}
    ],
    "temperature": 0.6,
    "max_tokens": 256
}'
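
When the server cannot reach external URLs, a local image can be sent inline instead. The sketch below builds a base64 data URL that should be usable as the "url" value in the request above, assuming the server accepts data URLs in image_url fields as vLLM's OpenAI-compatible API generally does. The helper name image_data_url and the default MIME type are assumptions.

```shell
# Print a data URL for a local image file. The MIME type defaults to image/png.
# Usage: image_data_url FILE [MIME_TYPE]
image_data_url() {
    file="$1"; mime="${2:-image/png}"
    # tr strips the line wrapping that some base64 implementations add.
    printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# Example:
# image_data_url ./qwen.png
# image_data_url ./photo.jpg image/jpeg
```

For large images the resulting string can exceed shell argument limits; writing the JSON body to a file and passing it with curl -d @body.json avoids that.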