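Intern-S2-Preview was released in May 2026, and the PR that adds inference deployment support for it was merged in vllm v0.21.1rc0.
The current stable version of vllm_ascend is v0.18.0. A stable vllm release is also the better choice when pairing with verl for reinforcement learning.
This document describes how to enable Intern-S2-Preview inference deployment on vllm & vllm_ascend v0.18.0.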
| Component | Version | Environment setup guide |
|---|---|---|
| CANN | cann-8.5.1 | Quick CANN installation |
| torch | 2.9.0+cpu | - |
| torch_npu | 2.9.0.post1 | - |
| transformers | 5.8.1 | - |
| vllm | 0.18.0 | - |
| vllm-ascend | 0.18.0 | - |
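
Once the environment is set up, the Python package versions in the table can be compared with what is actually installed. The following is a minimal sketch rather than an official tool: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes the distribution names match the table rows. Only the public part of each version (the part before any `+` local suffix) is compared, because source installs may report a local suffix.

```python
from importlib import metadata

# Python package versions copied from the table above (CANN is not a Python package).
EXPECTED = {
    "torch": "2.9.0+cpu",
    "torch_npu": "2.9.0.post1",
    "transformers": "5.8.1",
    "vllm": "0.18.0",
    "vllm-ascend": "0.18.0",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {name: (wanted, found)} for every package whose version differs.

    Assumption: only the public version (before any '+' suffix) is compared.
    """
    mismatches = {}
    for name, want in expected.items():
        got = lookup(name)
        if got is None or got.split("+")[0] != want.split("+")[0]:
            mismatches[name] = (want, got)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` inside the container returns an empty dict when every package matches. A non-empty result is a prompt to look closer, not proof that the setup is broken.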
8 * NPU(910B A2)
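
The serve command later in this guide sets `--data-parallel-size 4` and `--tensor-parallel-size 2`, which together occupy all 8 NPUs. The arithmetic can be sketched as follows. The function names are hypothetical, and pipeline parallelism is assumed to be 1 because the guide does not set it.

```python
def required_devices(data_parallel_size, tensor_parallel_size, pipeline_parallel_size=1):
    """Number of devices a deployment needs: the product of the parallel sizes."""
    return data_parallel_size * tensor_parallel_size * pipeline_parallel_size


def fits_on_node(available_devices, data_parallel_size, tensor_parallel_size):
    """True when the requested layout fits on the available devices."""
    return required_devices(data_parallel_size, tensor_parallel_size) <= available_devices


# Layout used in this guide: DP=4, TP=2 on an 8-NPU node.
assert required_devices(4, 2) == 8
assert fits_on_node(8, 4, 2)
```

On a node with fewer NPUs, the two sizes would have to be reduced so that their product does not exceed the device count.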
The rest of this document uses /workspace as the root directory for source installation.
Pull the image with the following command, or load it through your service platform:
docker pull quay.io/ascend/vllm-ascend:v0.18.0
Supporting Intern-S2-Preview_vllm18 on vllm & vllm_ascend v0.18.0
Reference: https://github.com/vllm-project/vllm/commit/faa4b76afae3660efc8a5d48a8c7c2d37e0ffb05
Run the following commands:
cd /vllm-workspace/vllm
git remote set-url origin https://gitcode.com/GitHub_Trending/vl/vllm.git
git fetch origin faa4b76afae3660efc8a5d48a8c7c2d37e0ffb05
git cherry-pick faa4b76afae3660efc8a5d48a8c7c2d37e0ffb05
git rm vllm/v1/spec_decode/llm_base_proposer.py
cd /vllm-workspace/vllm
git apply <<EOF
diff --git a/vllm/model_executor/models/config.py b/vllm/model_executor/models/config.py
index 881963dbc..19c3c530f 100644
--- a/vllm/model_executor/models/config.py
+++ b/vllm/model_executor/models/config.py
@@ -673,6 +673,7 @@ MODELS_CONFIG_MAP: dict[str, type[VerifyAndUpdateConfig]] = {
"Qwen3VLForSequenceClassification": Qwen3VLForSequenceClassificationConfig,
"Qwen3_5ForConditionalGeneration": Qwen3_5ForConditionalGenerationConfig,
"Qwen3_5MoeForConditionalGeneration": Qwen3_5ForConditionalGenerationConfig,
+ "InternS2PreviewForConditionalGeneration": Qwen3_5ForConditionalGenerationConfig,
"VoyageQwen3BidirectionalEmbedModel": VoyageQwen3BidirectionalEmbedModelConfig,
"XLMRobertaModel": JinaRobertaModelConfig,
}
diff --git a/vllm/v1/spec_decode/eagle.py b/vllm/v1/spec_decode/eagle.py
index 445bb403b..15447043b 100644
--- a/vllm/v1/spec_decode/eagle.py
+++ b/vllm/v1/spec_decode/eagle.py
@@ -1275,6 +1275,7 @@ class SpecDecodeBaseProposer:
"GlmOcrForConditionalGeneration",
"Qwen3_5ForConditionalGeneration",
"Qwen3_5MoeForConditionalGeneration",
+ "InternS2PreviewForConditionalGeneration",
]:
self.model.config.image_token_index = target_model.config.image_token_id
elif self.get_model_name(target_model) == "PixtralForConditionalGeneration":
EOF
Run the following commands:
cd /vllm-workspace/vllm-ascend/
git apply <<EOF
diff --git a/vllm_ascend/spec_decode/eagle_proposer.py b/vllm_ascend/spec_decode/eagle_proposer.py
index b4338345..4639342f 100644
--- a/vllm_ascend/spec_decode/eagle_proposer.py
+++ b/vllm_ascend/spec_decode/eagle_proposer.py
@@ -219,6 +219,7 @@ class SpecDecodeBaseProposer(EagleProposer):
"Qwen3VLMoeForConditionalGeneration",
"Qwen3_5ForConditionalGeneration",
"Qwen3_5MoeForConditionalGeneration",
+ "InternS2PreviewForConditionalGeneration",
]:
self.model.config.image_token_index = model.config.image_token_id
elif self.get_model_name(model) == "PixtralForConditionalGeneration":
EOF
pip install transformers==5.8.1
cd /workspace/models
modelscope download --model Shanghai_AI_Laboratory/Intern-S2-Preview --local_dir ./Intern-S2-Preview
Run the following command:
vllm serve ./Intern-S2-Preview \
--host 0.0.0.0 \
--port 8000 \
--served-model-name interns2_preview \
--trust-remote-code \
--data-parallel-size 4 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--enable-prefix-caching \
--enable-chunked-prefill \
--async-scheduling \
--max-num-seqs 64 \
--compilation-config '{"cudagraph_capture_sizes":[1,4,8,12,16,24,32,48,56,64], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"method":"mtp","num_speculative_tokens":4}' \
--additional-config '{"enable_cpu_binding":true}' \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Run the following command:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "interns2_preview",
"messages": [
{"role": "system", "content": "You are an AI assistant."},
{"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]}
],
"temperature": 0.6,
"max_tokens": 256
}'
Run the following command:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "interns2_preview",
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "Describe the image."}
]}
],
"temperature": 0.6,
"max_tokens": 256
}'
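
The two curl requests above can also be issued from Python. The sketch below uses only the standard library and mirrors the request bodies shown above. The helper names (`build_chat_payload`, `extract_reply`, `send_chat`) are hypothetical. The reasoning field name is an assumption: depending on the vllm version, the reasoning parser may return it as `reasoning` or `reasoning_content`, so both are checked.

```python
import json
import urllib.request

# Endpoint and served model name taken from the serve and curl commands above.
BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"
MODEL_NAME = "interns2_preview"


def build_chat_payload(text, image_url=None, system=None,
                       temperature=0.6, max_tokens=256):
    """Build a chat-completions request body like the curl examples."""
    content = []
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    content.append({"type": "text", "text": text})

    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": content})

    return {
        "model": MODEL_NAME,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Return (reasoning, content) from a chat-completions response dict.

    Assumption: the reasoning text, if present, is under 'reasoning'
    or 'reasoning_content'; otherwise None is returned for it.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning") or message.get("reasoning_content")
    return reasoning, message.get("content")


def send_chat(payload, url=BASE_URL, timeout=120):
    """POST the payload to the running server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `extract_reply(send_chat(build_chat_payload("Please give a brief self-introduction.", system="You are an AI assistant.")))` corresponds to the text request, and passing `image_url=...` corresponds to the image request.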