Qwen3.5-397B-A17B

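[!Note] This repository contains the weights and configuration files of the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, and similar tools.

[!Tip] For users who want managed, scalable inference without maintaining infrastructure, Alibaba Cloud Model Studio provides the official Qwen API service.

In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B, with additional production features such as a default 1M-token context length, official built-in tools, and adaptive tool use. For more information, please refer to the user guide.

In recent months, we have focused on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to give developers and enterprises greater capability and efficiency.

Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

- Unified Vision-Language Foundation: Early-fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models on reasoning, coding, agent, and visual understanding benchmarks.
- Efficient Hybrid Architecture: Gated Delta Networks combined with a sparse Mixture-of-Experts architecture deliver high-throughput inference with minimal latency and cost overhead.
- Scalable RL Generalization: Reinforcement learning is scaled across million-agent environments with progressively more complex task distributions for robust real-world adaptability.
- Global Linguistic Coverage: Support is expanded to 201 languages and dialects, enabling inclusive worldwide deployment with nuanced cultural and regional understanding.
- Next-Generation Training Infrastructure: Multimodal training efficiency is close to 100% of text-only training, and asynchronous RL frameworks support large-scale agent scaffolds and environment orchestration.

[Figure: Benchmark Results]

For more details, please refer to our blog post Qwen3.5.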

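Model Overview

Type: Causal Language Model with Vision Encoder
Training Stage: Pre-training and Post-training
Language Model
- Number of Parameters: 397B in total and 17B activated
- Hidden Dimension: 4096
- Token Embedding: 248320 (Padded)
- Number of Layers: 60
  - Hidden Layout: 15 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
- Gated DeltaNet:
  - Number of Linear Attention Heads: 64 for V and 16 for QK
  - Head Dimension: 128
- Gated Attention:
  - Number of Attention Heads: 32 for Q and 2 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Mixture of Experts:
  - Number of Experts: 512
  - Number of Activated Experts: 10 Routed + 1 Shared
  - Expert Intermediate Dimension: 1024
- LM Output: 248320 (Padded)
- MTP: trained with multi-steps
Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens.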
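
As a quick sanity check on the figures above, the following sketch expands the hidden layout and confirms that it yields 60 layers. It only restates numbers from the overview; the variable names are illustrative and are not part of any Qwen API.

```python
# Illustrative arithmetic only; every number is copied from the model overview.
GROUPS = 15                # repeated blocks in the hidden layout
DELTANET_PER_GROUP = 3     # (Gated DeltaNet -> MoE) layers per block
ATTENTION_PER_GROUP = 1    # (Gated Attention -> MoE) layers per block

group = ["gated_deltanet"] * DELTANET_PER_GROUP + ["gated_attention"] * ATTENTION_PER_GROUP
layout = group * GROUPS

num_layers = len(layout)
num_deltanet = layout.count("gated_deltanet")
num_attention = layout.count("gated_attention")

# Share of parameters activated per token: 17B out of 397B.
active_fraction = 17 / 397

# Share of each gated-attention head covered by rotary embeddings: 64 of 256 dims.
rotary_fraction = 64 / 256

print(num_layers, num_deltanet, num_attention)
print(f"active parameters: {active_fraction:.1%}, rotary fraction: {rotary_fraction}")
```

The rotary fraction of 0.25 agrees with the partial_rotary_factor value used in the YaRN configuration later in this document.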

Benchmark Results

Language

	GPT5.2	Claude 4.5 Opus	Gemini-3 Pro	Qwen3-Max-Thinking	K2.5-1T-A32B	Qwen3.5-397B-A17B
Knowledge
MMLU-Pro	87.4	89.5	89.8	85.7	87.1	87.8
MMLU-Redux	95.0	95.6	95.9	92.8	94.5	94.9
SuperGPQA	67.9	70.6	74.0	67.3	69.2	70.4
C-Eval	90.5	92.2	93.4	93.7	94.0	93.0
Instruction Following
IFEval	94.8	90.9	93.5	93.4	93.9	92.6
IFBench	75.4	58.0	70.4	70.9	70.2	76.5
MultiChallenge	57.9	54.2	64.2	63.3	62.7	67.6
Long Context
AA-LCR	72.7	74.0	70.7	68.7	70.0	68.7
LongBench v2	54.5	64.4	68.2	60.6	61.0	63.2
STEM
GPQA	92.4	87.0	91.9	87.4	87.6	88.4
HLE	35.5	30.8	37.5	30.2	30.1	28.7
HLE-Verified¹	43.3	38.8	48	37.6	--	37.6
Reasoning
LiveCodeBench v6	87.7	84.8	90.7	85.9	85.0	83.6
HMMT Feb 25	99.4	92.9	97.3	98.0	95.4	94.8
HMMT Nov 25	100	93.3	93.3	94.7	91.1	92.7
IMOAnswerBench	86.3	84.0	83.3	83.9	81.8	80.9
AIME26	96.7	93.3	90.6	93.3	93.3	91.3
General Agent
BFCL-V4	63.1	77.5	72.5	67.7	68.3	72.9
TAU2-Bench	87.1	91.6	85.4	84.6	77.0	86.7
VITA-Bench	38.2	56.3	51.6	40.9	41.9	49.7
DeepPlanning	44.6	33.9	23.3	28.7	14.5	34.3
Tool Decathlon	43.8	43.5	36.4	18.8	27.8	38.3
MCP-Mark	57.5	42.3	53.9	33.5	29.5	46.1
Search Agent³
HLE w/ tool	45.5	43.4	45.8	49.8	50.2	48.3
BrowseComp	65.8	67.8	59.2	53.9	--/74.9	69.0/78.6
BrowseComp-zh	76.1	62.4	66.8	60.9	--	70.3
WideSearch	76.8	76.4	68.0	57.9	72.7	74.0
Seal-0	45.0	47.7	45.5	46.9	57.4	46.9
Multilingualism
MMMLU	89.5	90.1	90.6	84.4	86.0	88.5
MMLU-ProX	83.7	85.7	87.7	78.5	82.3	84.7
NOVA-63	54.6	56.7	56.7	54.2	56.0	59.1
INCLUDE	87.5	86.2	90.5	82.3	83.3	85.6
Global PIQA	90.9	91.6	93.2	86.0	89.3	89.8
PolyMATH	62.5	79.0	81.6	64.7	43.1	73.3
WMT24++	78.8	79.7	80.7	77.6	77.6	78.9
MAXIFE	88.4	79.2	87.5	84.0	72.8	88.2
Coding Agent
SWE-bench Verified	80.0	80.9	76.2	75.3	76.8	76.4
SWE-bench Multilingual	72.0	77.5	65.0	66.7	73.0	69.3
SecCodeBench	68.7	68.6	62.4	57.5	61.3	68.3
Terminal Bench 2	54.0	59.3	54.2	22.5	50.8	52.5

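* HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent component-wise verification protocol and a fine-grained error taxonomy. The dataset is open-sourced at https://huggingface.co/datasets/skylenage/HLE-Verified.
* TAU2-Bench: we follow the official setup, except that in the airline domain all models are evaluated with the fixes proposed in the Claude Opus 4.5 system card.
* MCPMark: the GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
* Search Agent: most search agents built on our model use a simple context-folding strategy (256k): once the cumulative length of tool responses reaches a preset threshold, earlier tool responses are pruned from the history to keep the context within limits.
* BrowseComp: we tested two strategies. Simple context folding scored 69.0, while the discard-all strategy also used by DeepSeek-V3.2 and Kimi K2.5 scored 78.6.
* WideSearch: we use a 256k context window without any context management.
* MMLU-ProX: we report the average accuracy over 29 languages.
* WMT24++: a harder subset of WMT24 obtained after difficulty labeling and rebalancing; we report the average score over 55 languages using XCOMET-XXL.
* MAXIFE: we report accuracy on English plus multilingual original prompts (23 settings in total).
* Empty cells (--) indicate that a score is not yet available or not applicable.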
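
The context-folding strategy described in the Search Agent note can be sketched roughly as follows. This is a minimal illustration, not the evaluation code: the helper name fold_context, the whitespace-based token counter, and the placeholder text are all assumptions. The sketch replaces pruned tool responses with a short placeholder instead of deleting the messages, on the assumption that the tool-call structure of the history should stay intact.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def fold_context(
    messages: List[Message],
    max_tool_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
    placeholder: str = "[tool response pruned]",
) -> List[Message]:
    """Prune the oldest tool responses until their cumulative size fits the budget.

    `count_tokens` is a stand-in for a real tokenizer (assumption).
    The input list is not modified; a folded copy is returned.
    """
    folded = [dict(m) for m in messages]
    tool_indices = [i for i, m in enumerate(folded) if m["role"] == "tool"]
    total = sum(count_tokens(folded[i]["content"]) for i in tool_indices)

    for i in tool_indices:  # oldest tool responses first
        if total <= max_tool_tokens:
            break
        total -= count_tokens(folded[i]["content"])
        folded[i]["content"] = placeholder
        total += count_tokens(placeholder)
    return folded


history = [
    {"role": "user", "content": "Find the answer."},
    {"role": "tool", "content": "a b c d e"},
    {"role": "tool", "content": "f g h i j"},
    {"role": "tool", "content": "k l"},
]
folded_history = fold_context(history, max_tool_tokens=10)
print([m["content"] for m in folded_history])
```

In practice the budget would be expressed in model tokens (for example, a threshold below the 256k window) and counted with the model's tokenizer.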

Vision Language

	GPT5.2	Claude 4.5 Opus	Gemini-3 Pro	Qwen3-VL-235B-A22B	K2.5-1T-A32B	Qwen3.5-397B-A17B
STEM and Puzzle
MMMU	86.7	80.7	87.2	80.6	84.3	85.0
MMMU-Pro	79.5	70.6	81.0	69.3	78.5	79.0
MathVision	83.0	74.3	86.6	74.6	84.2	88.6
Mathvista(mini)	83.1	80.0	87.9	85.8	90.1	90.3
We-Math	79.0	70.0	86.9	74.8	84.7	87.9
DynaMath	86.8	79.7	85.1	82.8	84.4	86.3
ZEROBench	9	3	10	4	9	12
ZEROBench_sub	33.2	28.4	39.0	28.4	33.5	41.0
BabyVision	34.4	14.2	49.7	22.2	36.5	52.3/43.3
General VQA
RealWorldQA	83.3	77.0	83.3	81.3	81.0	83.9
MMStar	77.1	73.2	83.1	78.7	80.5	83.8
HallusionBench	65.2	64.1	68.6	66.7	69.8	71.4
MMBench_EN-DEV-v1.1	88.2	89.2	93.7	89.7	94.2	93.7
SimpleVQA	55.8	65.7	73.2	61.3	71.2	67.1
Text Recognition and Document Understanding
OmniDocBench1.5	85.7	87.7	88.5	84.5	88.8	90.8
CharXiv(RQ)	82.1	68.5	81.4	66.1	77.5	80.8
MMLongBench-Doc	--	61.9	60.5	56.2	58.5	61.5
CC-OCR	70.3	76.9	79.0	81.5	79.7	82.0
AI2D_TEST	92.2	87.7	94.1	89.2	90.8	93.9
OCRBench	80.7	85.8	90.4	87.5	92.3	93.1
Spatial Intelligence
ERQA	59.8	46.8	70.5	52.5	--	67.5
CountBench	91.9	90.6	97.3	93.7	94.1	97.2
RefCOCO(avg)	--	--	84.1	91.1	87.8	92.3
ODInW13	--	--	46.3	43.2	--	47.0
EmbSpatialBench	81.3	75.7	61.2	84.3	77.4	84.5
RefSpatialBench	--	--	65.5	69.9	--	73.6
LingoQA	68.8	78.8	72.8	66.8	68.2	81.6
V*	75.9	67.0	88.0	85.9	77.0	95.8/91.1
Hypersim	--	--	--	11.0	--	12.5
SUNRGBD	--	--	--	34.9	--	38.3
Nuscene	--	--	--	13.9	--	16.0
Video Understanding
VideoMME (w/ sub.)	86	77.6	88.4	83.8	87.4	87.5
VideoMME (w/o sub.)	85.8	81.4	87.7	79.0	83.2	83.7
VideoMMMU	85.9	84.4	87.6	80.0	86.6	84.7
MLVU (M-Avg)	85.6	81.7	83.0	83.8	85.0	86.7
MVBench	78.1	67.2	74.1	75.2	73.5	77.6
LVBench	73.7	57.3	76.2	63.6	75.9	75.5
MMVU	80.8	77.3	77.5	71.1	80.4	75.4
Visual Agent
ScreenSpot Pro	--	45.7	72.7	62.0	--	65.6
OSWorld-Verified	38.2	66.3	--	38.1	63.3	62.2
AndroidWorld	--	--	--	63.7	--	66.8
Medical VQA
SLAKE	76.9	76.4	81.3	72.5	81.6	79.9
PMC-VQA	58.9	59.9	62.3	56.1	63.3	64.2
MedXpertQA-MM	73.3	63.6	76.0	47.6	65.3	70.0

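* MathVision: our model's score is evaluated with a fixed prompt, e.g., "Please reason step by step, and put your final answer within \boxed{}." For other models, we report the higher score between runs with and without the \boxed{} format.
* BabyVision: our model's score is reported with the code interpreter (CI) enabled; without CI, the result is 43.3.
* V*: our model's score is reported with the code interpreter (CI) enabled; without CI, the result is 91.1.
* Empty cells (--) indicate that a score is not yet available or not applicable.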

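Quickstart

[!Important] Qwen3.5 models operate in thinking mode by default, generating thinking content delimited by <think>\n...</think>\n\n before producing the final response. To disable thinking content and obtain a direct response, refer to the example in the instruct (or non-thinking) mode section.

For streamlined integration, we recommend using Qwen3.5 via APIs. Below is a guide to using Qwen3.5 through an OpenAI-compatible API.

Serving Qwen3.5

Qwen3.5 can be served via APIs with popular inference frameworks. The following sections show example commands for launching OpenAI-compatible API servers for Qwen3.5 models.

[!Important] Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers, or vLLM are strongly recommended.

[!Important] The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise keeping a context length of at least 128K tokens to preserve its thinking capabilities.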

SGLang

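SGLang is a fast serving framework for large language models and vision language models. Qwen3.5 requires SGLang from the main branch of the open-source repository, which can be installed in a fresh environment with the following command: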

uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'

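For more details, see its documentation.

The following commands create API endpoints at http://localhost:8000/v1:

Standard version: the following command creates an API endpoint with a maximum context length of 262,144 tokens, using tensor parallelism across 8 GPUs.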

python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3

Tool use: to support tool use, use the following command.

python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder

Multi-Token Prediction (MTP): the following command is recommended for MTP:

python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

vLLM

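vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs). Qwen3.5 requires vLLM from the main branch of the open-source repository, which can be installed in a fresh environment with the following command: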

uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

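For more details, see its documentation.

For a detailed Qwen3.5 usage guide, see the vLLM Qwen3.5 recipe.

The following commands create API endpoints at http://localhost:8000/v1:

Standard version: the following command creates an API endpoint with a maximum context length of 262,144 tokens, using tensor parallelism across 8 GPUs.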
```
vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
```

Tool use: to support tool use, use the following command.

vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

Multi-Token Prediction (MTP): the following command is recommended for MTP:

vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Text-only mode: the following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:

vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only

KTransformers

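KTransformers is a flexible framework for experiencing cutting-edge LLM inference optimizations through CPU-GPU heterogeneous computing. To run Qwen3.5 with KTransformers, see the KTransformers deployment guide.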

Hugging Face Transformers

Hugging Face Transformers includes a lightweight server that can be used for quick testing and moderate-load deployment. Running Qwen3.5 requires the latest version of transformers:

pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"

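For more details, see its documentation. Please also make sure that torchvision and pillow are installed.

Then run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; the server places the model on accelerators if any are available: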

transformers serve --force-model Qwen/Qwen3.5-397B-A17B --port 8000 --continuous-batching

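Using Qwen3.5 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure the SDK is installed and that the API key and API base URL are configured, e.g.: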

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

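[!Tip] We recommend the following sets of sampling parameters for generation

Thinking mode: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Instruct (or non-thinking) mode: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that support for sampling parameters varies across inference frameworks.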
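
To avoid repeating these values in every call, the two presets can be kept in one place. The sketch below is one possible arrangement, not an official helper: the names RECOMMENDED, STANDARD_FIELDS, and sampling_kwargs are hypothetical. It passes temperature, top_p, and presence_penalty as named arguments of the OpenAI Python SDK and routes the remaining fields through extra_body, in the same way the examples in this document pass top_k. Whether a given server honors min_p or repetition_penalty depends on the framework.

```python
from typing import Any, Dict

# Presets copied from the recommendation above.
RECOMMENDED: Dict[str, Dict[str, float]] = {
    "thinking": {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0,
    },
    "instruct": {
        "temperature": 0.7, "top_p": 0.8, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
}

# Fields passed as named arguments of chat.completions.create in this sketch;
# everything else is sent through extra_body.
STANDARD_FIELDS = {"temperature", "top_p", "presence_penalty"}


def sampling_kwargs(mode: str) -> Dict[str, Any]:
    """Split a preset into SDK keyword arguments and an extra_body dict."""
    preset = RECOMMENDED[mode]
    kwargs: Dict[str, Any] = {k: v for k, v in preset.items() if k in STANDARD_FIELDS}
    kwargs["extra_body"] = {k: v for k, v in preset.items() if k not in STANDARD_FIELDS}
    return kwargs


print(sampling_kwargs("thinking"))
```

A call would then look like client.chat.completions.create(model=..., messages=..., **sampling_kwargs("thinking")).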

Text-Only Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.5\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)
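
The same request can also be sent over plain HTTP, without the SDK. The sketch below only builds the request with the Python standard library and leaves sending it to the reader. The helper name build_chat_request is hypothetical, and the sketch assumes the server exposes the usual /chat/completions route under the base URL. Fields that the SDK passes through extra_body, such as top_k, sit at the top level of the JSON body.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, **params):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "EMPTY",
    "Qwen/Qwen3.5-397B-A17B",
    [{"role": "user", "content": "Type \"I love Qwen3.5\" backwards"}],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
)
print(request.full_url)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```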

Image Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

Video Input

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    }, 
)

print("Chat response:", chat_response)

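Instruct (or Non-Thinking) Mode

[!Important] Qwen3.5 does not officially support the soft switch of Qwen3, i.e., /think and /nothink.

Qwen3.5 thinks by default before responding. You can obtain a direct response without thinking by configuring the API parameters. For example,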

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)

[!Note] If you are using the Alibaba Cloud Model Studio API, in addition to changing model, use "enable_thinking": False instead of "chat_template_kwargs": {"enable_thinking": False}.
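
The difference between the two endpoints can be captured in a small helper. This is only a sketch: the function name thinking_extra_body and the backend labels "model_studio" and "self_hosted" are hypothetical, chosen here to distinguish the hosted API from a vLLM or SGLang deployment.

```python
def thinking_extra_body(enable_thinking: bool, backend: str) -> dict:
    """Return the extra_body fragment that toggles thinking for a given backend."""
    if backend == "model_studio":
        # Alibaba Cloud Model Studio takes the flag at the top level.
        return {"enable_thinking": enable_thinking}
    if backend == "self_hosted":
        # vLLM / SGLang style servers take it through the chat template arguments.
        return {"chat_template_kwargs": {"enable_thinking": enable_thinking}}
    raise ValueError(f"unknown backend: {backend!r}")


print(thinking_extra_body(False, "self_hosted"))
print(thinking_extra_body(False, "model_studio"))
```

The returned dictionary can be merged with other extra_body fields such as top_k before calling client.chat.completions.create.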

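Agentic Usage

Qwen3.5 excels in tool-calling capabilities.

Qwen-Agent

We recommend using Qwen-Agent to quickly build agent applications on top of Qwen3.5.

To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.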

import os
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    # Use the OpenAI-compatible model service provided by DashScope:
    'model': 'Qwen3.5-397B-A17B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),

    'generate_cfg': {
        'use_raw_api': True,
        # When using the DashScope OAI API, pass the parameter that enables thinking mode in this way
        'extra_body': {
            'enable_thinking': True
        },
    },
}

# Using OpenAI-compatible API endpoint.
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations.
#
# llm_cfg = {
#     # Use your own model service compatible with OpenAI API by vLLM/SGLang:
#     'model': 'Qwen/Qwen3.5-397B-A17B',
#     'model_type': 'qwenvl_oai',
#     'model_server': 'http://localhost:8000/v1',  # api_base
#     'api_key': 'EMPTY',
#
#     'generate_cfg': {
#         'use_raw_api': True,
#         # When using vLLM/SGLang OAI API, pass the parameter of whether to enable thinking mode in this way
#         'extra_body': {
#             'chat_template_kwargs': {'enable_thinking': True}
#         },
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
            }
        }
    }
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'Help me organize my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

# Streaming generation
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the desktop'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Qwen Code

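Qwen Code is an open-source AI agent for the terminal, optimized for Qwen models. It helps you understand large codebases, automate tedious work, and ship faster.

For more information, please refer to Qwen Code.

Processing Ultra-Long Texts

Qwen3.5 natively supports context lengths of up to 262,144 tokens. For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques such as YaRN to handle long texts effectively.

YaRN is currently supported by several inference frameworks, e.g., transformers, vllm, ktransformers, and sglang. In general, there are two ways to enable YaRN in supported frameworks:

Modifying the model configuration file: in the config.json file, change the rope_parameters field in text_config to: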

{
    "mrope_interleaved": true,
    "mrope_section": [
        11,
        11,
        10
    ],
    "rope_type": "yarn",
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "factor": 4.0,
    "original_max_position_embeddings": 262144
}

Passing command-line arguments:

For vllm, you can use

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000

For sglang and ktransformers, you can use

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000

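[!NOTE] All notable open-source frameworks implement static YaRN, which means the scaling factor stays constant regardless of input length and may affect performance on shorter texts. We advise modifying the rope_parameters configuration only when long contexts need to be processed. We also recommend adjusting factor as needed. For example, if the typical context length for your application is 524,288 tokens, setting factor to 2.0 would be more appropriate.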
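
The relationship between the target context length and factor can be sketched as below. The rule used here, dividing the target length by the native 262,144 tokens and rounding up to a whole number, is an assumption that happens to reproduce both values in this document (2.0 for 524,288 tokens and 4.0 for 1,010,000 tokens); it is not an official formula. The helper names yarn_factor and yarn_overrides are hypothetical.

```python
import json
import math

NATIVE_CONTEXT = 262_144  # native context length stated in this document


def yarn_factor(target_context: int) -> float:
    """Heuristic (assumption): smallest whole factor covering the target length."""
    return float(max(1, math.ceil(target_context / NATIVE_CONTEXT)))


def yarn_overrides(target_context: int) -> str:
    """Build the JSON override string with the rope_parameters shown above."""
    rope_parameters = {
        "mrope_interleaved": True,
        "mrope_section": [11, 11, 10],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": yarn_factor(target_context),
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }
    return json.dumps({"text_config": {"rope_parameters": rope_parameters}})


print(yarn_factor(524_288))
print(yarn_overrides(1_010_000))
```

The resulting string is the kind of value passed to --hf-overrides for vllm or --json-model-override-args for sglang in the commands above. A factor of 1.0 means the target fits within the native window, in which case the configuration should be left unchanged.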

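Best Practices

To achieve optimal performance, we recommend the following settings:

Sampling parameters:
- For thinking mode, we suggest Temperature=0.6, TopP=0.95, TopK=20, and MinP=0; for non-thinking mode, we suggest Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
- In supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetition. Note, however, that higher values may occasionally cause language mixing and a slight drop in model performance.
Adequate output length: we recommend an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those in math and programming competitions, we suggest setting the maximum output length to 81,920 tokens. This gives the model enough room to generate detailed and comprehensive responses, which improves overall performance.
Standardized output format: when benchmarking, we recommend using prompts to standardize model outputs.
- Math problems: include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
- Multiple-choice questions: add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
No thinking content in history: in multi-turn conversations, the historical model output should include only the final output and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. For frameworks that do not use the Jinja2 chat template directly, however, it is up to developers to ensure that this best practice is followed.
Long video understanding: to optimize inference efficiency for plain text and images, the size parameter in the released video_preprocessor_config.json is configured conservatively. We recommend setting the longest_edge parameter in the video preprocessor config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve better performance. For example: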
```
{"longest_edge": 469762048, "shortest_edge": 4096}
```
Alternatively, override the default values via engine startup parameters. For implementation details, refer to: vLLM / SGLang.
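
For frameworks that do not apply the Jinja2 chat template, the "no thinking content in history" practice above could be approximated on the client side as follows. This is a minimal sketch under stated assumptions: it expects raw assistant text in which the thinking content ends with a </think> tag, and the helper names strip_thinking and clean_history are hypothetical. Servers launched with a reasoning parser may already return the final answer separately, in which case no stripping is needed.

```python
from typing import Dict, List

THINK_END = "</think>"


def strip_thinking(content: str) -> str:
    """Keep only the text after the last closing think tag, if one is present."""
    if THINK_END in content:
        content = content.rsplit(THINK_END, 1)[1].lstrip("\n")
    return content


def clean_history(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of the history with thinking removed from assistant turns."""
    cleaned = []
    for message in messages:
        message = dict(message)
        if message.get("role") == "assistant" and isinstance(message.get("content"), str):
            message["content"] = strip_thinking(message["content"])
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\nSimple addition.\n</think>\n\n4"},
    {"role": "user", "content": "And times 3?"},
]
cleaned_history = clean_history(history)
print(cleaned_history[1]["content"])
```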

Citation

If you find our work helpful, feel free to cite us.

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}
