本文介绍 DeepSeek-V3,这是一个拥有 6710 亿个参数的强大混合专家 (MoE) 语言模型,每个标记激活 370 亿个参数。为了实现高效的推理和经济高效的训练,DeepSeek-V3 采用了多头潜伏注意力 (MLA) 和 DeepSeekMoE 架构,这些架构已经在 DeepSeek-V2 中得到了充分验证。此外,DeepSeek-V3 开创性地采用了一种无辅助损失的负载均衡策略,并设定了一个多标记预测训练目标,以获得更强的性能。 我们在 14.8 万亿个多样且高质量的标记上对 DeepSeek-V3 进行了预训练,随后进行了监督式微调和强化学习阶段,以充分发挥其能力。综合评估表明,DeepSeek-V3 的性能优于其他开源模型,并且达到了与领先的闭源模型相当的水平。 尽管性能出色,但 DeepSeek-V3 的完整训练只需要 278.8 万个 H800 GPU 小时。此外,其训练过程非常稳定。在整个训练过程中,我们没有遇到任何不可恢复的损失峰值或执行任何回滚操作。
架构:创新的负载均衡策略和训练目标
预训练:走向极致的训练效率
后训练:从 DeepSeek-R1 中蒸馏知识
| 模型 | 总参数数 | 激活参数数 | 上下文长度 | 下载 |
|---|---|---|---|---|
| DeepSeek-V3-Base | 6710亿 | 370亿 | 12.8万 | 🤗 HuggingFace |
| DeepSeek-V3 | 6710亿 | 370亿 | 12.8万 | 🤗 HuggingFace |
注意:DeepSeek-V3 模型在 HuggingFace 上的总大小为 6850 亿,其中包括 6710 亿的主模型权重和 140 亿的多标记预测 (MTP) 模块权重。
为了确保最佳性能和灵活性,我们与开源社区和硬件供应商合作,提供了多种方式可以在本地运行该模型。有关分步指南,请查看第 6 节:如何在本地运行。
对于希望深入研究的开发人员,我们建议查看 README_WEIGHTS.md,了解主模型权重和多标记预测 (MTP) 模块的详细信息。请注意,MTP 支持目前正在社区中积极开发,我们欢迎您的贡献和反馈。
| Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 | |
|---|---|---|---|---|---|---|
| Architecture | - | MoE | Dense | Dense | MoE | |
| # Activated Params | - | 21B | 72B | 405B | 37B | |
| # Total Params | - | 236B | 72B | 405B | 671B | |
| English | Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
| BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 | |
| MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 | |
| MMLU-Redux (Acc.) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2 | |
| MMLU-Pro (Acc.) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4 | |
| DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0 | |
| ARC-Easy (Acc.) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9 | |
| ARC-Challenge (Acc.) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3 | |
| HellaSwag (Acc.) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9 | |
| PIQA (Acc.) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7 | |
| WinoGrande (Acc.) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9 | |
| RACE-Middle (Acc.) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1 | |
| RACE-High (Acc.) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3 | |
| TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9 | |
| NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0 | |
| AGIEval (Acc.) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6 | |
| Code | HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2 |
| MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4 | |
| LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4 | |
| CRUXEval-I (Acc.) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3 | |
| CRUXEval-O (Acc.) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8 | |
| Math | GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3 |
| MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6 | |
| MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8 | |
| CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7 | |
| Chinese | CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7 |
| C-Eval (Acc.) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1 | |
| CMMLU (Acc.) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8 | |
| CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3 | |
| C3 (Acc.) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6 | |
| CCPM (Acc.) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 | |
| Multilingual | MMMLU-non-English (Acc.) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |
Note: Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks.
For more evaluation details, please check our paper.
Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.
| Benchmark (Metric) | DeepSeek V2-0506 | DeepSeek V2.5-0905 | Qwen2.5 72B-Inst. | Llama3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | |
|---|---|---|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | - | - | MoE | |
| # Activated Params | 21B | 21B | 72B | 405B | - | - | 37B | |
| # Total Params | 236B | 236B | 72B | 405B | - | - | 671B | |
| English | MMLU (EM) | 78.2 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5 |
| MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 | 89.1 | |
| MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9 | |
| DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6 | |
| IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 | 86.1 | |
| GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 | 59.1 | |
| SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 | 24.9 | |
| FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 | 73.3 | |
| LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | 48.7 | |
| Code | HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6 |
| LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | 40.5 | |
| LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | 37.6 | |
| Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6 | |
| SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0 | |
| Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7 | |
| Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6 | |
| Math | AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2 |
| MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2 | |
| CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2 | |
| Chinese | CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9 |
| C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5 | |
| C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8 |
Note: All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.
| Model | Arena-Hard | AlpacaEval 2.0 |
|---|---|---|
| DeepSeek-V2.5-0905 | 76.2 | 50.5 |
| Qwen2.5-72B-Instruct | 81.2 | 49.1 |
| LLaMA-3.1 405B | 69.3 | 40.5 |
| GPT-4o-0513 | 80.4 | 51.1 |
| Claude-Sonnet-3.5-1022 | 85.2 | 52.0 |
| DeepSeek-V3 | 85.5 | 70.0 |
Note: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.
You can chat with DeepSeek-V3 on DeepSeek's official website: chat.deepseek.com
We also provide OpenAI-Compatible API at DeepSeek Platform: platform.deepseek.com
DeepSeek-V3 can be deployed locally using the following hardware and open-source community software:
Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation.
Here is an example of converting FP8 weights to BF16:
cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights注意:目前还不直接支持 Huggingface 的 Transformers。
首先,克隆我们的 DeepSeek-V3 GitHub 存储库:
git clone https://github.com/deepseek-ai/DeepSeek-V3.git进入 inference 文件夹,并安装 requirements.txt 中列出的依赖项。
cd DeepSeek-V3/inference
pip install -r requirements.txt从 HuggingFace 下载模型权重,并将它们放入 /path/to/DeepSeek-V3 文件夹中。
将 HuggingFace 模型权重转换为特定格式:
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16然后,您就可以与 DeepSeek-V3 聊天了:
torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200或者对给定文件进行批量推断:
torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILESGLang 目前支持 MLA 优化、FP8 (W8A8)、FP8 KV 缓存和 Torch Compile,在开源框架中提供最先进的延迟和吞吐量性能。
值得注意的是,SGLang v0.4.1 完全支持在 NVIDIA 和 AMD GPU 上运行 DeepSeek-V3,使其成为一个高度通用且强大的解决方案。
以下是 SGLang 团队的启动说明:https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
LMDeploy 是一款灵活且高性能的推理和服务框架,专为大型语言模型定制,现在支持 DeepSeek-V3。它提供离线管道处理和在线部署功能,无缝集成 PyTorch 基于的工作流。
有关使用 LMDeploy 运行 DeepSeek-V3 的全面分步骤说明,请参考此处:https://github.com/InternLM/lmdeploy/issues/2960
TensorRT-LLM 现在支持 DeepSeek-V3 模型,提供 BF16 和 INT4/INT8 权重Only 等精度选项。FP8 的支持目前正在进行中,并将很快发布。您可以通过以下链接访问 TRTLLM 的自定义分支,该分支专为 DeepSeek-V3 支持而设计,以便直接体验新功能:https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3。
vLLM v0.6.6 支持在 NVIDIA 和 AMD GPU 上使用 FP8 和 BF16 模式进行 DeepSeek-V3 推理。除了标准技术外,vLLM 还提供 管道并行 功能,使您可以将此模型运行在多个通过网络连接的计算机上。有关详细指南,请参考vLLM 指令。也欢迎您关注 增强计划。
与 AMD 团队合作,我们实现了使用 SGLang 对 AMD GPU 的第一天支持,并完全兼容 FP8 和 BF16 精度。有关详细指南,请参考SGLang 指令。
华为昇腾社区的 MindIE 框架已成功适配了 DeepSeek-V3 的 BF16 版本。有关昇腾 NPU 的分步骤指南,请遵循此处说明。
此代码仓库在 MIT 许可证 下获得许可。DeepSeek-V3 Base/Chat 模型的使用受 模型许可证 的约束。DeepSeek-V3 系列(包括 Base 和 Chat)支持商业用途。
@misc{deepseekai2024deepseekv3technicalreport,
title={DeepSeek-V3 Technical Report},
author={DeepSeek-AI},
year={2024},
eprint={2412.19437},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.19437},
}如果您有任何疑问,请创建 issue 或发送电子邮件至 service@deepseek.com。