We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, both thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-trained DeepSeek-V3 on 14.8 trillion diverse, high-quality tokens, followed by supervised fine-tuning and reinforcement learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, the full training of DeepSeek-V3 required only 2.788M H800 GPU hours. Its training process was also remarkably stable: throughout the entire run, we experienced no irrecoverable loss spikes and performed no rollbacks.
- **Architecture**: innovative load-balancing strategy and training objective
- **Pre-training**: towards ultimate training efficiency
- **Post-training**: knowledge distillation from DeepSeek-R1
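To make the auxiliary-loss-free load-balancing idea concrete, the toy sketch below adds a per-expert bias to the gating scores when selecting the top-k experts, then nudges the bias down for overloaded experts and up for underloaded ones after each batch. The scores, the update rule, and the step size here are illustrative simplifications, not the exact formulation in the technical report.

```python
# Toy sketch of bias-based (auxiliary-loss-free) expert load balancing.
# The bias only affects which experts are selected, not the gating weights,
# so no auxiliary loss term is needed to spread the load.

def route(scores, bias, k):
    """Pick the top-k experts per token using bias-adjusted scores."""
    routed = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        topk = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]
        routed.append(topk)
    return routed

def update_bias(routed, bias, num_experts, gamma=0.1):
    """Decrease the bias of overloaded experts, increase it for underloaded ones."""
    loads = [0] * num_experts
    for topk in routed:
        for e in topk:
            loads[e] += 1
    avg = sum(loads) / num_experts
    return [b - gamma if load > avg else b + gamma for b, load in zip(bias, loads)]

# Toy example: 4 tokens whose raw scores all favor expert 0, top-1 routing.
scores = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.3, 0.2], [0.6, 0.4, 0.3]]
bias = [0.0, 0.0, 0.0]
for _ in range(20):
    routed = route(scores, bias, k=1)
    bias = update_bias(routed, bias, num_experts=3)
print(routed)  # the bias compensates for the skewed scores, spreading the load
```

After a few iterations the bias terms offset the skew in the raw scores, so tokens no longer all collapse onto the single highest-scoring expert.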
| Model | #Total Params | #Activated Params | Context Length | Download |
|---|---|---|---|---|
| DeepSeek-V3-Base | 671B | 37B | 128K | 🤗 HuggingFace |
| DeepSeek-V3 | 671B | 37B | 128K | 🤗 HuggingFace |

Note: The total size of the DeepSeek-V3 models on HuggingFace is 685B, which includes 671B of main-model weights and 14B of multi-token prediction (MTP) module weights.
To ensure optimal performance and flexibility, we have partnered with open-source communities and hardware vendors to provide multiple ways to run the model locally. For step-by-step guidance, check out Section 6: How to Run Locally.

For developers looking to dive deeper, we recommend exploring README_WEIGHTS.md for details on the main-model weights and the multi-token prediction (MTP) modules. Please note that MTP support is currently under active development within the community, and we welcome your contributions and feedback.
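As a rough illustration of what the MTP objective asks of the model, the sketch below builds the training targets: besides the standard next-token target, each position is also paired with tokens further ahead in the sequence. This shows only target construction, not the architecture of the MTP modules themselves.

```python
# Illustrative sketch of multi-token prediction (MTP) targets. In addition to
# the usual next-token target (depth 1), each position also gets targets for
# tokens further ahead (depth 2, 3, ...). Depth and pairing are illustrative.

def mtp_targets(tokens, depth):
    """For each prediction depth d in 1..depth, pair position i with token i+d."""
    targets = {}
    for d in range(1, depth + 1):
        targets[d] = [(i, tokens[i + d]) for i in range(len(tokens) - d)]
    return targets

seq = ["The", "cat", "sat", "on", "the", "mat"]
t = mtp_targets(seq, depth=2)
print(t[1])  # depth-1 pairs: the standard next-token objective
print(t[2])  # depth-2 pairs: each position also predicts two tokens ahead
```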
| | Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 |
|---|---|---|---|---|---|---|
| | Architecture | - | MoE | Dense | Dense | MoE |
| | # Activated Params | - | 21B | 72B | 405B | 37B |
| | # Total Params | - | 236B | 72B | 405B | 671B |
| English | Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
| | BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 |
| | MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 |
| | MMLU-Redux (Acc.) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2 |
| | MMLU-Pro (Acc.) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4 |
| | DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0 |
| | ARC-Easy (Acc.) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9 |
| | ARC-Challenge (Acc.) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3 |
| | HellaSwag (Acc.) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9 |
| | PIQA (Acc.) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7 |
| | WinoGrande (Acc.) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9 |
| | RACE-Middle (Acc.) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1 |
| | RACE-High (Acc.) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3 |
| | TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9 |
| | NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0 |
| | AGIEval (Acc.) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6 |
| Code | HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2 |
| | MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4 |
| | LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4 |
| | CRUXEval-I (Acc.) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3 |
| | CRUXEval-O (Acc.) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8 |
| Math | GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3 |
| | MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6 |
| | MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8 |
| | CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7 |
| Chinese | CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7 |
| | C-Eval (Acc.) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1 |
| | CMMLU (Acc.) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8 |
| | CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3 |
| | C3 (Acc.) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6 |
| | CCPM (Acc.) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 |
| Multilingual | MMMLU-non-English (Acc.) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |
Note: Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks.
For more evaluation details, please check our paper.
Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.
| | Benchmark (Metric) | DeepSeek V2-0506 | DeepSeek V2.5-0905 | Qwen2.5 72B-Inst. | LLaMA3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 |
|---|---|---|---|---|---|---|---|---|
| | Architecture | MoE | MoE | Dense | Dense | - | - | MoE |
| | # Activated Params | 21B | 21B | 72B | 405B | - | - | 37B |
| | # Total Params | 236B | 236B | 72B | 405B | - | - | 671B |
| English | MMLU (EM) | 78.2 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5 |
| | MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 | 89.1 |
| | MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9 |
| | DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6 |
| | IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 | 86.1 |
| | GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 | 59.1 |
| | SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 | 24.9 |
| | FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 | 73.3 |
| | LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | 48.7 |
| Code | HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6 |
| | LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | 40.5 |
| | LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | 37.6 |
| | Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6 |
| | SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0 |
| | Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7 |
| | Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6 |
| Math | AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2 |
| | MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2 |
| | CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2 |
| Chinese | CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9 |
| | C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5 |
| | C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8 |
Note: All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.
| Model | Arena-Hard | AlpacaEval 2.0 |
|---|---|---|
| DeepSeek-V2.5-0905 | 76.2 | 50.5 |
| Qwen2.5-72B-Instruct | 81.2 | 49.1 |
| LLaMA-3.1 405B | 69.3 | 40.5 |
| GPT-4o-0513 | 80.4 | 51.1 |
| Claude-Sonnet-3.5-1022 | 85.2 | 52.0 |
| DeepSeek-V3 | 85.5 | 70.0 |
Note: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.
You can chat with DeepSeek-V3 on DeepSeek's official website: chat.deepseek.com
We also provide OpenAI-Compatible API at DeepSeek Platform: platform.deepseek.com
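Since the platform exposes an OpenAI-compatible API, a chat request can be sketched as below. The endpoint URL and the model name `deepseek-chat` are assumptions based on the OpenAI chat-completions convention; consult the DeepSeek Platform documentation for the authoritative values.

```python
# Sketch of a request to the OpenAI-compatible chat endpoint. The URL and the
# model identifier are assumptions; check platform.deepseek.com for the
# authoritative values before use.
import json

def build_chat_request(api_key, user_message):
    url = "https://api.deepseek.com/chat/completions"  # assumed endpoint
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": "deepseek-chat",  # assumed model identifier
        "messages": [{"role": "user", "content": user_message}],
    })
    return url, headers, body

url, headers, body = build_chat_request("sk-...", "Hello!")
# Send with any HTTP client, e.g. urllib.request.Request(url, body.encode(), headers)
```

Because the API follows the OpenAI schema, existing OpenAI client libraries should also work by pointing their base URL at the DeepSeek Platform.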
DeepSeek-V3 can be deployed locally using the following hardware and open-source community software:
Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation.
Here is an example of converting FP8 weights to BF16:

```shell
cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
```

Note: Hugging Face's Transformers has not been directly supported yet.
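Conceptually, the conversion recovers higher-precision values by multiplying each tile of the quantized weight matrix by its stored inverse scale. The sketch below illustrates this block-wise dequantization; the tile size and data layout are illustrative, not the actual parameters of the conversion script.

```python
# Conceptual sketch of block-wise FP8 dequantization: each (block x block) tile
# of the weight matrix is multiplied by its stored inverse scale. The block
# size and layout here are illustrative, not the script's actual parameters.

def dequantize(weight, scale_inv, block=2):
    """weight: 2D list of quantized values; scale_inv: per-tile inverse scales."""
    rows, cols = len(weight), len(weight[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            s = scale_inv[r // block][c // block]
            out[r][c] = weight[r][c] * s
    return out

w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
s = [[0.5, 2.0]]  # one inverse scale per 2x2 tile
print(dequantize(w, s))  # → [[0.5, 1.0, 6.0, 8.0], [2.5, 3.0, 14.0, 16.0]]
```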
First, clone our DeepSeek-V3 GitHub repository:

```shell
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
```

Navigate to the `inference` folder and install the dependencies listed in `requirements.txt`:

```shell
cd DeepSeek-V3/inference
pip install -r requirements.txt
```

Download the model weights from HuggingFace, and put them into the `/path/to/DeepSeek-V3` folder.
Convert the HuggingFace model weights to the specific format:

```shell
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
```

Then you can chat with DeepSeek-V3:
```shell
torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
```

Or run batch inference on a given file:
```shell
torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE
```

SGLang currently supports MLA optimizations, FP8 (W8A8), FP8 KV cache, and Torch Compile, delivering state-of-the-art latency and throughput performance among open-source frameworks.
Notably, SGLang v0.4.1 fully supports running DeepSeek-V3 on both NVIDIA and AMD GPUs, making it a highly versatile and robust solution.

Here are the launch instructions from the SGLang team: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
LMDeploy, a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, integrating seamlessly with PyTorch-based workflows.

For comprehensive step-by-step instructions on running DeepSeek-V3 with LMDeploy, please refer to: https://github.com/InternLM/lmdeploy/issues/2960
TensorRT-LLM now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only. Support for FP8 is currently in progress and will be released soon. You can access the custom branch of TRT-LLM with dedicated DeepSeek-V3 support through the following link to experience the new features directly: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3
vLLM v0.6.6 supports DeepSeek-V3 inference in both FP8 and BF16 modes on NVIDIA and AMD GPUs. In addition to standard techniques, vLLM offers pipeline parallelism, allowing you to run this model on multiple machines connected by networks. For detailed guidance, please refer to the vLLM instructions. Please also feel free to follow the enhancement plan.

In collaboration with the AMD team, we have achieved day-one support for AMD GPUs using SGLang, with full compatibility for both FP8 and BF16 precision. For detailed guidance, please refer to the SGLang instructions.

The MindIE framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3. For step-by-step guidance on Ascend NPUs, please follow the instructions here.
This code repository is licensed under the MIT License. The use of DeepSeek-V3 Base/Chat models is subject to the Model License. The DeepSeek-V3 series (including Base and Chat) supports commercial use.
```bibtex
@misc{deepseekai2024deepseekv3technicalreport,
      title={DeepSeek-V3 Technical Report},
      author={DeepSeek-AI},
      year={2024},
      eprint={2412.19437},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.19437},
}
```

If you have any questions, please raise an issue or contact us at service@deepseek.com.