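Infinity Instruct

Beijing Academy of Artificial Intelligence (BAAI)
[Paper][Code][🤗] (coming soon)

The quality and scale of instruction data are crucial to model performance. Current open-source models commonly rely on fine-tuning datasets of millions of samples, which places heavy demands on both data quality and data quantity. Building a high-quality instruction fine-tuning dataset at that scale is expensive, and this cost has long held back related research and applications in the open-source community. To close this gap, we introduce the Infinity Instruct project, which aims to build a large-scale, high-quality instruction dataset.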

GPT-4 automatic evaluation

Model	MT-Bench	AlpacaEval2.0	Arena-hard
GPT-4-omni	--	57.5	74.9
GPT-4-1106	9.3	50.0	--
GPT-4-0314	9.0	35.3	50.0
GPT-4-0613	9.2	30.2	37.9
Gemini Pro	--	24.4	17.8
Mixtral 8x7B v0.1	8.3	23.7	23.4
Mistral-7B-Instruct-v0.2	7.6	17.1	--
Infinity-Instruct-3M-0613-Mistral-7B	8.1	25.5	--
Infinity-Instruct-3M-0625-Mistral-7B	8.1	31.4	--
Infinity-Instruct-7M-Gen-Mistral-7B	8.1	40.0	26.9
Llama-3-70B-Instruct	9.0	34.4	46.6
Llama-3.1-8B-Instruct	--	20.9	20.6
Llama-3.1-70B-Instruct	--	38.1	55.7
Llama-3.1-405B-Instruct	--	39.3	64.1
Infinity-Instruct-7M-Gen-Llama-3.1-8B	8.2	33.9	30.4
Infinity-Instruct-3M-0613-Llama-3-70B	8.7	31.5	--
Infinity-Instruct-3M-0625-Llama-3-70B	8.9	38.0	--
Infinity-Instruct-7M-Gen-Llama-3.1-70B	8.9	46.1	66.0

Performance on downstream tasks

Model	MMLU	GSM8K	HumanEval	HellaSwag	Average
GPT-3.5	70	57.1	48.1	85.5	65.2
GPT-4	86.4	92.0	67.0	95.3	85.2
Mistral-7B	56.5	48.1	14.0	35.5	38.5
Mistral-7B-Instruct-v0.2	59.6	45.9	32.9	64.4	50.7
OpenHermes-2.5-Mistral-7B	61.7	73.0	41.5	80.6	64.2
Infinity-Instruct-3M-Mistral-7B	62.9	78.1	50.6	84.8	69.1
Infinity-Instruct-7M-Mistral-7B	65.0	78.6	59.8	90.0	73.4
Infinity-Instruct-7M-Llama3.1-70B	79.1	88.0	72.0	94.6	83.4

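Overview of Infinity Instruct

To build an instruction dataset with tens of millions of high-quality samples, we collected a large amount of open-source data as seeds and iterated on it with two strategies: instruction selection and instruction evolution. Following [3], we recommend the Foundational Dataset, which contains millions of instructions selected from open-source data, to improve model performance on challenging downstream tasks such as code and math. We also recommend the Chat Dataset, which contains about 1M instructions evolved from high-quality seed data, to strengthen instruction following in real conversation scenarios. The dataset versions are listed below:

Dataset category	Dataset version	Number of instructions
Foundational Dataset	InfInstruct-3M	3,463,473
Foundational Dataset	InfInstruct-7M	7,449,106
Chat Dataset	InfInstruct-0613	362,330
Chat Dataset	InfInstruct-0625	659,808
Chat Dataset	InfInstruct-Gen (0729)	1,456,927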

How to use

You can load the Infinity-Instruct dataset and models with the following code:

## Download the dataset
from datasets import load_dataset
dataset_7M = load_dataset('BAAI/Infinity-Instruct', '7M', split='train')
dataset_Gen = load_dataset('BAAI/Infinity-Instruct', 'Gen', split='train')

## Download the models
import torch  # required for torch.bfloat16 below
from transformers import AutoModelForCausalLM, AutoTokenizer
model_llama3_1_70B = AutoModelForCausalLM.from_pretrained("BAAI/Infinity-Instruct-7M-Gen-Llama3_1-70B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer_llama3_1_70B = AutoTokenizer.from_pretrained("BAAI/Infinity-Instruct-7M-Gen-Llama3_1-70B")

model_mistral_7B = AutoModelForCausalLM.from_pretrained("BAAI/Infinity-Instruct-7M-Gen-Mistral-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer_mistral_7B = AutoTokenizer.from_pretrained("BAAI/Infinity-Instruct-7M-Gen-Mistral-7B")

Infinity-Instruct annotates every instruction with its language, ability types, source, and other information:

{"id": "3023f510b6ea9af835492b46dc171c3f",
"conversations": [{"from": "human", "value": "What is the biological term for Magic Mushrooms? Psilocybin mushrooms, commonly known as magic mushrooms, are a polyphyletic informal group of fungi that contain psilocybin which turns into psilocin upon ingestion. Biological genera containing psilocybin mushrooms include Psilocybe, Panaeolus (including Copelandia), Inocybe, Pluteus, Gymnopilus, and Pholiotina. Psilocybin mushrooms have been and continue to be used in indigenous New World cultures in religious, divinatory, or spiritual contexts. Psilocybin mushrooms are also used as recreational drugs. They may be depicted in Stone Age rock art in Africa and Europe but are most famously represented in the Pre-Columbian sculptures and glyphs seen throughout North, Central, and South America."}, {"from": "gpt", "value": "Magic Mushrooms are the colloquial term for Psilocybin mushrooms"}],
"label": {
"ability_en": ["fact checking", "knowledge query"],
"ability_zh": ["事实查询", "知识查询"],
"cate_ability_zh": ["信息处理与整合"],
"cate_ability_en": ["information processing and integration"]},
"langdetect": "en",
"source": "Subjective"}

You can use these labels to build a data subset that fits your own needs.
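As a minimal sketch of building such a subset, the snippet below filters records on the `langdetect` and `label.cate_ability_en` fields shown in the sample record. The helper names `matches` and `select_subset`, the toy records, and the second category name are illustrative assumptions, not part of any Infinity-Instruct tooling.

```python
def matches(record, lang=None, category=None):
    """Return True if the record has the requested language and ability category."""
    if lang is not None and record.get("langdetect") != lang:
        return False
    if category is not None:
        label = record.get("label") or {}  # tolerate a missing or null label
        if category not in label.get("cate_ability_en", []):
            return False
    return True


def select_subset(records, lang=None, category=None):
    """Keep only the records that satisfy every requested condition."""
    return [r for r in records if matches(r, lang, category)]


# Toy records that mimic the annotated fields of the sample record.
toy_records = [
    {"id": "a", "langdetect": "en",
     "label": {"cate_ability_en": ["information processing and integration"]}},
    {"id": "b", "langdetect": "zh-cn",
     "label": {"cate_ability_en": ["information processing and integration"]}},
    {"id": "c", "langdetect": "en",
     "label": {"cate_ability_en": ["programming and software development"]}},
]

english_info = select_subset(
    toy_records, lang="en", category="information processing and integration"
)
print([r["id"] for r in english_info])  # ['a']
```

With the Hugging Face `datasets` library, the same predicate could presumably be passed to `Dataset.filter`, for example `dataset_7M.filter(lambda r: matches(r, lang="en"))`.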

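When fine-tuning a model on Infinity-Instruct, we recommend the following training hyperparameters:

Data sources

The deduplicated Infinity-Instruct-7M dataset is composed as follows:

Raw dataset	Number of rows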
glaiveai/glaive-code-assistant-v3	9,281
Replete-AI/code_bagel_hermes-2.5	386,649
m-a-p/CodeFeedback-Filtered-Instruction	60,735
bigcode/self-oss-instruct-sc2-exec-filter-50k	50,467
codefuse-ai/CodeExercise-Python-27k	27,159
nickrosh/Evol-Instruct-Code-80k-v1	43,354
jinaai/code_exercises	590,958
TokenBender/code_instructions_122k_alpaca_style	23,130
iamtarun/python_code_instructions_18k_alpaca	2,581
Nan-Do/instructional_code-search-net-python	82,920
Safurai/Code-Instruct-700k	10,860
ajibawa-2023/Python-Code-23k-ShareGPT	2,297
jtatman/python-code-dataset-500k	88,632
m-a-p/Code-Feedback	79,513
TIGER-Lab/MathInstruct	329,254
microsoft/orca-math-word-problems-200k	398,168
MetaMathQa	690,138
teknium/Openhermes-2.5	855,478
google/flan	2,435,840
Curated subjective instructions	1,342,427
Total	7,449,106

Sources and sizes of the subjective instructions:

Raw dataset	Number of rows
Alpaca GPT4 data	13,490
Alpaca GPT4 data zh	32,589
Baize	14,906
BELLE Generated Chat	43,775
BELLE Multiturn Chat	210,685
BELLE 3.5M CN	312,598
databricks-dolly-15K	10,307
LIMA-sft	712
CodeContest	523
LongForm	3,290
ShareGPT-Chinese-English-90k	8,919
UltraChat	237,199
Wizard evol instruction zh	44,738
Wizard evol instruction 196K	88,681
BELLE School Math	38,329
Code Alpaca 20K	13,296
WildChat	61,873
COIG-CQIA	45,793
BAGEL	55,193
DEITA	10,000
Total	1,342,427

The domain distribution of the subjective instruction categories is shown in the figure below:

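Instruction selection for downstream tasks

To build an objective ranking, we use datasets such as Flan and OpenHermes, with a focus on strengthening code and math capabilities. The method has three parts: annotate the evaluation sets with a fine-grained topic distribution (for example, data structures and sorting in HumanEval); apply heuristic rules based on the dataset source to filter out irrelevant data (for example, removing network or file I/O operations); and retrieve the matching subset from the training set according to the distribution of the validation set.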
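The filtering and retrieval steps can be sketched as follows. This is only an illustration under stated assumptions: each sample is assumed to already carry a `topic` annotation, the banned keywords stand in for whatever heuristic rules are actually used, and proportional sampling per topic is one simple way to match the evaluation distribution, not necessarily the project's method.

```python
import random
from collections import Counter, defaultdict

BANNED_KEYWORDS = ("socket", "open(")  # hypothetical markers of network / file I/O


def drop_irrelevant(samples):
    """Heuristic rule: remove samples whose text mentions a banned keyword."""
    return [s for s in samples
            if not any(k in s["text"].lower() for k in BANNED_KEYWORDS)]


def retrieve_matching(train, eval_topics, budget, seed=0):
    """Sample about `budget` training items with the topic mix of the evaluation set."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for sample in train:
        by_topic[sample["topic"]].append(sample)

    counts = Counter(eval_topics)
    total = sum(counts.values())
    chosen = []
    for topic, n in counts.items():
        quota = round(budget * n / total)      # share of the budget for this topic
        pool = by_topic.get(topic, [])
        chosen.extend(rng.sample(pool, min(quota, len(pool))))
    return chosen


# Toy data: topic labels of an evaluation set and a small training pool.
eval_topics = ["sorting", "sorting", "sorting", "data structures"]
train = (
    [{"topic": "sorting", "text": f"sort list {i}"} for i in range(5)]
    + [{"topic": "data structures", "text": f"implement stack {i}"} for i in range(2)]
    + [{"topic": "sorting", "text": "sort lines read via open('f.txt')"}]
)

selected = retrieve_matching(drop_irrelevant(train), eval_topics, budget=4)
print(Counter(s["topic"] for s in selected))
```

In this toy run the file I/O sample is dropped first, and the remaining budget is split three to one between the two topics, mirroring the evaluation set.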

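Instruction generation for high-quality responses

High-quality open-source instruction collection and tag system

We first systematically collect high-quality open-source instruction sets and tag each instruction with the abilities and knowledge needed to complete it. The tag system gives a clear view of the content distribution of the collection and of the abilities each task requires.

Instruction collection: a systematic survey of existing open-source instruction sets, covering high-quality data written by humans and generated by advanced LLMs
Two-level tag system:
- First-level tags: describe the specific knowledge and abilities needed to complete an instruction (for example, arithmetic calculation and knowledge of biology); generated automatically by an LLM
- Second-level tags: macro domain categories (for example, natural language processing and mathematical reasoning), 25 categories in total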
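To illustrate how the tag system exposes the content distribution, the sketch below counts tags over a handful of records. It assumes that the fine-grained first-level tags live in `label.ability_en` and the macro second-level tags in `label.cate_ability_en`, as the sample record in the usage section suggests; the toy records and the `tag_distribution` helper are made up for the example.

```python
from collections import Counter


def tag_distribution(records, field):
    """Count how many records carry each tag stored under label[field]."""
    counts = Counter()
    for record in records:
        label = record.get("label") or {}
        counts.update(set(label.get(field, [])))  # count a tag once per record
    return counts


# Toy records; the tag names are illustrative only.
toy_records = [
    {"label": {"ability_en": ["arithmetic calculation"],
               "cate_ability_en": ["mathematical reasoning"]}},
    {"label": {"ability_en": ["arithmetic calculation", "biology knowledge"],
               "cate_ability_en": ["mathematical reasoning", "natural science"]}},
]

fine = tag_distribution(toy_records, "ability_en")         # first-level tags
coarse = tag_distribution(toy_records, "cate_ability_en")  # second-level tags
print(fine.most_common(), coarse.most_common())
```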

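Informative instruction selection

The goal is to select the most informative instructions from the full collection, in order to improve LLM performance and the user experience.

Criteria for informative instructions:
- Instructions that require several abilities or several domains of knowledge to complete (identified through the tag system)
- Instructions that involve long-tail abilities or knowledge
- Instructions that are hard to follow (quantified with the method of Li et al. [1])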
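One way to combine the three criteria into a single ranking is sketched below. The additive score, the weights, and the log-rarity term are assumptions made for illustration; the `difficulty` field is assumed to be a precomputed following-difficulty score (for instance from the method in [1]), which this sketch does not compute.

```python
import math
from collections import Counter


def informativeness(records, w_multi=1.0, w_tail=1.0, w_diff=1.0):
    """Score each record by ability count, tag rarity, and following difficulty."""
    tag_freq = Counter(t for r in records for t in set(r["abilities"]))
    n = len(records)
    scores = {}
    for r in records:
        tags = set(r["abilities"])
        multi = len(tags)                                    # multi-ability criterion
        tail = max((math.log(n / tag_freq[t]) for t in tags),
                   default=0.0)                              # long-tail criterion
        scores[r["id"]] = w_multi * multi + w_tail * tail + w_diff * r["difficulty"]
    return scores


def top_k(records, k):
    """Return the k records with the highest informativeness score."""
    scores = informativeness(records)
    return sorted(records, key=lambda r: scores[r["id"]], reverse=True)[:k]


# Toy records; `difficulty` is a placeholder for a precomputed score.
toy_records = [
    {"id": "a", "abilities": ["arithmetic calculation"], "difficulty": 0.2},
    {"id": "b", "abilities": ["arithmetic calculation"], "difficulty": 0.3},
    {"id": "c", "abilities": ["arithmetic calculation", "biology knowledge"],
     "difficulty": 0.5},
]

print([r["id"] for r in top_k(toy_records, 3)])  # ['c', 'b', 'a']
```

Record `c` ranks first because it needs two abilities, one of which is rare in the toy collection, and it has the highest difficulty.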

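Instruction generation by data evolution strategy

Following the method in [2], we expand the seed instructions along four dimensions (breadth, depth, difficulty, and complexity) and use an AI assistant to generate multi-turn dialogue data.

Based on the metadata selected in the previous section, randomly choose an expansion dimension on top of the Evol-Instruct method
Validate the evolved data, using an AI assistant to discard instructions that fail a compliance check
Take each evolved instruction as the initial input and have an AI assistant role-play 2 to 4 rounds of dialogue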
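The three steps above can be sketched as a small pipeline. Everything here is a hedged illustration: the prompt wording is invented, `llm` is any callable that maps a prompt string to a reply string, and `fake_llm` is an offline stand-in so that the sketch runs without a real assistant.

```python
import random

DIMENSIONS = ("breadth", "depth", "difficulty", "complexity")


def evolve(seed_instruction, llm, rng):
    """Step 1: pick one evolution dimension at random and ask for a rewrite."""
    dimension = rng.choice(DIMENSIONS)
    prompt = f"Rewrite the instruction to increase its {dimension}.\n\n{seed_instruction}"
    return dimension, llm(prompt)


def is_compliant(instruction, llm):
    """Step 2: ask the assistant whether the evolved instruction is acceptable."""
    verdict = llm(f"Judge: is this instruction valid? Answer YES or NO.\n\n{instruction}")
    return verdict.strip().upper().startswith("YES")


def render(conversation):
    """Flatten the conversation so far into plain text for the next prompt."""
    return "\n".join(f'{turn["from"]}: {turn["value"]}' for turn in conversation)


def play_dialogue(instruction, llm, rng):
    """Step 3: role-play 2 to 4 rounds, starting from the evolved instruction."""
    rounds = rng.randint(2, 4)
    conversation, user_msg = [], instruction
    for i in range(rounds):
        conversation.append({"from": "human", "value": user_msg})
        reply = llm(f"Reply as the assistant.\n\n{render(conversation)}")
        conversation.append({"from": "gpt", "value": reply})
        if i < rounds - 1:  # only ask for a follow-up if another round remains
            user_msg = llm(f"Ask a follow-up as the user.\n\n{render(conversation)}")
    return conversation


def fake_llm(prompt):
    """Stand-in for a real assistant so that the sketch runs offline."""
    if prompt.startswith("Judge:"):
        return "YES"
    return f"[model output for: {prompt.splitlines()[0]}]"


rng = random.Random(0)
dimension, evolved = evolve("Sort a list of integers.", fake_llm, rng)
dialogue = play_dialogue(evolved, fake_llm, rng) if is_compliant(evolved, fake_llm) else []
print(dimension, len(dialogue) // 2, "rounds")
```

The output uses the same `from` / `value` turn format as the sample record, so generated dialogues could be stored alongside the existing data.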

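Instruction generation by model ability deficiency diagnosis

Data synthesis is guided by automatically identifying where the model's abilities fall short.

Model performance evaluation system: built from commonly used evaluation sets
Automatic ability deficiency diagnosis: an AI assistant summarizes the model's weaknesses from the ground-truth answers and the model outputs
Targeted data synthesis: an AI assistant automatically generates new instructions based on the diagnosis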
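A minimal sketch of this evaluate, diagnose, synthesize loop follows. The exact-match scoring, the prompt wording, and the stub `model` and `llm` callables are all assumptions for illustration; a real setup would plug in the evaluated model and an AI assistant.

```python
def diagnose_and_synthesize(eval_set, model, llm, per_weakness=2):
    """Find failures, ask the assistant to name weaknesses, then request new data."""
    # 1. Evaluation: exact match against the ground truth (a simplifying assumption).
    failures = [ex for ex in eval_set
                if model(ex["question"]).strip() != ex["answer"].strip()]
    if not failures:
        return [], []

    # 2. Diagnosis: the assistant summarizes weaknesses, one per line.
    report = "\n".join(f'Q: {ex["question"]} | expected: {ex["answer"]}'
                       for ex in failures)
    summary = llm(f"Summarize the model's weaknesses, one per line.\n\n{report}")
    weaknesses = [line.strip() for line in summary.splitlines() if line.strip()]

    # 3. Synthesis: new instructions that target each weakness.
    new_instructions = [
        llm(f"Write one new instruction that exercises this weakness: {w}")
        for w in weaknesses
        for _ in range(per_weakness)
    ]
    return weaknesses, new_instructions


# Offline stubs so that the sketch runs without any real model.
def stub_model(question):
    return "4" if question == "2+2" else "unknown"


def stub_llm(prompt):
    if prompt.startswith("Summarize"):
        return "multiplication\n"
    return "new instruction about " + prompt.split(": ", 1)[1]


eval_set = [
    {"question": "2+2", "answer": "4"},
    {"question": "3*7", "answer": "21"},
]
weaknesses, new_instructions = diagnose_and_synthesize(eval_set, stub_model, stub_llm)
print(weaknesses, new_instructions)
```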

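References

[1] Li M, Zhang Y, He S, et al. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning[J]. arXiv preprint arXiv:2402.00530, 2024.

[2] Xu C, Sun Q, Zheng K, et al. WizardLM: Empowering large pre-trained language models to follow complex instructions[C]//The Twelfth International Conference on Learning Representations. 2023.

[3] Zhang G, Qu S, Liu J, et al. Map-neo: Highly capable and transparent bilingual large language model series[J]. arXiv preprint arXiv:2405.19327, 2024.

Citation

Our paper detailing the development and features of the Infinity Instruct dataset will be released on arXiv soon. Stay tuned!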

@article{InfinityInstruct2024,
  title={Infinity Instruct},
  author={Beijing Academy of Artificial Intelligence (BAAI)},
  journal={arXiv preprint arXiv:2406.XXXX},
  year={2024}
}

@article{zhao2024iidoptimizinginstructionlearning,
      title={Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency}, 
      author={Hanyu Zhao and Li Du and Yiming Ju and Chengwei Wu and Tengfei Pan},
      year={2024},
      eprint={2409.07045},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.07045},
}

@misc{zhang2024inifinitymath,
      title={InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning}, 
      author={Bo-Wen Zhang and Yan Yan and Lin Li and Guang Liu},
      year={2024},
      eprint={2408.07089},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.07089}, 
}
