This is one of the encoder-only monolingual language models from the first release of the HPLT project. It is a masked language model; specifically, we use a modified version of the classic BERT model called LTG-BERT.
A monolingual LTG-BERT model was trained for every major language in the HPLT 1.2 data release (75 models in total).
All HPLT encoder-only models use the same hyperparameters, roughly following the BERT-base setup.
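The exact values ship with each checkpoint in its config.json; here is a minimal sketch for inspecting them (the repository name below is only an illustrative example, any of the released HPLT encoders works the same way):

from transformers import AutoConfig

# Print the full configuration: hidden size, number of layers and attention heads,
# vocabulary size, maximum sequence length, and so on.
config = AutoConfig.from_pretrained("HPLT/hplt_bert_base_en", trust_remote_code=True)
print(config)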
Every model uses its own tokenizer, trained on the language-specific HPLT data. For the sizes of the training corpora, evaluation results, and more, please refer to our language model training report.
Training code.
The model currently requires a custom wrapper from modeling_ltgbert.py, so it should be loaded with trust_remote_code=True.
import argparse
import torch
import torch_npu  # Ascend NPU support
from transformers import AutoModelForMaskedLM
from openmind import is_torch_npu_available, AutoTokenizer
from openmind_hub import snapshot_download

# Run on an Ascend NPU if one is available, otherwise fall back to the CPU.
if is_torch_npu_available():
    device = "npu:0"
else:
    device = "cpu"

parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", type=str, default="./")
args = parser.parse_args()

# Use the local checkpoint if a path is given; otherwise download the weights from the hub.
if args.model_name_or_path:
    model_path = args.model_name_or_path
else:
    model_path = snapshot_download(
        "CICC/hplt_bert_base_lv",
        revision="main",
        resume_download=True,
        ignore_patterns=["*.h5", "*.ot", "*.msgpack"],
    )

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device)

# Replace the [MASK] token with the model's top prediction.
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt").to(device)
output_p = model(**input_text)
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))

The following classes are currently implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering, and AutoModelForMultipleChoice.
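As an illustration of one of the other heads, here is a minimal sketch that loads the sequence-classification wrapper; the repository name and the two-label setup are assumptions for the example, and the classification head remains randomly initialized until fine-tuned:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "HPLT/hplt_bert_base_en"  # example repository; other HPLT encoders work the same way
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, trust_remote_code=True, num_labels=2)

inputs = tokenizer("It's a beautiful place.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one score per label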
For each model we release 10 intermediate checkpoints in separate branches, spaced every 3,125 training steps. The naming convention is stepXXX, for example step18750.
You can load a specific model revision with transformers by using the revision argument:
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", revision="step21875", trust_remote_code=True)
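If you prefer to enumerate all checkpoint branches instead of hard-coding one, here is a small sketch using huggingface_hub; the branch filter simply relies on the stepXXX naming convention described above:

from huggingface_hub import list_repo_refs
from transformers import AutoModelForMaskedLM

# List the repository's branches; intermediate checkpoints are named like "step18750".
refs = list_repo_refs("HPLT/hplt_bert_base_en")
checkpoints = sorted(
    (ref.name for ref in refs.branches if ref.name.startswith("step")),
    key=lambda name: int(name[4:]),  # sort numerically by training step
)
print(checkpoints)

# Load one of them (an illustrative choice).
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", revision=checkpoints[0], trust_remote_code=True)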
Cite us:

@inproceedings{de-gibert-etal-2024-new-massive,
title = "A New Massive Multilingual Dataset for High-Performance Language Technologies",
author = {de Gibert, Ona and
Nail, Graeme and
Arefyev, Nikolay and
Ba{\~n}{\'o}n, Marta and
van der Linde, Jelmer and
Ji, Shaoxiong and
Zaragoza-Bernabeu, Jaume and
Aulamo, Mikko and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Kutuzov, Andrey and
Pyysalo, Sampo and
Oepen, Stephan and
Tiedemann, J{\"o}rg},
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.100",
pages = "1116--1128",
abstract = "We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of {\mbox{$\approx$}} 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.",
}