
HPLT Bert for Latvian

This is one of the encoder-only monolingual language models from the first release of the HPLT project. It is a masked language model; specifically, we use a modified version of the classic BERT model called LTG-BERT.

We trained a monolingual LTG-BERT model for every major language in the HPLT 1.2 data release (75 models in total).

All the HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup (a short sketch after the list shows one way to verify these values):

  • hidden size: 768
  • attention heads: 12
  • layers: 12
  • vocabulary size: 32768
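
As a quick check (not part of the original card), these values can be read back from the published configuration. This is a minimal sketch that assumes the repository has been downloaded to a local path and that the custom config exposes the usual transformers attribute names:

from transformers import AutoConfig

# "./hplt_bert_base_lv" is an assumed local snapshot of CICC/hplt_bert_base_lv.
config = AutoConfig.from_pretrained("./hplt_bert_base_lv", trust_remote_code=True)

# Attribute names below are assumed to follow standard transformers conventions.
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
print(config.num_hidden_layers)    # expected: 12
print(config.vocab_size)           # expected: 32768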

Every model uses its own tokenizer trained on its language-specific HPLT data. See the sizes of the training corpora, evaluation results, and more in our language model training report.
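
For illustration only (not from the original card), the language-specific tokenizer can be loaded and inspected on its own; the local path is an assumption and any downloaded copy of the repository works:

from openmind import AutoTokenizer

# "./hplt_bert_base_lv" is an assumed local snapshot of CICC/hplt_bert_base_lv.
tokenizer = AutoTokenizer.from_pretrained("./hplt_bert_base_lv")
# Tokenize a short Latvian sentence with the language-specific tokenizer.
print(tokenizer.tokenize("Rīga ir Latvijas galvaspilsēta."))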

The training code.

The training statistics of all 75 runs.

Example usage

This model currently needs a custom wrapper from modeling_ltgbert.py, so you should load it with trust_remote_code=True.

from transformers import AutoModelForMaskedLM
import torch
import torch_npu  # registers Ascend NPU support for PyTorch
import argparse
from openmind import is_torch_npu_available, AutoTokenizer
from openmind_hub import snapshot_download

# Run on an Ascend NPU if one is available, otherwise fall back to the CPU.
if is_torch_npu_available():
    device = "npu:0"
else:
    device = "cpu"

parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", type=str, default="./")
args = parser.parse_args()

# Use the given local path, or download a snapshot of the repository.
if args.model_name_or_path:
    model_path = args.model_name_or_path
else:
    model_path = snapshot_download(
        "CICC/hplt_bert_base_lv",
        revision="main",
        resume_download=True,
        ignore_patterns=["*.h5", "*.ot", "*.msgpack"],
    )

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device)

# Replace the [MASK] token with the model's most likely prediction.
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt").to(device)
output_p = model(**input_text)
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))

The following classes are currently implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering, and AutoModelForMultipleChoice.
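
As a sketch (not from the original card), one of the other heads can be loaded the same way; the local path and num_labels below are assumptions for illustration, and the classification head itself is randomly initialized until fine-tuned:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "./hplt_bert_base_lv",   # assumed local snapshot of CICC/hplt_bert_base_lv
    trust_remote_code=True,  # required for the custom LTG-BERT classes
    num_labels=2,            # assumed binary classification task
)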

Intermediate checkpoints

We release 10 intermediate checkpoints for each model in separate branches, at intervals of every 3125 training steps. The naming convention is stepXXX, for example step18750.

You can load a specific model revision with transformers using the revision argument:

model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", revision="step21875", trust_remote_code=True)
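
To discover which checkpoint branches a repository actually exposes, one option (an illustrative sketch, not part of the original card) is to list its refs with huggingface_hub; this assumes the repository above is hosted on the Hugging Face Hub:

from huggingface_hub import list_repo_refs

# Intermediate checkpoints show up as stepXXX branches.
refs = list_repo_refs("HPLT/hplt_bert_base_en")
print(sorted(branch.name for branch in refs.branches))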

Cite us

@inproceedings{de-gibert-etal-2024-new-massive,
    title = "A New Massive Multilingual Dataset for High-Performance Language Technologies",
    author = {de Gibert, Ona  and
      Nail, Graeme  and
      Arefyev, Nikolay  and
      Ba{\~n}{\'o}n, Marta  and
      van der Linde, Jelmer  and
      Ji, Shaoxiong  and
      Zaragoza-Bernabeu, Jaume  and
      Aulamo, Mikko  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Kutuzov, Andrey  and
      Pyysalo, Sampo  and
      Oepen, Stephan  and
      Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.100",
    pages = "1116--1128",
    abstract = "We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of {\mbox{$\approx$}} 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.",
}