I have a dataset of IDs that are meaningful to me. I want to use language models to generate IDs based on a few IDs that I give as a starting point. My dataset has one sequence of IDs per line, separated by whitespace, like this:
[start] id1 id3 id6 id9 id1 id5 [end]
[start] id1 id2 id89 id36 id66 id19 id21 ... id75 [end]
...
So first I need to train a tokenizer on whole words (the IDs), not on subwords or bytes, and then train a generative model to generate a sequence. I mostly want to use BART with a slightly smaller config/architecture.
Here is the code I found to train the tokenizer, but I do not know if it will integrate with BART.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Word-level model: every whitespace-separated ID becomes one token.
tokenizer = Tokenizer(WordLevel(unk_token="[unk]"))
tokenizer.pre_tokenizer = Whitespace()

# [pad] and [unk] are included so the tokenizer can later plug into BART.
trainer = WordLevelTrainer(special_tokens=["[start]", "[end]", "[pad]", "[unk]"], show_progress=True)
tokenizer.train([raw_sentences_file_path], trainer=trainer)
tokenizer.save("model/tokenizer/tokenizer.json")
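From the docs I understand the saved file can be wrapped in a PreTrainedTokenizerFast so that transformers models and trainers accept it. This is my sketch of that step (the [pad]/[unk] strings have to match the special tokens I passed to the trainer above):

from transformers import PreTrainedTokenizerFast

# Wrap the word-level tokenizer and tell HF which token plays which role.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="model/tokenizer/tokenizer.json",
    bos_token="[start]",
    eos_token="[end]",
    pad_token="[pad]",
    unk_token="[unk]",
)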
I also found this code for creating a generative model:
import torch
from transformers import BartConfig, BartForCausalLM

configuration = BartConfig(
    vocab_size=11500,  # should match the trained tokenizer's vocabulary size
    max_position_embeddings=258,
    d_model=256,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    decoder_ffn_dim=512,
    encoder_ffn_dim=512,
    # align the special-token ids with the wrapped tokenizer
    pad_token_id=hf_tokenizer.pad_token_id,
    bos_token_id=hf_tokenizer.bos_token_id,
    eos_token_id=hf_tokenizer.eos_token_id,
)
model = BartForCausalLM(configuration)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Then I think I should create a Seq2SeqTrainer and train the model. I also read that instead of BartForCausalLM we can use the BartForConditionalGeneration class from HF.
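Here is my best guess at a minimal training setup with the plain Trainer (a sketch only; hf_tokenizer and model come from the snippets above, raw_sentences_file_path is the same training file, and the hyperparameters are placeholders):

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Each line of the file ("[start] id1 id3 ... [end]") becomes one example.
dataset = load_dataset("text", data_files={"train": raw_sentences_file_path})

def tokenize(batch):
    return hf_tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False means plain next-token (causal) language modeling:
# the collator pads each batch and copies input_ids into labels.
collator = DataCollatorForLanguageModeling(tokenizer=hf_tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="model/bart-ids",
    per_device_train_batch_size=32,
    num_train_epochs=10,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()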
So basically, I need to integrate these pieces and get predictions, but I do not know how. If this code is not correct, what should I use instead? Is there an easier way to do this?
I am using Hugging Face with PyTorch.
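For the prediction step, this is roughly what I imagine, again just a sketch (the prompt IDs are made up and the sampling settings are placeholders):

model.eval()
prompt = "[start] id1 id3 id6"
inputs = hf_tokenizer(prompt, return_tensors="pt").to(device)

output_ids = model.generate(
    inputs["input_ids"],
    max_new_tokens=50,
    do_sample=True,  # sample instead of greedy decoding, for variety
    top_k=50,
    eos_token_id=hf_tokenizer.eos_token_id,  # stop when [end] is generated
)
print(hf_tokenizer.decode(output_ids[0], skip_special_tokens=False))

Is this the right direction?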