I have a dataset of IDs that are meaningful to me. I want to use language models to generate IDs based on a few IDs that I give as a starting point. My dataset has one sequence of IDs per line, separated by whitespace, like this:
[start] id1 id3 id6 id9 id1 id5 [end]
[start] id1 id2 id89 id36 id66 id19 id21 ... id75 [end]
...
So first I need to train a tokenizer on whole words (the IDs), not on subwords or bytes, and then train a generative model to generate a sequence. I mostly want to use BART with a slightly smaller config/architecture.
Here is the code I found to train the tokenizer, but I do not know if it will integrate with BART.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Word-level model: every whitespace-separated ID becomes one token.
tokenizer = Tokenizer(WordLevel(unk_token="[unk]"))
tokenizer.pre_tokenizer = Whitespace()

# [pad] and [unk] are included so the tokenizer can later plug into BART.
trainer = WordLevelTrainer(special_tokens=["[start]", "[end]", "[pad]", "[unk]"], show_progress=True)
tokenizer.train([raw_sentences_file_path], trainer=trainer)
tokenizer.save("model/tokenizer/tokenizer.json")
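From the docs I understand the saved file can be wrapped in a PreTrainedTokenizerFast so that transformers models and trainers accept it. This is my sketch of that step (the [pad]/[unk] strings have to match the special tokens I passed to the trainer above):

from transformers import PreTrainedTokenizerFast

# Wrap the word-level tokenizer and tell HF which token plays which role.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="model/tokenizer/tokenizer.json",
    bos_token="[start]",
    eos_token="[end]",
    pad_token="[pad]",
    unk_token="[unk]",
)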
I also found this code for creating a generative model:
import torch
from transformers import BartConfig, BartForCausalLM

configuration = BartConfig(
    vocab_size=11500,  # should match the trained tokenizer's vocabulary size
    max_position_embeddings=258,
    d_model=256,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    decoder_ffn_dim=512,
    encoder_ffn_dim=512,
    # align the special-token ids with the wrapped tokenizer
    pad_token_id=hf_tokenizer.pad_token_id,
    bos_token_id=hf_tokenizer.bos_token_id,
    eos_token_id=hf_tokenizer.eos_token_id,
)
model = BartForCausalLM(configuration)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Then I think I should create a Seq2SeqTrainer and train the model. I also read that instead of BartForCausalLM we can use the BartForConditionalGeneration class from HF.
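Here is my best guess at a minimal training setup with the plain Trainer (a sketch only; hf_tokenizer and model come from the snippets above, raw_sentences_file_path is the same training file, and the hyperparameters are placeholders):

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Each line of the file ("[start] id1 id3 ... [end]") becomes one example.
dataset = load_dataset("text", data_files={"train": raw_sentences_file_path})

def tokenize(batch):
    return hf_tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False means plain next-token (causal) language modeling:
# the collator pads each batch and copies input_ids into labels.
collator = DataCollatorForLanguageModeling(tokenizer=hf_tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="model/bart-ids",
    per_device_train_batch_size=32,
    num_train_epochs=10,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()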
So basically, I need to integrate these pieces and get predictions, but I do not know how. If this code is not correct, what should I use instead? Is there an easier way to do this?
I am using Hugging Face with PyTorch.
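For the prediction step, this is roughly what I imagine, again just a sketch (the prompt IDs are made up and the sampling settings are placeholders):

model.eval()
prompt = "[start] id1 id3 id6"
inputs = hf_tokenizer(prompt, return_tensors="pt").to(device)

output_ids = model.generate(
    inputs["input_ids"],
    max_new_tokens=50,
    do_sample=True,  # sample instead of greedy decoding, for variety
    top_k=50,
    eos_token_id=hf_tokenizer.eos_token_id,  # stop when [end] is generated
)
print(hf_tokenizer.decode(output_ids[0], skip_special_tokens=False))

Is this the right direction?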