
I'm following this tutorial on training a causal language model from scratch. My dataset is a corpus of text:

my_dataset = ["some_text... _112_ some_text... _113_ some_text... _114_ some_text...", "some_text... _1423_ some_text... _1424_ some_text... _1425_ some_text...", "some_text... _1111_ some_text... _1111_ some_text... _1111_ some_text..."]. 

The issue is that my dataset contains a clear pattern of numbers in each text (either the numbers are consecutive or they repeat).

I would like to mask out the previously predicted numbers in this pattern as the model predicts the next tokens (note that they always have the pattern _X_, where X is a number, so I don't want to mask out just any previous number, only those that match the pattern).

For example, given the first text, after the model predicts _112_, I'd like to mask the number 112 in that sequence for the subsequent token predictions (e.g., "some_text... _MaskToken_ some_text...").
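To illustrate what I mean, here is a rough sketch of the behaviour I'm after on the raw text (mask_seen_numbers is just a hypothetical helper to show the intent, not something I already have):

import re

def mask_seen_numbers(text, seen, mask_token="_MaskToken_"):
    # Replace only the _X_ occurrences whose number X has already been predicted
    return re.sub(r"_([0-9]+)_",
                  lambda m: mask_token if m.group(1) in seen else m.group(0),
                  text)

text = "some_text... _112_ some_text... _113_ some_text... _114_ some_text..."
print(mask_seen_numbers(text, seen={"112"}))
# some_text... _MaskToken_ some_text... _113_ some_text... _114_ some_text...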

I found this SO question that I believe asked something similar a couple of years ago, but it was left unanswered and therefore resorted to an inefficient method. From the tutorial I'm following, it seems like the DataCollatorForLanguageModeling collator might be the way to go about this:

"Besides stacking and padding batches, it also takes care of creating the language model labels — in causal language modeling the inputs serve as labels too (just shifted by one element), and this data collator creates them on the fly during training so we don’t need to duplicate the input_ids."

From this reddit post I understand that DataCollatorForLanguageModeling will:

"Duplicate the training sentence. If the masking is performed every time a sequence is fed to the model, the model sees different versions of the same sentence with masks on different positions."

The tutorial also mentions:

"Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels."

But going over the source code of GPT2LMHeadModel and the data collator, it is still not clear to me how to do this.
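My rough understanding of the shift is something like this (a paraphrase of the causal-LM loss computation, not the exact source):

from torch.nn import CrossEntropyLoss

def causal_lm_loss(lm_logits, labels):
    # Drop the last logit and the first label so that position t predicts token t+1
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = CrossEntropyLoss()  # positions labeled -100 are ignored
    return loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))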


1 Answer


You could create a new class that implements a mask pattern:

import re

import torch
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerBase

class MyDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    def __init__(self, tokenizer: PreTrainedTokenizerBase, pattern=r"_[0-9]+_"):
        # mlm=False: we are doing causal LM, so the collator copies input_ids into labels
        super().__init__(tokenizer=tokenizer, mlm=False)
        self.pattern = pattern

    def mask_pattern(self, tokens):
        # Replace every token whose decoded text matches the pattern with the mask token.
        # Note: GPT-2 has no mask token by default, so one has to be added to the tokenizer
        # (see the usage example below).
        output_tokens = []
        mask_token_id = self.tokenizer.mask_token_id
        for token in tokens.tolist():
            if re.match(self.pattern, self.tokenizer.decode([token])):
                output_tokens.append(mask_token_id)
            else:
                output_tokens.append(token)
        return output_tokens

    def __call__(self, examples):
        batch = super().__call__(examples)
        # The parent collator returns tensors, so convert back to a tensor after masking
        batch["input_ids"] = torch.tensor(
            [self.mask_pattern(tokens) for tokens in batch["input_ids"]],
            dtype=batch["input_ids"].dtype,
        )
        return batch

Then you can use it like so (GPT-2 has no pad or mask token by default, so they are added here, and the raw strings are tokenized before being handed to the Trainer):

from datasets import Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has neither a pad token nor a mask token by default, so add them
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({"mask_token": "[MASK]"})

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

texts = ["some_text... _112_ some_text... _113_ some_text... _114_ some_text...",
         "some_text... _1423_ some_text... _1424_ some_text... _1425_ some_text...",
         "some_text... _1111_ some_text... _1111_ some_text... _1111_ some_text..."]

# The collator expects tokenized examples, so tokenize the raw strings first
dataset = Dataset.from_dict({"text": texts}).map(
    lambda examples: tokenizer(examples["text"]), batched=True, remove_columns=["text"]
)

data_collator = MyDataCollatorForLanguageModeling(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)

trainer.train()
  • Thanks for the help! But wouldn't this just mask all the patterns? – Penguin Mar 08 '23 at 15:53
  • No, this code is designed to only mask the specific pattern of tokens that match the regular expression provided in the pattern argument. In the code, the mask_pattern function checks each token in the input sequence and only replaces those that match the pattern with the mask token ID. The input sequence is then returned with the masked tokens. – Phong Phuong Mar 08 '23 at 17:37
  • Also, you will need to create a new instance and pass in the pattern that you want to match after each prediction. data_collator = MyDataCollatorForLanguageModeling(tokenizer, pattern="_[0-9]+X_") – Phong Phuong Mar 08 '23 at 17:45
  • Sorry, that's what I meant. It will mask out all of this specific pattern, and not "previous predicted numbers in this pattern as the model predicts the next tokens" as I asked – Penguin Mar 08 '23 at 23:48
  • You have to provide what your pattern is. The code is an example and you will have to tailor it to your specific requirements. – Phong Phuong Mar 09 '23 at 00:45
  • I did provide the pattern though. It's any number between `_XX_` as mentioned above. And as I explained, your code doesn't work for my problem because it masks *all* of the tokens in that pattern, and not only those that the model has already seen as I described – Penguin Mar 09 '23 at 03:36
  • You'll need to read up on how to use regex patterns – Phong Phuong Mar 09 '23 at 07:56
  • For example, passing in a pattern of "^_112_$" will match the value "_112_" exactly, and only mask it if matches. – Phong Phuong Mar 09 '23 at 08:11