
Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer.encode()

I have seen in most places that people save that tokenizer at the same time that they save their model, but I am unclear on why it's necessary to save since it seems like an out-of-the-box tokenizer that does not get modified in any way during training.

ginobimura

3 Answers


In your case, if you are using the tokenizer only to tokenize the text (encode()), then you do not need to save it. You can always load the tokenizer of the pretrained model.

However, sometimes you may want to start from the tokenizer of the pretrained model and then add new tokens to its vocabulary, or redefine special symbols such as '[CLS]', '[MASK]', '[SEP]', '[PAD]', or other special tokens. In that case, since you have made changes to the tokenizer, it is useful to save it for future use.
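For example, a minimal sketch of that situation (the added tokens and the directory name are placeholders, not something from your setup):

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Add domain-specific tokens that are not in the stock vocabulary
num_added = tokenizer.add_tokens(['covid19', 'mrna'])

# The embedding matrix has to grow to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# The tokenizer was modified, so save it next to the model
tokenizer.save_pretrained('new_model_dir')
model.save_pretrained('new_model_dir')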

Ashwin Geet D'Sa

You can always reload the tokenizer with:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Saving it may just be part of the usual routine and not strictly necessary in your case.
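As a rough sketch of that routine (the output directory is just a placeholder), saving and reloading an unmodified tokenizer looks like this:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Often done alongside model.save_pretrained(), even though nothing changed
tokenizer.save_pretrained('./my-finetuned-model')

# Later you can reload from that directory or from the original checkpoint name
tokenizer = BertTokenizer.from_pretrained('./my-finetuned-model')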

prosti

Tokenizers create their vocabulary based on the frequency of words (or subwords as in byte pair encoding) in a training corpus. The same tokenizer may have a different vocabulary depending on the corpus on which it is trained.

For this reason you probably want to save the tokenizer after "training" it on a corpus and before subsequently training a model that uses that tokenizer.
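As a minimal sketch of that case (the in-memory corpus and vocabulary size are made up for illustration, and this uses the fast tokenizer API rather than the slow BertTokenizer):

from transformers import AutoTokenizer

corpus = ['your domain-specific documents go here', 'one string per example']

# Start from the pretrained fast tokenizer and build a new WordPiece vocabulary
base = AutoTokenizer.from_pretrained('bert-base-uncased')
new_tokenizer = base.train_new_from_iterator(corpus, vocab_size=8000)

# This vocabulary is specific to your corpus, so it must be saved with the model
new_tokenizer.save_pretrained('./my-corpus-tokenizer')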

The Huggingface Tokenizer Summary covers how these vocabularies are built up.

JoshVarty
  • Sure, typically tokenizers get fit on a corpus and their vocab is built up as such, but for BertTokenizer it seems as if the vocab comes pre-built, and I don't see any place where it gets modified during the training of the fine-tuning layer... Unless the tokenizer.encode() method modifies the tokenizer object somehow? – ginobimura Sep 23 '20 at 20:08