
Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer.encode()

I have seen in most places that people save that tokenizer at the same time that they save their model, but I am unclear on why it's necessary to save since it seems like an out-of-the-box tokenizer that does not get modified in any way during training.

ginobimura

3 Answers


In your case, if you are using the tokenizer only to tokenize the text (encode()), then you do not need to save it. You can always load the tokenizer of the pretrained model.

However, sometimes you may want to start from the tokenizer of the pretrained model and then add new tokens to its vocabulary, or redefine special symbols such as '[CLS]', '[MASK]', '[SEP]', '[PAD]', or other special tokens. In that case, since you have made changes to the tokenizer, it is useful to save it for future use.
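For example, a minimal sketch of that situation (the added tokens and the directory name are placeholders, not something from your setup):

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Add domain-specific tokens that are not in the stock vocabulary
num_added = tokenizer.add_tokens(['covid19', 'mrna'])

# The embedding matrix has to grow to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# The tokenizer was modified, so save it next to the model
tokenizer.save_pretrained('new_model_dir')
model.save_pretrained('new_model_dir')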

Ashwin Geet D'Sa

You can always reload the tokenizer with:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Saving it may just be part of the usual routine and not strictly necessary in your case.
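As a rough sketch of that routine (the output directory is just a placeholder), saving and reloading an unmodified tokenizer looks like this:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Often done alongside model.save_pretrained(), even though nothing changed
tokenizer.save_pretrained('./my-finetuned-model')

# Later you can reload from that directory or from the original checkpoint name
tokenizer = BertTokenizer.from_pretrained('./my-finetuned-model')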

prosti

Tokenizers create their vocabulary based on the frequency of words (or subwords as in byte pair encoding) in a training corpus. The same tokenizer may have a different vocabulary depending on the corpus on which it is trained.

For this reason you probably want to save the tokenizer after "training" it on a corpus and before subsequently training a model that uses that tokenizer.
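As a minimal sketch of that case (the in-memory corpus and vocabulary size are made up for illustration, and this uses the fast tokenizer API rather than the slow BertTokenizer):

from transformers import AutoTokenizer

corpus = ['your domain-specific documents go here', 'one string per example']

# Start from the pretrained fast tokenizer and build a new WordPiece vocabulary
base = AutoTokenizer.from_pretrained('bert-base-uncased')
new_tokenizer = base.train_new_from_iterator(corpus, vocab_size=8000)

# This vocabulary is specific to your corpus, so it must be saved with the model
new_tokenizer.save_pretrained('./my-corpus-tokenizer')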

The Huggingface Tokenizer Summary covers how these vocabularies are built up.

JoshVarty
  • Sure, typically tokenizers get fit on a corpus and their vocab is built up as such, but for BertTokenizer it seems as if the vocab comes pre-built, and I don't see any place where it gets modified during the training of the fine-tuning layer... Unless the tokenizer.encode() method modifies the tokenizer object somehow? – ginobimura Sep 23 '20 at 20:08