
I am working on a text classification problem where I want to use the BERT model as the base, followed by Dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences as:

'My name is slim shade and I am an aspiring AI Engineer',
'I am an aspiring AI Engineer',
'My name is Slim'

So what will these 3 arguments do? What I think is as follows:

  1. max_length=5 will keep all the sentences at a length of 5 strictly
  2. padding='max_length' will add a padding of 1 to the third sentence
  3. truncation=True will truncate the first and second sentences so that their length is strictly 5.

Please correct me if I am wrong.

Below is the code I have used.

! pip install transformers==3.5.1

import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# the three example sentences from above
text = ['My name is slim shade and I am an aspiring AI Engineer',
        'I am an aspiring AI Engineer',
        'My name is Slim']

tokens = tokenizer.batch_encode_plus(text, max_length=5, padding='max_length', truncation=True)

text_seq = torch.tensor(tokens['input_ids'])
text_mask = torch.tensor(tokens['attention_mask'])
Deshwal

1 Answer


What you have assumed is almost correct; however, there are a few differences.

max_length=5: the max_length specifies the length of the tokenized text. By default, BERT performs word-piece tokenization. For example, the word "playing" can be split into "play" and "##ing" (this may not be very precise, but it helps to understand word-piece tokenization), followed by adding a [CLS] token at the beginning of the sentence and a [SEP] token at the end of the sentence. Thus, it first tokenizes the sentence, truncates it to max_length-2 (if truncation=True), then prepends [CLS] at the beginning and appends [SEP] at the end (so the total length is max_length).
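
As an illustration, here is a minimal sketch (assuming the same `bert-base-uncased` tokenizer and transformers 3.5.1 as in the question) showing the word-piece tokenization of the first sentence, and how max_length=5 leaves room for only 3 word-pieces between [CLS] and [SEP]:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

sentence = 'My name is slim shade and I am an aspiring AI Engineer'

# word-piece tokens of the full sentence, without special tokens
print(tokenizer.tokenize(sentence))

# with max_length=5 and truncation=True, only the first 3 word-pieces survive,
# wrapped by [CLS] and [SEP] (3 + 2 = 5 tokens in total)
encoded = tokenizer.encode_plus(sentence, max_length=5, truncation=True)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# expected to print something like: ['[CLS]', 'my', 'name', 'is', '[SEP]']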

padding='max_length': in this example it is not very evident that the 3rd example will be padded, as its length exceeds 5 after adding the [CLS] and [SEP] tokens. However, if you set max_length to 10, the tokenized text corresponds to [101, 2026, 2171, 2003, 11754, 102, 0, 0, 0, 0], where 101 is the id of the [CLS] token and 102 is the id of the [SEP] token. Thus, the text is padded with zeros so that every input reaches the length of max_length.
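
A quick sketch of that case (again with `bert-base-uncased`; the ids in the comments are what the tokenizer is expected to produce, so treat them as illustrative):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

encoded = tokenizer.encode_plus('My name is Slim', max_length=10,
                                padding='max_length', truncation=True)

# [CLS] my name is slim [SEP] followed by 4 padding tokens (id 0)
print(encoded['input_ids'])       # e.g. [101, 2026, 2171, 2003, 11754, 102, 0, 0, 0, 0]
# attention_mask is 1 for real tokens and 0 for the padded positions
print(encoded['attention_mask'])  # e.g. [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]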

Likewise, truncation=True will ensure that max_length is strictly adhered to, i.e., longer sentences are truncated to max_length only if truncation=True.
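
As a quick check (a sketch reusing the three sentences from the question), batch-encoding with these three arguments gives sequences of exactly 5 ids each; here all three sentences are long enough that they get truncated rather than padded:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

text = ['My name is slim shade and I am an aspiring AI Engineer',
        'I am an aspiring AI Engineer',
        'My name is Slim']

tokens = tokenizer.batch_encode_plus(text, max_length=5,
                                     padding='max_length', truncation=True)

# every row has length 5: longer inputs are truncated, shorter ones would be zero-padded
for ids in tokens['input_ids']:
    print(len(ids), tokenizer.convert_ids_to_tokens(ids))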

Ashwin Geet D'Sa
  • Thanks a lot for this detailed answer. I want truncation because I am working on a classification problem, so I cannot work with variable lengths. Also, I have one more doubt regarding word-piece tokenization. The vocab is already built using the 300B texts, in my opinion. So is it possible that it can change `zaxis` to `z axis` and `currentI` to `current I`? Also, will it be better to use `Lemmatization` and/or `Stemming`? And can it work with `sin,cos,theta,gamma` etc.? I think these were a part of the 300B texts too, because it predicts the correct label based on these only. – Deshwal Dec 12 '20 at 05:28
  • We cannot be sure whether `currentI` will be changed to `current I`, as the word `current` alone can be split into pieces. You need not perform lemmatization; I have an answer on this at: https://stackoverflow.com/questions/57057992/wordpiece-tokenization-versus-conventional-lemmatization/57072351#57072351 – Ashwin Geet D'Sa Dec 12 '20 at 10:27
  • I have 6 elements in my preprocessing. Can you give your feedback on which ones I should use? **Lemmatization, Stemming, number removal** (any float or int), **single-length word removal** (x, y, i, a, b), **change any number to "number"** (2 = number, 123.43 = number), **remove stopwords, remove special characters**? – Deshwal Dec 13 '20 at 05:48
  • And your other answer was very descriptive and gives a lot of details. Thanks for helping people like me. – Deshwal Dec 13 '20 at 05:52
  • Can you please help me with the comment problem above? Any guesses? – Deshwal Dec 14 '20 at 07:46
  • preProcessing? – Ashwin Geet D'Sa Dec 14 '20 at 08:17
  • `PreProcessing` is a function which uses 0-6 of the different techniques described above. Which ones should I use? The data is Physics, Chemistry, Maths and Biology. Lemmatization and stemming won't work, as you have described above. Removing stop words would lose the semantic context too. So keeping these things in mind, which ones should I use? – Deshwal Dec 15 '20 at 13:20
  • Preprocessing really depends on the choice of application. If you are using BERT, I would generally recommend not performing lemmatization. However, preprocessing such as `2 = number`, etc. usually doesn't make much difference. You can still keep single-length words (stop words), as these would add some or the other info. – Ashwin Geet D'Sa Dec 15 '20 at 13:56
  • Oh! Okay. Thanks. I am using BERT only, so I think these things won't be that helpful. For my custom models, things were different. – Deshwal Dec 16 '20 at 06:35
  • @AshwinGeetD'Sa Thx for the excellent answer! Could you please further elaborate on **why** max_length is a soft limit and how it works? – Ondrej Sotolar May 10 '22 at 13:42