
I'm trying to evaluate several transformer models sequentially on the same dataset to check which one performs best.

The list of models is this one:

MODELS = [
    ('xlm-mlm-enfr-1024', "XLMModel"),
    ('distilbert-base-cased', "DistilBertModel"),
    ('bert-base-uncased', "BertModel"),
    ('roberta-base', "RobertaModel"),
    ("cardiffnlp/twitter-roberta-base-sentiment", "RobertaSentTW"),
    ('xlnet-base-cased', "XLNetModel"),
    # ('ctrl', "CTRLModel"),
    ('transfo-xl-wt103', "TransfoXLModel"),
    ('bert-base-cased', "BertModelUncased"),
    ('xlm-roberta-base', "XLMRobertaModel"),
    ('openai-gpt', "OpenAIGPTModel"),
    ('gpt2', "GPT2Model"),
]

All of them work fine until the 'ctrl' model, which returns this error when tokenizing the sentences of my dataset:

Asking to pad, but the tokenizer does not have a padding token. Please select a token to use as 'pad_token' '(tokenizer.pad_token = tokenizer.eos_token e.g.)' or add a new pad token via 'tokenizer.add_special_tokens({'pad_token': '[PAD]'})'.

The tokenizing code is

SEQ_LEN = MAX_LEN  # (50)

for pretrained_weights, model_name in MODELS:

    print("***************** STARTING ", model_name, ", weights ", pretrained_weights, " *********")
    print("loading the tokenizer")
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    print("creating the pretrained model")
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)
    print("applying the tokenizer to the dataset")

## APPLYING THE TOKENIZER ##

def tokenize(sentence):
    tokens = tokenizer.encode_plus(sentence, max_length=MAX_LEN,
                                   truncation=True, padding='max_length',
                                   add_special_tokens=True, return_attention_mask=True,
                                   return_token_type_ids=False, return_tensors='tf')
    return tokens['input_ids'], tokens['attention_mask']

# initialize two arrays for input tensors
Xids = np.zeros((len(df), SEQ_LEN))
Xmask = np.zeros((len(df), SEQ_LEN))

for i, sentence in enumerate(df['tweet']):
    Xids[i, :], Xmask[i, :] = tokenize(sentence)
    if i % 10000 == 0:
        print(i)  # do this so we can see some progress


arr = df['label'].values  # take label column in df as array

labels = np.zeros((arr.size, arr.max()+1))  # initialize empty (all zero) label array
labels[np.arange(arr.size), arr] = 1  # add ones in indices where we have a value

I have tried to define the padding token as the error message suggests, but then this error appears:

could not broadcast input array from shape (3,) into shape (50,)

on this line:

Xids[i, :], Xmask[i, :] = tokenize(sentence)

I have also tried this solution, and it doesn't work either.

If you have managed to read this far, thank you.

Any help is appreciated.

David Beauchemin
Pablo Cordon
  • `could not broadcast input array from shape (3,) into shape (50,)` says that the tensors returned from `tokenize` have shape `3`, while `Xids` reserves space for tensors of shape `50`: the shapes mismatch. When you do `return tokens['input_ids'], tokens['attention_mask']`, make sure both tensors have shape `SEQ_LEN`; if not, pad them with zeros or clip them. Find a way to do so in TensorFlow, since you are using `return_tensors='tf'`; I only know PyTorch. – harshraj22 Jan 01 '22 at 11:57
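
As a rough sketch of what this comment suggests (the helper name to_fixed_length is made up, and it assumes the same SEQ_LEN and tokenize as in the question), you can force both returned tensors to a fixed length before writing them into the preallocated arrays:

import numpy as np

SEQ_LEN = 50  # must match the second dimension of Xids / Xmask

def to_fixed_length(ids, mask, seq_len=SEQ_LEN):
    # Flatten the (1, n) tensors returned by return_tensors='tf' into 1-D arrays.
    ids = np.array(ids).reshape(-1)
    mask = np.array(mask).reshape(-1)
    # Clip anything longer than seq_len.
    ids, mask = ids[:seq_len], mask[:seq_len]
    # Pad anything shorter with zeros (an attention mask of 0 means "ignore this position").
    missing = seq_len - len(ids)
    if missing > 0:
        ids = np.concatenate([ids, np.zeros(missing, dtype=ids.dtype)])
        mask = np.concatenate([mask, np.zeros(missing, dtype=mask.dtype)])
    return ids, mask

With this helper, Xids[i, :], Xmask[i, :] = to_fixed_length(*tokenize(sentence)) always receives arrays of shape (SEQ_LEN,), so the broadcast error cannot occur.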

3 Answers


kkgarg's idea was right, but you also need to update your model's token embedding size. So the code will be:

tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
model = TFAutoModel.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

Check this related issue.
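
As a sketch of how this fits into the question's per-model loop (reusing the MODELS list and TFAutoModel from the question; the TF model classes also provide resize_token_embeddings):

from transformers import AutoTokenizer, TFAutoModel

for pretrained_weights, model_name in MODELS:
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)

    # Models like ctrl, openai-gpt and gpt2 ship without a pad token,
    # so add one and grow the embedding matrix to the new vocabulary size.
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        transformer_model.resize_token_embeddings(len(tokenizer))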

Googr
  • Would there be a potential downside to including the last line of code (model.resize_...) in the if-branch? This line of code will only be needed when the PAD token was added, so the embeddings only need to be resized in this case? – Jan Spörer Dec 20 '22 at 11:25
  • this doesn't actually work anymore, see: https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/76639568#76639568 – Charlie Parker Jul 08 '23 at 02:11

You can add the [PAD] token using the add_special_tokens API.

tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
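
As a quick, hedged check (the example sentence and max_length value here are made up): once [PAD] has been added, encode_plus can pad to max_length without raising the original error.

enc = tokenizer.encode_plus("an example tweet", max_length=50,
                            truncation=True, padding='max_length',
                            return_attention_mask=True, return_tensors='tf')
print(enc['input_ids'].shape)   # (1, 50)
print(tokenizer.pad_token_id)   # id assigned to the newly added [PAD] token

Note the follow-up comments below: the model's embedding matrix still has to be resized to account for the new token before you run the model.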
kkgarg
  • Did the answer work for you? – kkgarg Jan 07 '22 at 20:09
  • This correctly added the padding token to the tokenizer, but afterwards I got another error because my embeddings did not expect this new token. I found [this issue](https://github.com/huggingface/transformers/issues/3021) that mimics the padding behavior using the attention mask. – zolastro Jun 03 '22 at 11:28
  • you still need to add this to the embedding table if the model doesn't have a pad token. – Charlie Parker Jul 08 '23 at 02:10

You can also try assigning the eos_token (end-of-sequence token) as the pad_token.

tokenizer.pad_token = tokenizer.eos_token
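
A minimal sketch with gpt2 (chosen here only because it is one of the models in the question's list that has an eos token but no pad token):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # reuse eos as the padding token

enc = tokenizer("a short tweet", max_length=50, truncation=True,
                padding='max_length', return_tensors='tf')
# Padded positions are filled with the eos id, and the attention mask marks
# them with 0, so the model ignores them.

Because no new token is added to the vocabulary, the embedding matrix does not need to be resized with this approach.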

qing guo