
I'm trying to fine-tune BERT for a text classification task, but I'm getting NaN losses and can't figure out why.

First I define a BERT tokenizer and then tokenize my text:

from transformers import DistilBertTokenizer, RobertaTokenizer
from tqdm import tqdm
import numpy as np
import pandas as pd

distil_bert = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)

def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=25, pad_to_max_length=True, 
                                             return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])        

    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')

train = pd.read_csv('train_dataset.csv')
d = train['text']
input_ids, input_masks, input_segments = tokenize(d, tokenizer)

Next, I load my integer labels which are: 0, 1, 2, 3.

d_y = train['label']
0    0
1    1
2    0
3    2
4    0
5    0
6    0
7    0
8    3
9    1
Name: label, dtype: int64

Then I load the pretrained Transformer model and put layers on top of it. I use sparse categorical cross-entropy loss when compiling the model:

from transformers import TFDistilBertForSequenceClassification, DistilBertConfig, AutoTokenizer, TFDistilBertModel
import tensorflow as tf

distil_bert = 'distilbert-base-uncased'
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0000001)

config = DistilBertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config=config)

input_ids_in = tf.keras.layers.Input(shape=(25,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(25,), name='masked_token', dtype='int32')

embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs=X)

for layer in model.layers[:3]:
    layer.trainable = False

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])

Finally, I run the model using the previously tokenized input_ids and input_masks as inputs, and the loss becomes NaN after the first epoch:

model.fit(x=[input_ids, input_masks], y = d_y, epochs=3)

Epoch 1/3
20/20 [==============================] - 4s 182ms/step - loss: 0.9714 - sparse_categorical_accuracy: 0.6153
Epoch 2/3
20/20 [==============================] - 0s 19ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
Epoch 3/3
20/20 [==============================] - 0s 20ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
<tensorflow.python.keras.callbacks.History at 0x7fee0e220f60>

EDIT: The model computes a loss on the first epoch, but it starts returning NaN from the second epoch onward. What could be causing that?

Does anyone have any idea what I am doing wrong? All suggestions are welcome!

beginner

4 Answers


The problem likely occurred because num_labels was not specified.

At the final output layer, the number of labels K defaults to 1, and the softmax is $\sigma(\vec{z})_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}$,

so when fine-tuning for multi-class classification you need to provide num_labels.

from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)
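
For the question's setup, a minimal sketch might look like this, assuming the same distilbert-base-uncased checkpoint, 4 classes, and the built-in sequence-classification head rather than the custom BiLSTM head (the learning rate is only illustrative):

from transformers import TFDistilBertForSequenceClassification
import tensorflow as tf

# 4 classes (labels 0, 1, 2, 3), so num_labels=4
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)

# The built-in classification head returns logits, so use from_logits=True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['sparse_categorical_accuracy'])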
William Baker Morrison

The problem is here:

X = tf.keras.layers.Dense(1, activation='softmax')(X)

At the end of the network you have only a single neuron, corresponding to a single class, so the output probability is always 100% for class 0. If your classes are 0, 1, 2, 3, you need 4 outputs at the end.
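
With the layer names from the question's code, the fix is a sketch like this (X, input_ids_in and input_masks_in come from the question's model definition):

import tensorflow as tf

# One output unit per class; a 4-way softmax pairs with sparse_categorical_crossentropy
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs=X)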

Jindřich
  • I have changed it to 4 outputs but the problem seems to persist. – beginner Jun 18 '20 at 09:17
  • It computes the loss for the first epoch but from the second epoch and onward losses are NaN. – beginner Jun 18 '20 at 09:19
  • The code snippet looks fine now. The most frequent reason for getting nans is dividing by zero. It might come from the data, e.g., you might have a mask set to all zeros. – Jindřich Jun 18 '20 at 11:16
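
Following up on the all-zero-mask suggestion in the comments, a quick sanity check on the tokenized arrays and labels could look like this (a sketch, assuming the input_ids, input_masks and d_y variables from the question):

import numpy as np

# An attention mask of all zeros means every token is masked out,
# which can lead to a division by zero inside the model
empty_masks = np.where(input_masks.sum(axis=1) == 0)[0]
print('rows with all-zero attention masks:', empty_masks)

# Labels should be integers in [0, 3] with no missing values
labels = d_y.to_numpy().astype('float64')
print('label NaNs:', np.isnan(labels).sum())
print('label range:', labels.min(), labels.max())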

I'd also suggest removing NA values from the pandas data frame before using the dataset for training and evaluation.

train = pd.read_csv('train_dataset.csv')
d = train['text']
d = d.dropna()
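
If the labels live in the same data frame, dropping rows at the frame level keeps text and labels aligned; a small sketch, assuming the 'text' and 'label' column names from the question:

import pandas as pd

train = pd.read_csv('train_dataset.csv')
# Drop rows with missing text or labels before splitting out the columns,
# so that the inputs and targets stay aligned row for row
train = train.dropna(subset=['text', 'label'])
d = train['text']
d_y = train['label']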
user12769533

I had a similar problem where my model produced NaN losses only during the last batch of an epoch; all the other batches returned typical loss values. In my case, the cause was that the batches were not all the same size. After I made every batch equally sized, the NaNs were gone, as in the sketch below. It might be worth checking whether this is also true in your case.
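
If you feed the model through tf.data, one way to enforce equally sized batches is drop_remainder=True; a sketch, assuming the input_ids, input_masks and d_y variables from the question (the batch size is only illustrative):

import tensorflow as tf

batch_size = 16  # illustrative value

dataset = tf.data.Dataset.from_tensor_slices(((input_ids, input_masks), d_y.to_numpy()))
# drop_remainder=True discards the final, smaller batch so every batch has the same size
dataset = dataset.shuffle(len(d_y)).batch(batch_size, drop_remainder=True)

model.fit(dataset, epochs=3)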

Fabian