
I use a BiLSTM-CRF architecture to assign a label to each sentence in a paper. I have 150 papers, each containing 380 sentences; every sentence is represented by a float array of size 11 with values in the range (0, 1), and there are 11 class labels.
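
The tensors I feed into the model look roughly like this (a minimal sketch with random data, just to illustrate the shapes and the zero padding that the Masking layer below relies on; it is not my real preprocessing):

import numpy as np

n_papers, max_sents, n_feats, n_classes = 150, 380, 11, 11

# Sentence features lie in (0, 1), so 0 can be reserved as the pad value.
X = np.random.uniform(low=0.01, high=1.0, size=(n_papers, max_sents, n_feats))
# One-hot labels, matching sparse_target=False in the CRF layer.
y = np.eye(n_classes)[np.random.randint(n_classes, size=(n_papers, max_sents))]

# A paper shorter than max_sents is zero-padded at the end.
X[0, 300:, :] = 0.0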

from keras.models import Model
from keras.layers import Input, Masking, Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras_contrib.layers import CRF

inputs = Input(shape=(None, 11))
mask = Masking(mask_value=0)(inputs)  # skip zero-padded time steps
lstm = Bidirectional(LSTM(50, return_sequences=True))(mask)
lstm = Dropout(0.3)(lstm)
lstm = TimeDistributed(Dense(50, activation="relu"))(lstm)
crf = CRF(11, sparse_target=False, learn_mode='join')  # CRF output layer
out = crf(lstm)

model = Model(inputs, out)
model.summary()
model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])

I use the keras-contrib package for the CRF layer. The CRF layer has two learning modes: join mode and marginal mode. As I understand it, join mode is a true CRF that uses the Viterbi algorithm to predict the best label path, while marginal mode is not a true CRF: it trains on per-timestep marginals with categorical cross-entropy as the loss. When I use marginal mode, the output looks like this:

Epoch 4/250 - 6s - loss: 1.2289 - acc: 0.5657 - val_loss: 1.3459 - val_acc: 0.5262

But in join mode, the loss becomes NaN:

Epoch 2/250 - 5s - loss: nan - acc: 0.1880 - val_loss: nan - val_acc: 0.2120
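
For reference, the only difference between the two runs is the CRF constructor; a minimal sketch of both configurations (keras-contrib's CRF also accepts a test_mode argument, which I leave at its default: Viterbi decoding for join mode, marginal decoding for marginal mode):

# Join mode: a true CRF likelihood, decoded with Viterbi at test time.
crf = CRF(11, sparse_target=False, learn_mode='join')

# Marginal mode: per-timestep marginals trained with categorical cross-entropy.
crf = CRF(11, sparse_target=False, learn_mode='marginal')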

I do not understand why this happens and would be grateful for any hints.

Nasrin
  • Facing a similar problem (different network and task, but NaN as the loss), I found the following pages helpful: https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network and https://github.com/keras-team/keras/issues/2134 – Eulenfuchswiesel Jan 22 '19 at 14:11
  • This can happen due to vanishing or exploding gradients. To address it, tweak the learning rate, or decrease it by some fraction after every epoch. – hodophile Jul 26 '20 at 09:47
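
A minimal sketch of what the last comment suggests: a smaller starting learning rate with gradient-norm clipping, plus a per-epoch decay (the clipnorm value, the 1e-4 starting rate, and the 0.95 factor are illustrative and not tested on this model; X and y are the training tensors):

from keras.callbacks import LearningRateScheduler
from keras.optimizers import Adam

# Smaller starting learning rate; clip gradient norms to curb explosions.
opt = Adam(lr=1e-4, clipnorm=1.0)
model.compile(opt, loss=crf.loss_function, metrics=[crf.accuracy])

# Decay the learning rate by a fixed fraction after every epoch.
decay = LearningRateScheduler(lambda epoch: 1e-4 * (0.95 ** epoch))
model.fit(X, y, epochs=250, validation_split=0.2, callbacks=[decay])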

0 Answers