
I'm using TensorFlow with Keras to train a char-RNN on Google Colab. I train my model for 10 epochs and save it with model.save(), as shown in the documentation for saving models. Immediately afterwards I load it again just to check, call model.fit() on the loaded model with the exact same training set, and get a "Dimensions must be equal" error. The training data is a tf.data.Dataset organised in batches, as shown in the documentation for tf datasets. Here is a minimal working example:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

X = np.random.randint(0,50,(10000))  # dummy integer-encoded "text"

seq_len = 150
batch_size = 20
dataset = tf.data.Dataset.from_tensor_slices(X)
# Chop the stream into chunks of seq_len+1 characters, then split each chunk
# into an (input, target) pair shifted by one position.
dataset = dataset.batch(seq_len+1,drop_remainder=True)
dataset = dataset.map(lambda x: (x[:-1],x[1:]))
dataset = dataset.shuffle(20).batch(batch_size,drop_remainder=True)

def make_model(vocabulary_size,embedding_dimension,rnn_units,batch_size,stateful):
  model = Sequential()
  model.add(Embedding(vocabulary_size,embedding_dimension,
                      batch_input_shape=[batch_size,None]))
  model.add(LSTM(rnn_units,return_sequences=True,stateful=stateful))
  model.add(Dense(vocabulary_size))
  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                optimizer='adam',metrics=['accuracy'])
  model.summary()
  return model

vocab_size = 51
emb_dim = 20
rnn_units = 10
model = make_model(vocab_size,emb_dim,rnn_units,batch_size,False)

model.fit(dataset,epochs=10)
model.save('/content/test_model')
model2 = tf.keras.models.load_model('/content/test_model')
model2.fit(dataset,epochs=10)

The first training call, model.fit(), runs fine, but the last line raises the error:

ValueError: Dimensions must be equal, but are 20 and 150 for '{{node Equal}} = Equal[T=DT_INT64, incompatible_shape_error=true](ArgMax, ArgMax_1)' with input shapes: [20], [20,150].

I want to be able to resume training later, as my real dataset is much larger. Therefore, saving only the weights is not an ideal option.

Any advice? Thanks!

gmedina-v

2 Answers


If you have saved checkpoints, then you can resume training from those checkpoints, even with a reduced dataset. Your neural network's layers and dimensions must stay the same.
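
A sketch of the checkpoint approach I mean, assuming a Colab path and default callback settings (the file names are illustrative):

import tensorflow as tf

# Save a full model checkpoint at the end of every epoch (path is illustrative).
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='/content/checkpoints/ckpt_{epoch:02d}',
    save_weights_only=False)

model.fit(dataset, epochs=10, callbacks=[checkpoint_cb])

# Later: restore the checkpointed model and resume, possibly on a reduced
# dataset, as long as the layers and dimensions are unchanged.
restored = tf.keras.models.load_model('/content/checkpoints/ckpt_10')
restored.fit(dataset, epochs=5)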

ML85
  • Saving the model using checkpoints results in the same behaviour as using model.save(): when I try to train the loaded model I still get the same error. – gmedina-v Dec 06 '20 at 14:14
  • What is emb_dim = 20? This seems new to the saved model. Also, before the data goes into the model, check its shape and see whether it is the same. I assume your new configuration is detecting a different shape for the reason stated above. – ML85 Dec 06 '20 at 14:27
  • 'emb_dim' is the embedding dimension for the Embedding layer, the first layer in the model. The description of the 'dataset' object is: . However, the '20' in the error message comes from batch_size. Setting batch_size = 30 from the start, the error message becomes: "Dimensions must be equal, but are 30 and 150 for '{{node Equal}} = Equal[T=DT_INT64, incompatible_shape_error=true](ArgMax, ArgMax_1)' with input shapes: [30], [30,150]". Thus this must be related to using the BatchDataset object. – gmedina-v Dec 06 '20 at 15:35
  • data.reshape() can help if you can reshape the data to match what the input demands. – ML85 Dec 06 '20 at 16:07
  • Note that the problem is not the shape of the data: training a freshly built model works just fine; it is only training the loaded model that raises the error. I have also verified that the issue is not related to using the BatchDataset object, as I get the same error if I use numpy arrays for the dataset and pass the batch size in the fit() call. – gmedina-v Dec 06 '20 at 17:22

The problem is the 'accuracy' metric. For some reason, the dimensions of the predictions are mishandled when a model compiled with this metric is loaded, as I found in this thread (see the last comment). Running model.compile() on the loaded model with the same metric allows training to continue. However, it should not be necessary to compile the model again, and doing so means the optimiser state is lost, as explained in this answer, so it is not very useful for resuming training.
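
Concretely, the workaround on my minimal example looks roughly like this (same names as in the question):

model2 = tf.keras.models.load_model('/content/test_model')
# Re-compiling with the same loss and metric works around the error,
# but it resets the optimiser state.
model2.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
               optimizer='adam', metrics=['accuracy'])
model2.fit(dataset, epochs=10)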

On the other hand, using 'sparse_categorical_accuracy' from the start works just fine: I can load the model and continue training without recompiling. In hindsight, this metric is the more appropriate choice anyway, given that the outputs of my last layer are logits over the distribution of characters, so this is a multiclass rather than a binary classification problem. Nonetheless, I verified that 'accuracy' and 'sparse_categorical_accuracy' return the same values in my specific example, so I believe Keras internally converts 'accuracy' to sparse categorical accuracy; something goes wrong when this conversion happens on a freshly loaded model, which forces the recompile.
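
In other words, the only change to the example in the question is the metric passed at compile time, roughly:

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam', metrics=['sparse_categorical_accuracy'])

model.fit(dataset, epochs=10)
model.save('/content/test_model')
# The loaded model now continues training without a second compile step.
model2 = tf.keras.models.load_model('/content/test_model')
model2.fit(dataset, epochs=10)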

I also verified that if the saved model was compiled with 'accuracy', loading it and recompiling with 'sparse_categorical_accuracy' allows training to resume. However, as mentioned above, this discards the optimiser state, and I suspect it would be no better than building a new model and loading only the weights from the saved one.
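
For completeness, the weights-only alternative I am comparing against would look roughly like this (the path is illustrative, and the optimiser starts from scratch):

model.save_weights('/content/test_weights')
fresh = make_model(vocab_size, emb_dim, rnn_units, batch_size, False)
fresh.load_weights('/content/test_weights')
fresh.fit(dataset, epochs=10)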

gmedina-v