After upgrading my notebook's operating system from Ubuntu 16.04 to 18.04, I noticed that my Keras code (using the TensorFlow backend) became incredibly slow in the conda environment where I have tensorflow-gpu installed.
Basically, some simple CNN models now take forever to train (as if they were running on the CPU), even though a quick inspection via the nvidia-smi
command shows a Python process engaged by the GPU (an Nvidia GeForce GTX 1070).
I then updated the CUDA libraries (from version 7 to version 9) and updated cuDNN accordingly to be compatible with the new CUDA version.
I also updated the tensorflow-gpu and keras packages to their latest versions, but training still runs way slower than in my previous setup.
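For what it's worth, a quick sanity check that tensorflow-gpu can actually see the card (a minimal sketch; device_lib lives in a semi-internal module, but it has been stable across TensorFlow 1.x releases):

```python
from tensorflow.python.client import device_lib

# Enumerate the devices TensorFlow can actually use; a healthy
# tensorflow-gpu install should list a /device:GPU:0 entry
# (the GTX 1070) alongside the CPU.
devices = device_lib.list_local_devices()
for d in devices:
    print(d.name, d.device_type)

gpu_visible = any(d.device_type == 'GPU' for d in devices)
print('GPU visible to TensorFlow:', gpu_visible)
```

Note that nvidia-smi showing a Python process only proves the process opened the device, not that the heavy ops were placed on it.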
To show an example, here's a fragment of the code I am running, with the model defined as follows:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100, 60, 3)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dense(26, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
which, in my previous setup, would start producing output after a few seconds when fit like this:
history = model.fit_generator(train_generator,
                              steps_per_epoch=260000 // 32,
                              epochs=10,
                              validation_data=test_generator,
                              validation_steps=52000 // 32,
                              verbose=2)
Epoch 1/10
166s - loss: 0.2333 - acc: 0.9291 - val_loss: 0.0073 - val_acc: 0.9982
Now each epoch takes a very long time (more than 45 minutes instead of 166 seconds!). Does anyone have an idea why this is happening? Do I need to revert to Ubuntu 16.04? I am pretty upset about this behaviour...
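For reference, the old throughput can be backed out from those numbers (a rough calculation that ignores the validation pass at the end of each epoch):

```python
# Rough throughput of the original (pre-upgrade) setup.
steps_per_epoch = 260000 // 32   # 8125 batches, as passed to fit_generator
batch_size = 32
epoch_seconds = 166              # reported wall time for epoch 1

samples_per_second = steps_per_epoch * batch_size / epoch_seconds
print(round(samples_per_second))  # -> 1566
```

So the old setup was pushing roughly 1,500 samples per second, while 45+ minutes per epoch would correspond to well under 100.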
Edit:
I ran a performance test between CPU and GPU using the models found in https://medium.com/@andriylazorenko/tensorflow-performance-test-cpu-vs-gpu-79fcd39170c, and my GPU seems to work well: over 10,000 samples processed per second on average in each epoch, versus around 400 samples per second in CPU mode. However, my Keras code still behaves strangely. This is the ETA for one epoch in the GPU environment after my Ubuntu update (I never let it finish, since it would take hours):
Epoch 1/1
6/507 [..............................] - ETA: 5:19:17 - loss: 3.2632 - acc: 0.0397
while this is the same output produced by Keras in a plain CPU environment, again with TensorFlow as the backend:
Epoch 1/1
4/507 [..............................] - ETA: 4850s - loss: 3.2671 - acc: 0.0293
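Converting the two ETAs to seconds makes the anomaly explicit: the supposedly GPU-backed run projects roughly four times slower than plain CPU.

```python
# Compare the two projected epoch times from the Keras progress bars.
gpu_eta_s = 5 * 3600 + 19 * 60 + 17   # ETA 5:19:17 in the GPU environment
cpu_eta_s = 4850                      # ETA 4850s in the CPU environment

print(gpu_eta_s)                        # -> 19157
print(round(gpu_eta_s / cpu_eta_s, 1))  # roughly 4x slower than CPU
```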
There's something wrong going on in Keras, apparently...