
I'm running a series of neural networks (the Keras library with a TensorFlow backend), and I have the following results for the time it took to train each network in a Jupyter Notebook:

ELAPSED TIME: 2.7005105018615723
0
ELAPSED TIME: 2.4810903072357178
1
ELAPSED TIME: 2.801435708999634
2
ELAPSED TIME: 2.6753993034362793
3
ELAPSED TIME: 2.8625667095184326
4
ELAPSED TIME: 2.5828065872192383
5

while later on I get:

ELAPSED TIME: 5.062163829803467
0
ELAPSED TIME: 5.162402868270874
1
ELAPSED TIME: 5.301288366317749
2
ELAPSED TIME: 5.386904001235962
3
ELAPSED TIME: 6.126806020736694
4

The program consists of a function that trains a separate neural network model on each respective dataset and exports only its final training accuracy (saved to another file).

I thought the latter networks were taking longer to train because the program was consuming too much memory, so I deleted each model (with the del keyword) after obtaining its training accuracy, but that doesn't seem to be doing much.

If I restart the Jupyter Notebook kernel, the time to train each network shortens back to about 2 seconds (the original duration), but the later models again take progressively longer to run.

What could be possible reasons for this, and what solutions could be implemented?


NOTE: I did not include the full code because it would make this post denser, but I can upload it if necessary.
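For reference, though, the loop has roughly this shape (a simplified sketch; build_model and datasets are placeholder names, not my actual code):

import time

for i, (X, y) in enumerate(datasets):          # datasets: list of (features, labels) pairs
    start = time.time()
    model = build_model()                      # constructs and compiles a fresh Keras model
    history = model.fit(X, y, epochs=10, verbose=0)
    print('ELAPSED TIME:', time.time() - start)
    print(i)
    final_acc = history.history['acc'][-1]     # final training accuracy, written to a file
    del model                                  # attempt to free memory; doesn't help much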

2 Answers


Are you running on an NVIDIA GPU? If so, it's possible some part of the old model is still resident on the GPU. Try running nvidia-smi while the slow model is running and see whether anything else is using up GPU memory or resources.
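For example, on Linux you can keep the report refreshing every second while the training loop runs:

watch -n 1 nvidia-smi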

If that doesn't turn anything up, you can also generate the TensorFlow timeline and compare it between the slow and fast runs. There is more info on how to generate the timeline in Keras here: https://github.com/tensorflow/tensorflow/issues/9868, and the code below for creating the timeline was taken from that link:

import tensorflow as tf
from tensorflow.python.client import timeline

# Build your Keras model
# ...
# Attach tracing options so each session run records per-op timing
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
model.compile(loss='MSE', optimizer='Adam', options=run_options, run_metadata=run_metadata)
# Train the model in your usual way (e.g. model.fit(...))
# ...
# Dump the collected step stats as a Chrome trace file
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.ctf.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())

For more info on the TensorFlow timeline, see https://stackoverflow.com/a/37774470/2826818. The timeline shows how much time each operation takes, so you can determine which one is causing the slowdown.
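Once timeline.ctf.json is written, you can load it in Chrome by opening chrome://tracing and clicking Load; loading a trace from a fast run next to one from a slow run makes it easy to spot which operations grew.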

alexbhandari

You can clear your session after you are done with each model; this fixed the issue for me. (The dataset and model-builder names in the sketch below are placeholders.)

from keras import backend as K

for X, y in datasets:        # one (features, labels) pair per model
    model = make_model()     # build each model inside the loop, after the
                             # previous graph has been cleared
    model.fit(X, y)
    # save the metrics you need here, before the session is cleared
    K.clear_session()        # resets Keras's global TensorFlow graph and session

Note that each model must be constructed inside the loop: building all the models up front and then calling K.clear_session() between fits would invalidate the graphs of the models not yet trained.
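This works because, with the TensorFlow 1.x backend, every model you build adds nodes to the same global graph, so each successive training run has more graph state to deal with; K.clear_session() destroys that graph and its session so every iteration starts from a clean slate, which matches the gradual slowdown (and the reset after a kernel restart) described in the question.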

(Also, I was getting a segmentation fault until I updated tensorflow-gpu to 1.9.0.)