I'm training a few Keras models one after the other on a remote server using userdocker. I connect to the server over ssh and run them in separate screen sessions.

To speed things up, I train on 5 GPUs so that 5 different models train at the same time.

Most of the time the models train without any issues: I detach the screen, log out of the server and let them run overnight. Sometimes, however, they stop in the middle of training with a broken pipe error. Below I include the last part of the message, as I think it is the most relevant; the full output is very long and repetitive.

I found this question with a slightly similar error message, and it linked to this explanation, but I fail to see how to fix it in my case, or where my code might be causing the problem, since it doesn't always happen.

Has anyone encountered a similar problem when using Keras or userdocker? How can one prevent it from happening?

Error Message


23/24 [===========================>..] - ETA: 7s - loss: 1.7797 - acc: 0.2219
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/data_utils.py", line 655, in _data_generator_task
    self.queue.put((True, generator_output))
  File "<string>", line 2, in put
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    kind, result = conn.recv()
EOFError

Process Process-259:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/data_utils.py", line 665, in _data_generator_task
    self.queue.put((False, e))
  File "<string>", line 2, in put
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "main.py", line 406, in <module>
    group_main(sys.argv[1:])
  File "main.py", line 346, in group_main
    steps_per_epoch)
  File "/netscratch/user01/perge/scripts/pycharm_perge/model_trainer.py", line 78, in train_model
    steps_per_epoch=steps_per_epoch, epochs=epochs, callbacks=callBacksList)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 1256, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 2195, in fit_generator
    workers=0)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 2310, in evaluate_generator
    generator_output = next(output_generator)
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/data_utils.py", line 751, in get
    if not self.queue.empty():
  File "<string>", line 2, in empty
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
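As far as I can tell from the traceback, the calls that fail are `conn.recv()` and `conn.send(...)` inside `multiprocessing/managers.py`, which suggests that the process on the other end of the connection was already gone. The following is only a minimal sketch of that mechanism, not my training code: it assumes Python 3 and the standard library, uses a plain `multiprocessing.Pipe` in place of the manager connection, and `errors_after_peer_closed` is a made-up name for illustration.

```python
import errno
import multiprocessing


def errors_after_peer_closed():
    """Return the errors raised by send() and recv() once the peer is closed."""
    # Pipe() returns two connected multiprocessing Connection objects.
    near_end, far_end = multiprocessing.Pipe()
    # Closing one end stands in for the process on the other side dying.
    far_end.close()

    try:
        near_end.send(("put", "batch"))
        send_error = None
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        send_error = exc

    try:
        near_end.recv()
        recv_error = None
    except EOFError as exc:
        recv_error = exc

    near_end.close()
    return send_error, recv_error


if __name__ == "__main__":
    send_error, recv_error = errors_after_peer_closed()
    # Compare against errno.EPIPE, the "Broken pipe" error code.
    print(type(send_error).__name__, send_error.errno == errno.EPIPE)
    print(type(recv_error).__name__)
```

This only illustrates what the two exception types mean at the connection level. It does not explain why the other process disappears in the middle of training, which is the part I cannot figure out.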

Girauder