I have written a neural network in line with the tensorflow guide on distributed training: https://www.tensorflow.org/deploy/distributed

The cluster I would like to run the training on uses Torque for job scheduling and distribution. How does this fit in with TensorFlow and the way it distributes training over the cluster?

Do I start the training on one node through Torque and let TensorFlow distribute it from there, or would that clash with how Torque works? Is Torque even necessary if TensorFlow can handle the distribution itself? How do I avoid clashes between the two?

Thanks in advance.

Devon Jarvis

1 Answer

Torque and distributed TensorFlow handle different tasks that are not directly related to each other. Torque allocates the resources of a cluster among multiple jobs; within one job, only the resources that job requested are available. Distributed TensorFlow parallelizes the TensorFlow task across the resources available within one job.

Normally you would use torque to get all the needed resources for the tensorflow task and then use distributed tensorflow to distribute the task over the resources that were provided by torque.

If tf.train.ClusterSpec is initialized correctly with the resources made available by torque, there will be no conflicts.
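As a rough illustration of what initializing the cluster spec from Torque's allocation could look like, here is a minimal sketch. It assumes the job script can read the node file that Torque exposes through the PBS_NODEFILE environment variable. The helper names, the port 2222, and the choice of using the first host as the single parameter server are assumptions made for the example, not something Torque or TensorFlow prescribes.

```python
import os


def read_torque_hosts(nodefile_path):
    """Return the unique hostnames listed in a Torque node file, in order.

    Torque usually writes one line per allocated slot, so a host can
    appear several times; duplicates are dropped here.
    """
    hosts = []
    with open(nodefile_path) as nodefile:
        for line in nodefile:
            host = line.strip()
            if host and host not in hosts:
                hosts.append(host)
    return hosts


def build_cluster_dict(hosts, num_ps=1, port=2222):
    """Split hosts into parameter servers and workers (hypothetical layout).

    The first `num_ps` hosts become "ps" tasks and the rest become
    "worker" tasks. The port is an arbitrary assumption.
    """
    if len(hosts) <= num_ps:
        raise ValueError("need more hosts than parameter servers")
    addresses = ["{}:{}".format(host, port) for host in hosts]
    return {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}


def cluster_dict_from_torque():
    """Build the cluster dictionary from the current Torque job's node file."""
    return build_cluster_dict(read_torque_hosts(os.environ["PBS_NODEFILE"]))
```

Inside the job, the resulting dictionary can be passed to tf.train.ClusterSpec, for example `tf.train.ClusterSpec(cluster_dict_from_torque())`. Each node then still has to start its own tf.train.Server with the matching job name and task index.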

BlueSun
  • Thanks for the help, BlueSun; your answer helped a lot. However, I have run into a related problem. When I run a TensorFlow training session from the head node as one job, I get the following error: "ImportError: No module named tensorflow", even though TensorFlow is installed on all nodes of the cluster. I tried using the Torque job file to open a TensorFlow shell on every node with "source activate tensorflow" in the PBS file, but this didn't help either. What possible solutions should I look into? – Devon Jarvis Oct 05 '17 at 12:16
  • @DevonJarvis There could be many reasons for the ImportError. You can try reading through the answers to this question: https://stackoverflow.com/questions/14295680/cannot-import-a-python-module-that-is-definitely-installed-mechanize – BlueSun Oct 05 '17 at 17:39
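
One way to narrow down an ImportError like the one in the comments is to check which Python interpreter the job actually runs on each node, since a job may pick up a different interpreter than the one an interactive shell uses. The following is only a diagnostic sketch: the function name is made up for this example, and it assumes Python 3 (importlib.util.find_spec is not available in Python 2).

```python
import importlib.util
import socket
import sys


def describe_python_environment(module_name="tensorflow"):
    """Report which interpreter is running and whether it can find a module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "host": socket.gethostname(),
        "executable": sys.executable,
        "module_found": spec is not None,
        "module_location": spec.origin if spec is not None else None,
    }
```

Printing the result of `describe_python_environment()` from within the Torque job on each node shows whether the job is using the interpreter in which TensorFlow was installed.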