
I'm trying to learn distributed TensorFlow. I tried out a piece of code as explained here:

with tf.device("/cpu:0"):
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))

with tf.device("/cpu:1"):
    y = tf.nn.softmax(tf.matmul(x, W) + b)
    loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

Getting the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'MatMul': Operation was explicitly assigned to /device:CPU:1 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device.
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:CPU:1"](Placeholder, Variable/read)]]

Meaning that TensorFlow does not recognize CPU:1.

I'm running on a Red Hat server with 40 CPUs (as counted by cat /proc/cpuinfo | grep processor | wc -l).
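For reference, the same count can be read from Python's standard library (a minimal sketch; os.cpu_count reports logical processors, which is what the /proc/cpuinfo pipeline counts on Linux):

```python
import os
import multiprocessing

# Number of logical processors visible to the OS; on Linux this matches
# the count of "processor" entries in /proc/cpuinfo.
logical_cpus = os.cpu_count()
print(logical_cpus)

# multiprocessing reports the same figure.
assert logical_cpus == multiprocessing.cpu_count()
```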

Any ideas?

Elad Weiss
    Do you have 40 cpus or 40 cores? – raam86 Aug 31 '17 at 16:00
  • @raam86 According to https://askubuntu.com/questions/724228/how-to-find-the-number-of-cpu-cores-including-virtual, 40 CPUs. – Elad Weiss Aug 31 '17 at 16:03
  • I once used multi-CPU processing with scikit-learn (the GridSearchCV function) over a TensorFlow backbone, so I guess it is possible. However, I'm not really sure how to implement it at the TensorFlow level. – Eduardo Aug 31 '17 at 16:03
  • See if this can help you: https://stackoverflow.com/a/37864489/4834515 – LI Xuhong Aug 31 '17 at 16:30

2 Answers


Following the link in the comment:

It turns out the session should be configured to expose a device count greater than 1:

import tensorflow as tf

config = tf.ConfigProto(device_count={"CPU": 8})
with tf.Session(config=config) as sess:
    ...
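Putting it together, a minimal runnable sketch (TensorFlow 1.x API, as used above; the device count of 2 and the constants are illustrative, and in TF 2.x these knobs live under tf.compat.v1):

```python
import tensorflow as tf

# Tell the session to expose two CPU devices instead of the default one.
config = tf.ConfigProto(device_count={"CPU": 2})

with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0]])
with tf.device("/cpu:1"):  # valid now that the session exposes /cpu:1
    b = tf.matmul(a, tf.constant([[3.0], [4.0]]))

with tf.Session(config=config) as sess:
    print(sess.run(b))  # [[11.]]
```

Note that these extra CPU devices are scheduling targets backed by the same physical cores; intra-op parallelism across all cores happens even with a single /cpu:0 device.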

Kind of shocking that I missed something so basic, and that no one could pinpoint an error which seems so obvious.

I'm not sure whether the problem is with me or with the TensorFlow code samples and documentation. Since it's Google, I'll have to say it's me.

Elad Weiss

First, just run it on "one CPU" and see whether TensorFlow distributes threads across all of the CPUs appropriately. It will likely multithread correctly, and you won't have to do anything.

If it doesn't, you should try launching multiple TensorFlow instances with different CPU affinities and building a "distributed" system. TensorFlow has distributed services for multiple machines; the same mechanism should work with separate processes on one machine, as long as you set things up so that they aren't writing to the same locations. You can get started at https://www.tensorflow.org/deploy/distributed . You might want to set the CPU affinities so that there is one process per physical CPU, à la https://askubuntu.com/questions/102258/how-to-set-cpu-affinity-to-a-process
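As a hedged sketch of what the setup in that link looks like for two processes on one machine (TensorFlow 1.x API; the ports and the hard-coded task index are hypothetical choices, and each process would normally read its own index from a flag):

```python
import tensorflow as tf

# Hypothetical cluster: two worker tasks on the same machine, on
# different ports, each launched as a separate process (optionally
# pinned to different cores with taskset).
cluster = tf.train.ClusterSpec(
    {"worker": ["localhost:2222", "localhost:2223"]})

# Each process starts the server for its own task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Ops can then be pinned to a specific task with tf.device, e.g.:
with tf.device("/job:worker/task:1"):
    c = tf.constant(1.0)
```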

Alex Meiburg