
I want to deploy a Keras model with TensorFlow Serving. The model was converted from a Keras .h5 model to a .pb file (the original model comes from [here](https://github.com/shaoanlu/face_toolbox_keras)).
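For context, the conversion from .h5 to a servable .pb is roughly of this shape. This is a minimal sketch assuming tf.keras and TF 2.x-style export (under TF 1.x, tf.saved_model.simple_save on the Keras session is the usual route); the paths are placeholders:

    import tensorflow as tf

    # Load the trained Keras model ('model.h5' is a placeholder path;
    # pass custom_objects here if the model uses custom layers).
    model = tf.keras.models.load_model('model.h5')

    # Export a versioned SavedModel directory ('export/1' is a placeholder)
    # that tensorflow-serving can load.
    tf.saved_model.save(model, 'export/1')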

When performing inference on this model with Keras using only my CPU, all 12 cores are working and inference takes on average 0.7s.

After converting the model and serving it with TensorFlow Serving, only one core is used and inference takes on average 2.7s.

I tried setting options like --tensorflow_session_parallelism, --tensorflow_intra_op_parallelism and --tensorflow_inter_op_parallelism to 12, but nothing changes: looking at top from inside the tfserving container, only one core is working.
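For reference, a minimal sketch of the kind of invocation these options belong to, assuming the stock tensorflow/serving Docker image (model name and mount paths are placeholders; which parallelism flags are available depends on the TF Serving version, with older releases exposing only --tensorflow_session_parallelism):

    docker run -p 8501:8501 \
        -v "$PWD/export:/models/mymodel" \
        -e MODEL_NAME=mymodel \
        tensorflow/serving \
        --tensorflow_intra_op_parallelism=12 \
        --tensorflow_inter_op_parallelism=12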

I also tried compiling tensorflow-serving for my machine's architecture, and I'm getting a slight improvement (2.7s to 2.5s), but I still can't control the number of cores used per session.

I suppose it's nice that the other cores stay available for concurrent requests, but I'd like to have more control.


1 Answer


The issue is likely caused by the constant folding pass. When the inputs are baked into the graph as constants, the graph optimizer can fold the computation away and evaluate it once at load time, outside the session's parallel executor. Feeding the inputs through tf.placeholder keeps the ops in the graph, so they run with the configured intra-op parallelism.

    import numpy as np
    import tensorflow as tf

    if args.const_fold:
        # Inputs baked into the graph as constants: candidates for
        # constant folding at graph-optimization time.
        A = tf.ones([size, size], name="A%s" % i)
        B = tf.ones([size, size], name="B%s" % i)
    else:
        # Inputs fed at run time through placeholders, so the ops
        # stay in the graph and run in the session executor.
        A_name = "A%s" % i
        B_name = "B%s" % i
        A = tf.placeholder(tf.float32, shape=[size, size], name=A_name)
        B = tf.placeholder(tf.float32, shape=[size, size], name=B_name)
        feed["%s:0" % A_name] = np.random.rand(size, size)
        feed["%s:0" % B_name] = np.random.rand(size, size)

As per my understanding, your code probably looks like the if block above, with the inputs baked in as constants. Changing it to the pattern in the else block should resolve the issue.
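A minimal sketch of how the feed dict would then be consumed at inference time (the tf.matmul and tf.Session usage below is an assumption, following the TF 1.x style of the snippet above):

    # Run the multiplication with the placeholder inputs fed at run time,
    # so the matmul executes inside the session's parallel executor.
    C = tf.matmul(A, B)
    with tf.Session() as sess:
        result = sess.run(C, feed_dict=feed)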

For more information, please refer to this Stack Overflow link.