
I would like to understand a little more about these two parameters: `intra_op_parallelism_threads` and `inter_op_parallelism_threads`.

import tensorflow as tf

session_conf = tf.ConfigProto(
  intra_op_parallelism_threads=1,
  inter_op_parallelism_threads=1)

I read this post which has a pretty good explanation: TensorFlow: inter- and intra-op parallelism configuration

But I am seeking confirmation, and I also have some new questions below. I am running my task with Keras 2.0.9 and TensorFlow 1.3.0:

  1. When both are set to 1, does it mean that, on a computer with 4 cores for example, there will be only 1 thread shared by the four cores?
  2. Why does using 1 thread not seem to affect my task much in terms of speed? My network has the following structure: dropout, conv1d, maxpooling, lstm, globalmaxpooling, dropout, dense. The post cited above says that if there are a lot of matrix multiplication and subtraction operations, a multi-threaded setting can help. I do not know much about the math underneath, but I'd imagine there are quite a lot of such matrix operations in my model. However, setting both params from 0 to 1 only causes a 1-minute slowdown on a 10-minute task.
  3. Why can multi-threading be a source of non-reproducible results? See Results not reproducible with Keras and TensorFlow in Python. This is the main reason I need to use single threads, as I am doing scientific experiments. And surely TensorFlow has been improving over time, so why is this not addressed in a release?
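For context, the single-thread setting is usually combined with seeding every random number generator involved. A common setup for the TF 1.x / Keras versions mentioned above looks like the sketch below (the names match the TF 1.x API, e.g. `tf.set_random_seed`, and may need adjusting for other versions):

```python
import random

import numpy as np
import tensorflow as tf
from keras import backend as K

# Seed every RNG that can influence training
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)

# Force single-threaded op execution for reproducibility
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
K.set_session(tf.Session(config=session_conf))
```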

Many thanks in advance

Ziqi

2 Answers

  1. When both parameters are set to 1, there will be 1 thread running on 1 of the 4 cores. The core on which it runs might change but it will always be 1 at a time.

  2. When running something in parallel, there is always a trade-off between time lost on communication and time gained through parallelization. Depending on the hardware used and the specific task (such as the size of the matrices), the speedup will vary. Sometimes running something in parallel will even be slower than using one core.

  3. For example, when using floats on a CPU, (a + b) + c will not be equal to a + (b + c) because of finite floating-point precision. Using multiple parallel threads means that operations like a + b + c will not always be computed in the same order, leading to different results on each run. However, those differences are extremely small and will not affect the overall result in most cases. Completely reproducible results are usually only needed for debugging. Enforcing complete reproducibility would slow down multi-threading a lot.
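The non-associativity in point 3 is easy to demonstrate in plain Python (IEEE 754 doubles, no TensorFlow involved):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds to 0.30000000000000004
right = a + (b + c)  # 0.2 + 0.3 rounds to exactly 0.5

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

Summing the same values in a different order gives two different doubles, which is exactly what happens when thread scheduling changes the reduction order.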

BlueSun
  • I don't agree with point 1 of your answer. In this case, tensorflow will generate 4 threads. – fisakhan Aug 28 '20 at 14:28
  • @Fisa Well, yes and no. It's true that tensorflow will spawn a lot of threads, more than 4 actually, even on a 4-core machine. I only mean the main thread that executes the tf graph. That one will only use one core if intra and inter threads are set to 1. To check this, you have to look at the core loads and not the number of threads. Run some CPU-heavy tf graph, once with a normal session and once with intra and inter threads set to 1. You will see that in the first case all cores will have load, while in the second case only 1 core will be at 100% load and the others at ~0%. – BlueSun Aug 28 '20 at 15:22
  • To make sure that no other cores are used, not even for minimal load, you can restrict the python process to one core. For example this should work: `taskset -c 0 python3 main.py`. Tensorflow will still create many threads but now they all should be restricted to one core. – BlueSun Aug 28 '20 at 15:36
  • Thanks @BlueSun but Question 1 is not about the load per core or cpu but about the number/count of generated threads. – fisakhan Aug 29 '20 at 18:04
  • Do you know how to restrict/confine TensorFlow to generate no more than 1 thread (no matter what the number of cores is)? – fisakhan Aug 29 '20 at 18:08
  • @Fisa There is no way to reduce the number of threads to 1 unless you rewrite some parts of tensorflow. But I think you miss the point. There is no way to reduce the threads because there is no need to: mostly idle, sleeping threads should not pose any problems. The intra and inter thread parameters are there to control how many resources are used and how much the calculations are parallelized, not to reduce the total thread count, and they are working as intended. Why would you insist on having only 1 thread? – BlueSun Aug 29 '20 at 18:23
  • I insist on reducing the number of thread to 1 or 2 because it is a requirement from NIST FRVT. Please click on the following links for more details about such a requirement. https://github.com/usnistgov/frvt/issues/12 and https://stackoverflow.com/questions/60206113/how-to-stop-tensorflow-from-multi-threading – fisakhan Aug 31 '20 at 10:31
  • @Fisa I don't understand why it is a requirement for them. Are they running multiple processes on each cpu core? Anyway, it is interesting that apparently someone managed to have only 1-2 tf threads. Have your tried building tf from source with xla and cuda disabled? Also makes sense to test it with `taskset -c 0` since they lock to a single cpu core as well. – BlueSun Sep 02 '20 at 19:23
  • taskset, numactl and setting thread affinity can lock a given process to a given core but don't change the number of threads. Setting OMP_NUM_THREADS from the environment can change the number of threads, but TensorFlow overwrites those settings and generates N threads. Now I'm trying to build the TensorFlow C++ API from source using Bazel; maybe that works. – fisakhan Sep 03 '20 at 14:38
  • Building tensorflow from source is a headache. – fisakhan Sep 03 '20 at 14:42

Answer to question 1 is "No".

Setting both parameters to 1 (intra_op_parallelism_threads=1, inter_op_parallelism_threads=1) will generate N threads, where N is the count of cores. I've tested this multiple times on different versions of TensorFlow. It is true even for the latest version of TensorFlow. There are multiple questions on how to reduce the number of threads to 1, but none has a clear answer. Some examples are:

  1. How to stop TensorFlow from multi-threading
  2. https://github.com/usnistgov/frvt/issues/12
  3. Changing the number of threads in TensorFlow on Cifar10
  4. Importing TensorFlow spawns threads
  5. https://github.com/tensorflow/tensorflow/issues/13853
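You can check this claim empirically on Linux by counting the entries under /proc/self/task before and after TensorFlow is imported and configured (a Linux-only sketch; on other platforms `psutil.Process().num_threads()` is a portable alternative):

```python
import os

def os_thread_count():
    # On Linux, each directory under /proc/self/task corresponds to
    # one OS thread of the current process.
    return len(os.listdir("/proc/self/task"))

# Call this before and after creating a tf.Session to see how many
# threads TensorFlow actually spawns, regardless of the config values.
print(os_thread_count())
```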
fisakhan