
I'm currently trying to find an effective way of running a machine learning task over a fixed number of cores using TensorFlow. From the information I found, there were two main approaches to doing this.

The first was using the two TensorFlow configuration fields intra_op_parallelism_threads and inter_op_parallelism_threads and then creating a session using this configuration.

The second was using OpenMP: setting the environment variable OMP_NUM_THREADS controls the number of threads spawned for the process.
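For example (a minimal sketch; the thread count of 4 is an arbitrary placeholder): the OpenMP runtime reads OMP_NUM_THREADS when it is first initialised, so the variable has to be set before the library that loads OpenMP is imported:

```python
import os

# OpenMP reads OMP_NUM_THREADS when the runtime is first initialised,
# so the variable must be set before TensorFlow (which, in an MKL build,
# loads the OpenMP runtime) is imported. The value 4 is just an example.
os.environ["OMP_NUM_THREADS"] = "4"

# import tensorflow as tf  # import only after the variable is in place
```

Setting the variable after TensorFlow has been imported may silently have no effect, since the OpenMP runtime has already picked up its thread count.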

My problem arose when I discovered that installing TensorFlow through conda and through pip gave two different environments. In the conda install, modifying the OpenMP environment variables seemed to change the way the process was parallelised, whilst in the 'pip environment' the only thing that appeared to change it was the inter/intra config variables mentioned earlier.

This led to some difficulty in trying to compare the two installs for benchmarking purposes. If I set OMP_NUM_THREADS to 1 and inter/intra to 16 on a 48-core processor on the conda install, I only get about 200% CPU usage, as most of the threads are idle at any given time.

import os

# These must be set before TensorFlow (and hence MKL/OpenMP) is imported.
omp_threads = 1
mkl_threads = 1
os.environ["OMP_NUM_THREADS"] = str(omp_threads)
os.environ["MKL_NUM_THREADS"] = str(mkl_threads)

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 16  # threads available to a single op
config.inter_op_parallelism_threads = 16  # ops that may run in parallel
session = tf.Session(config=config)
K.set_session(session)

I would expect this code to spawn 32 threads, most of which are utilized at any given time, when in fact it spawns 32 threads and only 4-5 are being used at once.

Has anyone run into anything similar before when using TensorFlow?

Why is it that installing through conda and through pip seems to give two different environments?

Is there any way of having comparable performance on the two installs by using some combination of the two methods discussed earlier?

Finally, is there maybe an even better way to limit Python to a specific number of cores?

– Nikaido

2 Answers


I think the point here is that conda installs TensorFlow built with MKL, but pip does not.

OpenMP control only works with MKL; with the pip install, the OpenMP environment variables have no effect, and only setting the session config's intra/inter parallelism threads affects multi-threading.

– Litchy

An answer to your first and last questions:

Yes, I ran into a similar situation while using TensorFlow installed through pip. You can limit Python to a specific number of cores by using thread affinity, numactl, or taskset on Linux.

Looking at the details provided by the following links, TensorFlow will always generate multiple threads, and most of them will be sleeping by default.

– fisakhan