
I am working on a fairly large C++ project with a strong emphasis on performance. It therefore relies on the Intel MKL library and on OpenMP. I recently observed a considerable memory leak that I narrowed down to the following minimal example:

#include <atomic>
#include <iostream>
#include <thread>

class Foo {
public:
  Foo() : calculate(false) {}

  // Start the thread
  void start() {
    if (calculate) return;
    calculate = true;
    thread = std::thread(&Foo::loop, this);
  }

  // Stop the thread
  void stop() {
    if (!calculate) return;

    calculate = false;
    if (thread.joinable())
      thread.join();
  }

private:
  // function containing the loop that is continually executed
  void loop() {
    while (calculate) {
      #pragma omp parallel
      {
      }
    }
  }

  std::atomic<bool> calculate;
  std::thread thread;
};


int main() {
  Foo foo;
  foo.start();
  foo.stop();
  foo.start();

  // Let the program run until the user inputs something
  int a;
  std::cin >> a;

  foo.stop();
  return 0;
}

When compiled with Visual Studio 2013 and executed, this code leaks up to 200 MB of memory per second (!).

Modifying the above code only slightly makes the leak disappear entirely. For instance:

  • If the program is not linked against the MKL library (which is obviously not needed here), there is no leak.
  • If I tell OpenMP to use only one thread (i.e. I set the environment variable OMP_NUM_THREADS to 1), there is no leak.
  • If I comment out the line #pragma omp parallel, there is no leak.
  • If I don't stop and restart the thread (i.e. I remove the first foo.stop() and the second foo.start()), there is no leak.

Am I doing something wrong here, or am I missing something?

oLen
  • I'm not sure about this, but I think that combining different threading models/paradigms is generally not a good idea. For instance, on Linux, both OpenMP and C++11 threads internally use the Pthreads library and might therefore interfere with each other. Anyway, could you please specify what you mean by leaking? How do you observe it? – Daniel Langr Jan 15 '16 at 16:56
  • Also, I believe you should use `atomic calculate{false};` instead of `atomic calculate=false;`, since `std::atomic` has deleted copy constructor and that is required for copy initialization even in the case of optimizing out the temporary. – Daniel Langr Jan 15 '16 at 17:07
  • As a general rule of thumb, don't combine OpenMP with *anything*. – Mysticial Jan 15 '16 at 17:28
  • @DanielLangr I see in the Windows task manager that the memory for the program increases very rapidly – oLen Jan 15 '16 at 17:42
  • @DanielLangr Thanks, I edited the initialization of `calculate` – oLen Jan 15 '16 at 17:42
  • @Mysticial I asked about this in a previous post, but didn't really get a good answer... http://stackoverflow.com/questions/34316191/openmp-and-c11-multithreading – oLen Jan 15 '16 at 17:44
  • Observation after using OpenMP extensively: never ever break one of the ***implicit*** expectations of the spec. In your code the "forking" does not happen on the "main" thread and is therefore undefined from OpenMP's POV... (see: http://stackoverflow.com/questions/13197510/why-do-c11-threads-become-unjoinable-when-using-nested-openmp-pragmas for one example of what will happen if you break the [implicit] expectations of OpenMP.) – MFH Jan 15 '16 at 22:31

1 Answer


MKL's parallel (default) driver is built against Intel's OpenMP runtime, while MSVC compiles OpenMP applications against its own runtime, which is built on top of the Win32 ThreadPool API. The two most likely don't play nicely together. It is only safe to use the parallel MKL driver with OpenMP code built using the Intel C/C++/Fortran compilers.

It should be fine if you link your OpenMP code with the serial driver of MKL. That way, you may call MKL from multiple threads at the same time and get concurrent serial instances of MKL. Whether n concurrent serial MKL calls are slower than, comparable to, or faster than a single threaded MKL call on n threads likely depends on the kind of computation and the hardware.
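For illustration, the two link lines on Windows typically differ only in the threading-layer library. The exact library names below depend on your MKL version and chosen interface layer (this sketch assumes the LP64 interface), so verify them against Intel's MKL link-line advisor:

```shell
rem Parallel (threaded) MKL driver - pulls in Intel's OpenMP runtime:
link main.obj mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

rem Serial MKL driver - no OpenMP runtime involved:
link main.obj mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib
```

Swapping mkl_intel_thread.lib (plus libiomp5md.lib) for mkl_sequential.lib is what selects the serial driver.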

Note that Microsoft no longer supports its own OpenMP runtime. MSVC's OpenMP support is stuck at version 2.0, which is more than a decade older than the current specification. There are probably bugs in the runtime (and there are bugs in the compiler's OpenMP support itself) and those are not likely to get fixed. Microsoft doesn't want you to use OpenMP and would rather have you use its own Parallel Patterns Library instead. But PPL is not portable to other platforms (e.g. Linux), therefore you should really be using Intel Threading Building Blocks (TBB). If you want quality OpenMP support under Windows, use the Intel compiler or one of the GCC ports. (I don't work for Intel)

Hristo Iliev
  • The PPL is "source code portable" as its interface is a subset of the TBB! – MFH Jan 16 '16 at 11:54
  • This is simply not true. The two libraries have similar APIs, trying to be as close as possible to the sequential STL algorithms, but PPL is not a subset of TBB and they are not fully compatible on source level. Immediate example: `concurrency::parallel_transform` from PPL. More [here](https://software.intel.com/en-us/node/506337). – Hristo Iliev Jan 16 '16 at 21:07
  • Thanks for the useful answer. I'm still not so sure how I'm going to deal with the whole thing yet, as OpenMP was really practical so far for the computationally heavy parts, but at least I start to understand where the problem is... I'll definitely have a look at TBB and try to estimate whether it can be worthwhile to switch to it. – oLen Jan 18 '16 at 16:52