12

I have a fairly complex program that runs into strange behavior when build with OpenMP in MSVC 2010 Debug mode. I have tried my best to construct the following minimal working example (though it is not really minimal) which minic the structure of the real program.

#include <vector>
#include <cassert>

// A class take points to the whole collection and a position Only allow access
// to the elements at that posiiton. It provide read-only access to query some
// information about the whole collection
class Element
{
    public :

    Element (int i, std::vector<double> *src) : i_(i), src_(src) {}

    int i () const {return i_;}
    int size () const {return src_->size();}

    double src () const {return (*src_)[i_];}
    double &src () {return (*src_)[i_];}

    private :

    const int i_;
    std::vector<double> *const src_;
};

// A Base class for dispatch
template <typename Derived>
class Base
{
    protected :

    void eval (int dim, Element elem, double *res)
    {
        // Dispatch the call from Evaluation<Derived>
        eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
    }

    private :

    // Resolve to Derived non-static member eval(...)
    template <typename D>
    void eval_dispatch(int dim, Element elem, double *res,
            void (D::*) (int, Element, double *))
    {
#ifndef NDEBUG // Assert that this is a Derived object
        assert((dynamic_cast<Derived *>(this)));
#endif
        static_cast<Derived *>(this)->eval(dim, elem, res);
    }

    // Resolve to Derived static member eval(...)
    void eval_dispatch(int dim, Element elem, double *res,
            void (*) (int, Element, double *))
    {
        Derived::eval(dim, elem, res); // Point (3)
    }

    // Resolve to Base member eval(...), Derived has no this member but derived
    // from Base
    void eval_dispatch(int dim, Element elem, double *res,
            void (Base::*) (int, Element, double *))
    {
        // Default behavior: do nothing
    }
};

// A middle-man who provides the interface operator(), call Base::eval, and
// Base dispatch it to possible default behavior or Derived::eval
template <typename Derived>
class Evaluator : public Base<Derived>
{
    public :

    void operator() (int N , int dim, double *res)
    {
        std::vector<double> src(N);
        for (int i = 0; i < N; ++i)
            src[i] = i;

#pragma omp parallel for default(none) shared(N, dim, src, res)
        for (int i = 0; i < N; ++i) {
            assert(i < N);
            double *r = res + i * dim;
            Element elem(i, &src);
            assert(elem.i() == i); // Point (1)
            this->eval(dim, elem, r);
        }
    }
};

// Client code, who implements eval
class Implementation : public Evaluator<Implementation>
{
    public :

    static void eval (int dim, Element elem, double *r)
    {
        assert(elem.i() < elem.size()); // This is where the program fails Point (4)
        for (int d = 0; d != dim; ++d)
            r[d] = elem.src();
    }
};

int main ()
{
    const int N = 500000;
    const int Dim = 2;
    double *res = new double[N * Dim];
    Implementation impl;
    impl(N, Dim, res);
    delete [] res;

    return 0;
}

The real program does not have vector etc. But the Element, Base, Evaluator and Implementation captures the basic structure of the real program. When build in Debug mode, and running the debugger, the assertion fails at Point (4).

Here is some more details of the debug informations, by viewing the call stacks,

At entering Point (1), the local i has value 371152, which is fine. The variable elem does not shown up in the frame, which is a little strange. But since the assertion at Point (1) does not faile, I guess it is fine.

Then, crazy things happened. The call to eval by Evaluator resolves to its base class, and so Point (2) was exectuted. At this point, the debugers shows that the elem has i_ = 499999, which is no longer the i used to create elem in Evaluator before passing it by value to Base::eval. The next point, it resolves to Point (3), this time, elem has i_ = 501682, which is out of range, and this is the value when the call is directed to Point (4) and failed the assertion.

It looks like whenever Element object is passed by value, the value of its members are changed. Rerun the program multiple times, similar behaviors happens though not always reproducible. In the real program, this class is designed to like an iterator, which iterate over a collection of particles. Though the thing it iterate is not exaclty like a container. But anyway, the point is that it is small enough to be efficiently passed by value. And therefore, the client code, knows that it has its own copy of Element instead of some reference or pointer, and does not need to worry about thread-safe (much) as long as he sticks with Element's interface, which only provide write access to a single position of the whole collection.

I tried the same program with GCC and Intel ICPC. Nothing un-expected happens. And in the real program, correct results where produced.

Did I used OpenMP wrongly somewhere? I thought that the elem created at about Point (1) shall be local to the loop body. In addition, in the whole program, no value bigger than N was produced, so where does the those new value comes from?

Edit

I looked more carefully into the debugger, it shows that while elem.i_ was changed when elem was passed by value, the pointer elem.src_ does not change with it. It has the same value (of the memory address) after passed by value

Edit: Compiler flags

I used CMake to generate the MSVC solution. I have to confess I have no idea how to use MSVC or Windows in general. The only reason I am using it is that I know a lot of people use it so I want to test my library against it to workaround any problems.

The CMake generated project, using Visual Studio 10 Win64 target, the compiler flags appears to be /DWIN32 /D_WINDOWS /W3 /Zm1000 /EHsc /GR /D_DEBUG /MDd /Zi /Ob0 /Od /RTC1 And here is the command line found in Property Pages-C/C++-Command Line /Zi /nologo /W3 /WX- /Od /Ob0 /D "WIN32" /D "_WINDOWS" /D "_DEBUG" /D "CMAKE_INTDIR=\"Debug\"" /D "_MBCS" /Gm- /EHsc /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fp"TestOMP.dir\Debug\TestOMP.pch" /Fa"Debug" /Fo"TestOMP.dir\Debug\" /Fd"C:/Users/Yan Zhou/Dropbox/Build/TestOMP/build/Debug/TestOMP.pdb" /Gd /TP /errorReport:queue

Is there anything suspecious here?

Yan Zhou
  • 2,709
  • 2
  • 22
  • 37
  • Sometimes strange things can happen when some code is compiled as Release and some is compiled as Debug. Is the OpenMP you're using compiled with the same flags/debug stuff as your program? – Seth Carnegie Jul 13 '12 at 23:04
  • I am not sure about the question. I don't usually use msvc except for testing. However the code above was a single file program. So I guess whatever flag is used, it is used for the whole program. Is there special option for debug mode openmp? I used cmake to find openmp flag, which turns put to be /openmp. @SethCarnegie – Yan Zhou Jul 13 '12 at 23:15
  • are you compiling OpenMP with that file, or using a library that was compiled at another time? – Seth Carnegie Jul 13 '12 at 23:16
  • No library involved in the MWE. Just that a single file containing the code in the question and compiled with openmp flag. And this test program fails the same way as the real program. In the real program, only header only libraries was used. So basically all program was compiled at the time with the same flag. The only thing compiled at another time is the MSVC openmp runtime – Yan Zhou Jul 13 '12 at 23:19
  • MSVC runtime implements OpenMP parallel regions using thread pools. I could be that the debugger is not happy with that. – Hristo Iliev Jul 14 '12 at 09:39
  • But even without the debugger, just run the program in a powershell, or MinGW shell, it crashes as well, as long as the program is built with in Debug mode. Further, if I built it with any optimization though not define `NDEBUG`, such that the `assert` is still tested, the program does not crash – Yan Zhou Jul 14 '12 at 09:46
  • What's more confusing to me is that, since the when `elem` being copied, its `i_` member changes at each of the first two passing by value. If it is due to some thread-unsafety or memory inconsistency, then how do we explain that the second member, the pointer `scr_` does not suffer the same problem? – Yan Zhou Jul 14 '12 at 09:47
  • Sorry, I got your test program in my MSVC 2010 and it compiles and debugs perfectly in Debug mode with OpenMP enabled, both with x86 and x64 targets. I am unable to reproduce the failed assertion in (4) as well as the wrong value of `i_` in copied objects. Could you provide more information on what changes you make to the default project settings besides enabling OpenMP? – Hristo Iliev Jul 17 '12 at 08:20
  • @HristoIliev Thans for trying this. I have edited the question with regard of some additonal information of the compiler flags – Yan Zhou Jul 17 '12 at 08:40
  • I have updated the answer with two possible solutions to your problem. – Hristo Iliev Jul 18 '12 at 12:09

1 Answers1

9

Apparently the 64-bit OpenMP implementation in MSVC is not compatible with code, compiled without optimisations.

To debug your issue, I've modified your code to save the iteration number to a threadprivate global variable just before the call to this->eval() and then added a check at the beginning of Implementation::eval() to see if the saved iteration number differs from elem.i_:

static int _iter;
#pragma omp threadprivate(_iter)

...
#pragma omp parallel for default(none) shared(N, dim, src, res)
    for (int i = 0; i < N; ++i) {
        assert(i < N);
        double *r = res + i * dim;
        Element elem(i, &src);
        assert(elem.i() == i); // Point (1)
        _iter = i;             // Save the iteration number
        this->eval(dim, elem, r);
    }
}
...

...
static void eval (int dim, Element elem, double *r)
{
    // Check for difference
    if (elem.i() != _iter)
        printf("[%d] _iter=%x != %x\n", omp_get_thread_num(), _iter, elem.i());
    assert(elem.i() < elem.size()); // This is where the program fails Point (4)
    for (int d = 0; d != dim; ++d)
        r[d] = elem.src();
}
...

It appears that randomly the value of elem.i_ becomes a bad mixture of the values passed in the different threads to void eval_dispatch(int dim, Element elem, double *res, void (*) (int, Element, double *)). This happens hunderds of times in each run but you only see it once the value of elem.i_ becomes large enough to trigger the assertion. Sometimes it happens that the mixed value does not exceed the size of the container and then the code completes execution without assertion. Also what you see during the debug session after the assertion is the inability of the VS debugger to cope properly with the multithreaded code :)

This only happens in unoptimised 64-bit mode. It doese not happen in 32-bit code (both debug and release). It also does not happen in 64-bit release code unless optimisations are disabled. It also does not happen if one puts the call to this->eval() in a critical section:

#pragma omp parallel for default(none) shared(N, dim, src, res)
    for (int i = 0; i < N; ++i) {
        ...
#pragma omp critical
        this->eval(dim, elem, r);
    }
}

but doing this would cancel the benefits of OpenMP. This shows that something further down the call chain is performed in unsafe way. I've examined the assembly code but couldn't find the exact reason. I'm really puzzled since MSVC implements the implicit copy constructor of the Element class using simple bitwise copy (it is even inline) and all operations are done on the stack.

This reminds me of the fact that the Sun's (now Oracle's) compiler insists that it should crank up the level of optimisation if one enables OpenMP support. Unfortunately the documentation of /openmp option in MSDN says nothing about possible intereference that might come from the "wrong" optimisation level. This also might be a bug. I should test with another version of VS if I can access one.

Edit: I dug deeper as promised and run the code in Intel Parallel Inspector 2011. It found one data race pattern as expected. Apparently when this line is executed:

this->eval(dim, elem, r);

a temporary copy of elem is created and passed by address to the eval() method as is required by the Windows x64 ABI. And here comes the strange thing: the location of this temporary copy is not on the stack of the funclet that implements the parallel region (MSVC compiler calles it Evaluator$omp$1<Implementation>::operator() by the way) as one would expect but rather its address is taken as the first argument of the funclet. As this argument is one and the same in all threads, it means that the temporary copy that gets further passed to this->eval() is actually shared among all threads, which is ridiculous but is still true as one can easily observe:

...
void eval (int dim, Element elem, double *res)
{
    printf("[%d] In Base::eval()    &elem = %p\n", omp_get_thread_num(), &elem);
    // Dispatch the call from Evaluation<Derived>
    eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
}
...

...
#pragma omp parallel for default(none) shared(N, dim, src, res)
    for (int i = 0; i < N; ++i) {
        ...
        Element elem(i, &src);
        ...
        printf("[%d] In parallel region &elem = %p\n", omp_get_thread_num(), &elem);
        this->eval(dim, elem, r);
    }
}
...

Running this code produces an output similar to this:

[0] Parallel region &elem = 000000000030F348 (a)
[0] Base::eval()    &elem = 000000000030F630
[0] Parallel region &elem = 000000000030F348 (a)
[0] Base::eval()    &elem = 000000000030F630
[1] Parallel region &elem = 000000000292F9B8 (b)
[1] Base::eval()    &elem = 000000000030F630 <---- !!
[1] Parallel region &elem = 000000000292F9B8 (b)
[1] Base::eval()    &elem = 000000000030F630 <---- !!

As expected elem has different addresses in each thread executing the parallel region (points (a) and (b)). But observe that the temporary copy that gets passed to Base::eval() has the same address in each thread. I believe that this is a compiler bug that makes the implicit copy constructor of Element use a shared variable. This could be easily validated by looking at the address passed to Base::eval() - it lies somewhere between the address of N and that of src, i.e. in the shared variables block. Further inspection of the assembly source reveals that indeed the address of the temporary place is passed as argument to the _vcomp_fork() function from vcomp100.dll that implements the fork part of the OpenMP fork/join model.

Since basically there are no compiler options that can influence this behaviour apart from enabling optimisations which leads to Base::eval(), Base::eval_dispatch(), and Implementation::eval() all being inlined and hence no temporary copies of elem are ever made, the only work-arounds that I have found are:

1) Make the Element elem argument to Base::eval() a reference:

void eval (int dim, Element& elem, double *res)
{
    eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
}

This ensures that the local copy of elem in the stack of the funclet that implements the parallel region in Evaluator<Implementation>::operator() is passed and not the shared temporary copy. This gets further passed by value as another temporary copy to Base::eval_dispatch() but it retains its correct value as this new temporary copy is in the stack of Base::eval() and not in the shared variables block.

2) Provide an explicit copy constructor to Element:

Element (const Element& e) : i_(e.i_), src_(e.src_) {}

I would recommend that you go with the explicit copy constructor as it does not require further changes in the source code.

Apparently this behaviour is also present in MSVS 2008. I would have to check if it is also present in MSVS 2012 and possibly file a bug report with MS.

This bug does not show in 32-bit code as there the whole value of each passed by value object is pushed on the call stack and not only a pointer to it.

Hristo Iliev
  • 72,659
  • 12
  • 135
  • 186
  • Thanks for the answer. If I got your idea correctly, you basically observed the same behaviour as I did and we may conclude that this is some problem with MSVC. Regarding the critical region, I also tried to use only one OMP thread and no problems happened. However as far as I can tell, the simple problem does not have any thread-safety problem, also I cannot see any possible race condition. – Yan Zhou Jul 17 '12 at 20:50
  • Yes, basically the problem was there but I did not observe failing assertion the first several times as it doesn't happen all the time. Turning on optimisations solves it. I will still try to reach to the bottom of the problem when time permits as it could also happen to some of our users. – Hristo Iliev Jul 17 '12 at 21:00
  • Thanks for the updated answer. It is very informative and helpful. I think I shall go with the explicit copy constructor. But I am a little concerned if there will be any performance impact. In the actual program, the Element is designed like an iterator and passing it by value efficiently is essential. And I sort of count on fast bit wise copy – Yan Zhou Jul 18 '12 at 12:25
  • In release mode harldy and copy is made as most function calls get inlined. – Hristo Iliev Jul 18 '12 at 12:36