I'm writing multi-threaded C++ code. While testing the overhead of different mutex locks, I found that the thread-unsafe code seems to yield the correct result when compiled with the Release configuration in Visual Studio, while running much faster than the code with the mutex lock. With the Debug configuration, however, the result is what I expected. I was wondering: is it the compiler that solved this, or is it just that the code compiled in the Release configuration runs so fast that two threads never access the memory at the same time?

My test code is pasted below.

#include <windows.h>
#include <iostream>
#include <mutex>
#include <vector>
using namespace std;

class Mutex {
public:
    unsigned long long _data;

    bool tryLock() {
        return mtx.try_lock();
    }
    inline void Lock() {
        mtx.lock();
    }
    inline void Unlock() {
        mtx.unlock();
    }
    void safeSet(const unsigned long long &data) {
        Lock();
        _data = data;
        Unlock();
    }
    Mutex& operator++() {
        Lock();
        _data++;
        Unlock();
        return (*this);
    }
    Mutex operator++(int) {
        Mutex tmp = (*this); // note: copies _data without holding the lock
        Lock();
        _data++;
        Unlock();
        return tmp;
    }
    Mutex() {
        _data = 0;
    }
private:
    std::mutex mtx;
    Mutex(Mutex& cpy) {
        _data = cpy._data;
    }
} val;

static DWORD64 val_unsafe = 0;
DWORD WINAPI safeThreads(LPVOID lParam) {
for (int i = 0; i < 655360; i++) {
    ++val;
}
return 0;
}
DWORD WINAPI unsafeThreads(LPVOID lParam) {
for (int i = 0; i < 655360; i++) {
    val_unsafe++;
}
return 0;
}

int main()
{
val._data = 0;
vector<HANDLE> hThreads;
LARGE_INTEGER freq, time1, time2;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&time1);
for (int i = 0; i < 32; i++) {
    hThreads.push_back( CreateThread(0, 0, safeThreads, 0, 0, 0));
}
for (HANDLE handle : hThreads)
{
    WaitForSingleObject(handle, INFINITE);
}
QueryPerformanceCounter(&time2);
cout << time2.QuadPart - time1.QuadPart << endl;
hThreads.clear();
QueryPerformanceCounter(&time1);

for (int i = 0; i < 32; i++) {
    hThreads.push_back(CreateThread(0, 0, unsafeThreads, 0, 0, 0));
}
for (HANDLE handle : hThreads)
{
    WaitForSingleObject(handle, INFINITE);
}
QueryPerformanceCounter(&time2);
cout << time2.QuadPart - time1.QuadPart << endl;

hThreads.clear();
cout << val._data << endl << val_unsafe << endl;
cout << freq.QuadPart << endl;
return 0;
}
– Bill Sun

1 Answer

The standard doesn't let you assume that code is thread safe by default. Your code nevertheless gives the correct result when compiled in release mode for x64.

But why?

If you look at the assembler file generated for your code, you'll find that the optimizer simply collapsed the loop through constant propagation. So instead of looping 655360 times, it just adds a constant to your counter once:

?unsafeThreads@@YAKPEAX@Z PROC              ; unsafeThreads, COMDAT
; 74   :    for (int i = 0; i < 655360; i++) {
    add QWORD PTR ?val_unsafe@@3_KA, 655360 ; 000a0000H   <======= HERE 
; 75   :        val_unsafe++;
; 76   :    }
; 77   :    return 0;
    xor eax, eax                             
; 78   : }

In this situation, with a single and very fast instruction in each thread, a data race is much less probable: most likely one thread has already finished before the next one is launched.

How can you see the expected result in your benchmark?

If you want to prevent the optimizer from collapsing your test loops, you need to declare `_data` and `val_unsafe` as `volatile`. You'll then notice that the unsafe value is no longer correct due to data races. Running my own tests with this modified code, I get the correct value for the safe version, and always different (and wrong) values for the unsafe version. For example:

safe time:5672583
unsafe time:145092                   // <=== much faster
val:20971520
val_unsafe:3874844                   // <=== OUCH !!!!
freq: 2597654
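
For reference, here is a minimal sketch of the two changed declarations; everything else in the question's code stays the same:

// In class Mutex: the shared counter becomes volatile
volatile unsigned long long _data;

// At global scope: the unsafe counter becomes volatile too
static volatile DWORD64 val_unsafe = 0;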

Want to make your unsafe code safe?

If you want to make your unsafe code safe, but without using an explicit mutex, you could just make `val_unsafe` an atomic. The result will be platform dependent (the implementation could very well introduce a mutex for you), but on the same machine as above, with MSVC 2015 in release mode, I get:

safe time:5616282
unsafe time:798851                    // still much faster (6 to 7 times on average)
val:20971520
val_unsafe:20971520                   // but always correct
freq: 2597654

The only thing that you then still have to do is rename the atomic version of the variable from `val_unsafe` to `val_also_safe` ;-)
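
As a minimal sketch of that atomic variant (a drop-in change to the question's code, with the rename applied; the rest of the program is unchanged):

#include <atomic>  // add to the question's includes

static std::atomic<unsigned long long> val_also_safe{ 0 };  // replaces: static DWORD64 val_unsafe = 0;

DWORD WINAPI unsafeThreads(LPVOID lParam) {
    for (int i = 0; i < 655360; i++) {
        val_also_safe++;  // atomic read-modify-write on the shared counter
    }
    return 0;
}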

– Christophe
  • It would be interesting to compare the assembly code for explicitly using a mutex vs using atomic on various platforms. – mgarey Jan 01 '17 at 16:50
  • @mgarey indeed. The atomic version uses an atomic [LOCK XADD instruction](http://stackoverflow.com/a/8891781/3723423), whereas the mutex version relies on two library calls (`__imp__Mtx_lock`/`__imp__Mtx_unlock`), each of which requires loading the address of the mutex as a parameter, so it's clearly a couple of instructions more (in this case). – Christophe Jan 01 '17 at 17:18
  • That explains the speed difference for me. I suspect that if the atomic code was more complex than a simple add instruction (doing some more computations, perhaps modifying or reading a shared resource), then mutexes would have to be used. – mgarey Jan 01 '17 at 17:21
  • @mgarey Yes, certainly. In fact, lock-free operation isn't even guaranteed for `long long` on all platforms. This is why the standard offers [`is_lock_free()`](http://www.cplusplus.com/reference/atomic/atomic/is_lock_free/) to find out: it's true for `long long` on MSVC2015 x64, but it seems to be false for slightly larger structs, for example one of 10 bytes. – Christophe Jan 01 '17 at 17:41
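
A quick way to check this on your own platform is a small probe like the following (a minimal sketch; `TenBytes` is just an illustrative size, and the results are implementation-dependent):

#include <atomic>
#include <iostream>

struct TenBytes { char b[10]; };  // slightly larger than a native register

int main() {
    std::atomic<long long> smallCounter{ 0 };
    std::atomic<TenBytes> largeBlob{};
    std::cout << smallCounter.is_lock_free() << std::endl;  // 1 on MSVC2015 x64
    std::cout << largeBlob.is_lock_free() << std::endl;     // typically 0 for this size
    return 0;
}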