C , C++ unsynchronized threads returning a strange result

Question

Okay, i have this question in one regarding threads.

there are two unsynchronized threads running simultaneously and using a global resource "int num" 1st:

    void Thread()
{ 
    int i;
    for ( i=0 ; i < 100000000; i++ )
    {
        num++;
        num--;
    }
}

2nd:

    void Thread2()
{ 
    int j;
    for ( j=0 ; j < 100000000; j++ )
    {
        num++;
        num--;      
    }
}

The question states: what are the possible values of the variable "num" at the end of the program. now i would say 0 will be the value of num at the end of the program but, try and run this code and you will find out that the result is quite random, and i can't understand why?

The full code:

 #include <windows.h>
    #include <process.h>
    #include <stdio.h>

    int static num=0;

   void Thread()
    { 
        int i;
        for ( i=0 ; i < 100000000; i++ )
        {
            num++;
            num--;
        }
    }

   void Thread2()
    { 
        int j;
        for ( j=0 ; j < 100000000; j++ )
        {
            num++;
            num--;      
        }
    }

    int main()
    {
        long handle,handle2,code,code2;
        handle=_beginthread( Thread, 0, NULL );
        handle2=_beginthread( Thread2, 0, NULL );

        while( (GetExitCodeThread(handle,&code)||GetExitCodeThread(handle2,&code2))!=0 );

        TerminateThread(handle, code );
        TerminateThread(handle2, code2 );

        printf("%d ",num);
        system("pause"); 
    }

@Marlon: Not true. See @Thomas's answer. – Falmarri Feb 06 '11 at 09:14 — Falmarri, Feb 06 '11 at 09:14

Thomas · Accepted Answer · 2011-02-06T09:20:51.913

21

num++ and num-- don't have to be atomic operations. To take num++ as an example, this is probably implemented like:

int tmp = num;
tmp = tmp + 1;
num = tmp;

where tmp is held in a CPU register.

Now let's say that num == 0, both threads try to execute num++, and the operations are interleaved as follows:

Thread A        Thread B
int tmp = num;
tmp = tmp + 1;
                int tmp = num;
                tmp = tmp + 1;
num = tmp;
                num = tmp;

The result at the end will be num == 1 even though it should have been incremented twice. Here, one increment is lost; in the same way, a decrement could be lost as well.

In pathological cases, all increments of one thread could be lost, resulting in num == -100000000, or all decrements of one thread could be lost, resulting in num == +100000000. There may even be more extreme scenarios lurking out there.

Then there's also other business going on, because num isn't declared as volatile. Both threads will therefore assume that the value of num doesn't change, unless they are the one changing it. This allows the compiler to optimize away the entire for loop, if it feels so inclined!

edited Feb 06 '11 at 09:20

answered Feb 06 '11 at 09:13

Thomas

174,939
50
355
478

Yep, see also http://stackoverflow.com/questions/3122382/using-volatile-long-as-an-atomic – wmeyer Feb 06 '11 at 09:15
2

`volatile` does *NOT* have atomic access semantics in C and C++. – Axel Gneiting Feb 06 '11 at 11:08
7

Of course not. Did I imply that it does? I merely stated that the absence of `volatile` makes the program even more wrong than it already is. – Thomas Feb 06 '11 at 11:18
@axel no but a compiler might assume that it can hold the variable in a register rather than writing it back to memory until absolutely necessary. volatile tells the compiler to always write any assignment to memory, which helps, even if it is not atomic access. – Feb 06 '11 at 11:23
1

Just because it's slightly less likely to fail with volatile, it doesn't make things better. – Axel Gneiting Feb 06 '11 at 11:34
2

@Axel the argument is that fixing the problem requires using `volatile` *in addition to* doing proper synchronization (assuming that we care about the value of `i` during the process rather than just at the end, and assuming that we care about having a program that actually performs the work described by the two loops). Synchronizing it properly and leaving out `volatile` might still be no good because of the optimizations. (Unless, of course, the presence of the calls to synchronization primitives interferes with those optimizations - which I imagine it will, actually.) – Karl Knechtel Feb 06 '11 at 17:42
2

@Karl: The synchronisation primitives must imply a compiler barrier, if they are to be correct, so `volatile` is unnecessary. – caf Feb 06 '11 at 22:36
@caf: depends on what synchronization primitives you use. If you use an atomic increment/decrement, volatile is not needed. If you use a monitor or citical section on the other hand, volatile is needed. – Chris Dodd Feb 07 '11 at 17:19
Considering that num is declared outside the two threads I don't see it as obvious that the variable produced would be placed in a register in the manner shown in your answer. It might just as well be "inc num" or "add num,1" (on x86) or the like. Your example might just as well have resulted in the compiler skipping the loop since, in the compiler's "mind," one ++ and one -- cancel each other out which it didn't conclude. A bus lock might be applied to one of my example instructions to atomize the operation but that would slow the app to a snail's pace (on the order of 10-100X, at least). – Olof Forshell Feb 07 '11 at 18:28
@Chris: All reasonable implementations of monitors and critical sections are done such that volatile is not necessary. – tony Feb 08 '11 at 05:05
Olof, true, even with locks, a good compiler would probably optimize num++; num--; away as a no op. Which is actually fine in this case. Of course, declaring num as volatile will force the compiler to perform both the increment and decrement. – Feb 12 '11 at 19:16

R.. GitHub STOP HELPING ICE · Answer 2 · 2011-02-06T11:51:59.740

2

The possible values for num include all possible int values, plus floating point values, strings, and jpegs of nasal demons. Once you invoke undefined behavior, all bets are off.

More specifically, modifying the same object from multiple threads without synchronization results in undefined behavior. On most real-world systems, the worst effects you see will probably be missing or double increments or decrements, but it could be much worse (memory corruption, crashing, file corruption, etc.). So just don't do it.

The next upcoming C and C++ standards will include atomic types which can be safely accessed from multiple threads without any synchronization API.

edited Feb 06 '11 at 11:51

answered Feb 06 '11 at 11:02

R.. GitHub STOP HELPING ICE

208,859
35
376
711

The existing answer did not mention undefined behavior whatsoever. – R.. GitHub STOP HELPING ICE Feb 06 '11 at 11:30
3

Technically, I'm not sure this even qualifies as UB, since the (current) C++ standard never even mentions threading. But on the "all bets are off" part, you're certainly right. – Thomas Feb 06 '11 at 12:58
I don't know the relevant Windows documentation, but I would imagine MS specifies it as UB. Certainly POSIX specifies it as UB. – R.. GitHub STOP HELPING ICE Feb 06 '11 at 13:52
The moment you create a second thread, you're in UB-land as far as current C++ is concerned... – bdonlan Feb 06 '11 at 13:53
2

@bdonlan: No, you're in implementation-defined behavior, or more usefully, *higher-level-standard-defined* behavior (i.e. behavior which all implementations purporting to conform to a higher-level standard must define). In any case, an answer to a question about threads is not meaningful until you assume some particular threading standard (e.g. POSIX) or implementation (e.g. Windows). – R.. GitHub STOP HELPING ICE Feb 06 '11 at 19:05

Olof Forshell · Answer 3 · 2011-09-12T11:43:10.337

You speak of threads running simultaneously which actually might not be the case if you only have one core in your system. Let's assume that you have more than one.

In the case of multiple devices having access to main memory either in the form of CPUs or bus-mastering or DMA they must be synchronized. This is handled by the lock prefix (implicit for the instruction xchg). It accesses a physical wire on the system bus which essentially signals all devices present to stay away. It is, for example, part of the Win32 function EnterCriticalSection.

So in the case of two cores on the same chip accessing the same position the result would be undefined which may seem strange considering some synchronization should occur since they share the same L3 cache (if there is one). Seems logical, but it doesn't work that way. Why? Because a similar case occurs when you have the two cores on different chips (i e don't have a shared L3 cache). You can't expect them to be synchronized. Well you can but consider all the other devices having access to main memory. If you plan to synchronize between two CPU chips you can't stop there - you have to perform a full-blown synchronization that blocks out all devices with access and to ensure a successful synchronization all the other devices need time to recognize that a synchronization has been requested and that takes a long time, especially if a device has been granted access and is performing a bus-mastering operation which must be allowed to complete. The PCI bus will perform an operation every 0.125 us (8 MHz) and considering that your CPUs run at 400 times you're looking at A LOT of wait states. Then consider that several PCI clock cycles might be required.

You could argue that a medium type (memory bus only) lock should exist but this means an additional pin on every processor and additional logic in every chipset just to handle a case which is really a misunderstanding on the programmer's part. So it's not implemented.

To sum it up: a generic synchronization that would handle your situation would render your PC useless due to it always having to wait for the last device to check in and ok the synchronization. It is a better solution to let it be optional and only insert wait states when the developer has determined that it is absolutely necessary.

This was so much fun that I played a little with the example code and added spinlocks to see what would happen. The spinlock components were

// prototypes

char spinlock_failed (spinlock *);
void spinlock_leave (spinlock *);

// application code

while (spinlock_failed (&sl)) ++n;
++num;
spinlock_leave (&sl);

while (spinlock_failed (&sl)) ++n;
--num;
spinlock_leave (&sl);

spinlock_failed was constructed around the "xchg mem,eax" instruction. Once it failed (at not setting the spinlock <=> succeeded at setting it) spinlock_leave would just assign to it with "mov mem,0". The "++n" counts the total number of retries.

I changed the loop to 2.5 million (because with two threads and two spinlocks per loop I get 10 million spinlocks, nice and easy to round with) and timed the sequences with the "rdtsc" count on a dual-core Athlon II M300 @ 2GHz and this is what I found

Running one thread without timing (except for the main loop) and locks (as in the original example) 33748884 <=> 16.9 ms => 13.5 cycles/loop.
Running one thread i e no other core trying took 210917969 cycles <=> 105.5 ms => 84,4 cycles/loop <=> 0.042 us/loop. The spinlocks required 112581340 cycles <=> 22.5 cycles per spinlocked sequence. Still, the slowest spinlock required 1334208 cycles: that's 667 us or only 1500 every second.

So, the additon of spinlocks unaffected by another CPU added several hundred percent to the total execution time. The final value in num was 0.

Running two threads without spinlocks took 171157957 cycles <=> 85.6 ms => 68.5 cycles/loop. Num contained 10176.
Two threads with spinlocks took 4099370103 <=> 2049 ms => 1640 cycles/loop <=> 0.82 us/loop. The spinlocks required 3930091465 cycles => 786 cycles per spinlocked sequence. The slowest spinlock required 27038623 cycles: thats 13.52 ms or only 74 every second. Num contained 0.

Incidentally the 171157957 cycles for two threads without spinlocks compares very favorably to two threads with spinlocks where the spinlock time has been removed: 4099370103-3930091465 = 169278638 cycles.

For my sequence the spinlock competition caused 21-29 million retries per thread which comes out to 4.2-5.8 retries per spinlock or 5.2-6.8 tries per spinlock. Addition of spinlocks caused an execution time penalty of 1927% (1500/74-1). The slowest spinlock required 5-8% of all tries.

Were your locks spinning on `xchg`, or a load before trying to `xchg`? That's the normal advice, but I'm not sure if it would make much difference with the lock being held for such a short time, and only two threads competing for it. — Peter Cordes, Apr 19 '16 at 18:36

score 0 · Answer 4 · answered Feb 07 '11 at 06:14

0

As Thomas said, the results are unpredictable because your increment and decrement are non-atomic. You can use InterlockedIncrement and InterlockedDecrement -- which are atomic -- to see a predictable result.

Interlocked Variable Access (MSDN)

answered Feb 07 '11 at 06:14

Jeff

2,023
2
14
10

C , C++ unsynchronized threads returning a strange result

4 Answers4

Linked