First you need to decide what it is you're trying to achieve, and what ways possible the instructions of different threads can interlace an prevent that happening.
The incrementation operator in C ++x
is usually implemented as read the value of i from memory into an register; increment the register; write the value to memory:
r1 ← xglobal
r1 ← r1 + 1
xglobal ← r1
So the value of xglobal is incremented by one.
If you have two threads in parallel, then they can interlace destructively
initial xglobal = 99
r1 ← xglobal
compare r1 100
r2 ← xglobal
compare r2 100
r1 ← r1 + 1 == 100
r2 ← r2 + 1 == 100
xglobal ← r1 == 100
xglobal ← r2 == 100
r1 ← xglobal
compare r1 100
stop
r2 ← xglobal
compare r1 100
stop
final xglobal = 100
So the value of xglobal is incremented by one, despite both threads incrementing it.
( I'm eliding the effects of caching, which mean that the read of a variable in thread 2 can behave as though it took place before a write in thread 1 even if the write by thread 1 happens before the read in by wall clock time. Acquiring and releasing of mutexes in pthreads cause memory barriers which force all reads to behave as though they happened after and writes to behave as if they happened before the acquire or release. )
( the above is equivalent of for ( int r; ( r = x ) < 100; x = r + 1 )
rather than for (; x < 100; x = x + 1 )
which may have an extra read of x and so has another point where threads can interfere )
Similarly, an increment by one thread can destroy the increment of another thread allowing threads to end with i < 100:
initial xglobal = 98
r2 ← xglobal
r1 ← xglobal
compare r1 100
r1 ← r1 + 1 == 99
xglobal ← r1
r1 ← xglobal
compare r1 100
r1 ← r1 + 1 == 100
xglobal ← r1
r1 ← xglobal
compare r1 100
stop
compare r2 100
r2 ← r2 + 1 == 99
xglobal ← r2
...
final xglobal = 99
So the second increment by the left thread is overwritten by the increment by the first, and it would terminate with the global visible value of x < 100.
You probably know all that, and may want to use a mechanism to protect against it.
I say 'may' as your requirements aren't clear - the thread above terminated when x reached 100; the requirements don't say it doesn't say there.
So, since no thread will terminate without writing xglobal ← 100, the requirement may in fact be met without any locking, but x
may be incremented n
*100 times rather than 100 times. ( if the limit was larger than a byte then the writing of x might be non-atomic on some platforms, which could result in an infinite loop if bytes from different threads are mixed together, but for a limit of 100 that won't happen )
One technique is to use a mutex which blocks other threads from running when one thread holds the lock on the mutex. If the lock is acquired before xglobal is read, and not released until xglobal is written, then the reads and writes of the thread cannot interlace.
initial xglobal = 98
lock (mutex)
mutex locked
lock(mutex)
blocked
r1 ← xglobal
compare r1 100
r1 ← r1 + 1 == 99
xglobal ← r1
release ( mutex )
mutex locked
r2 ← xglobal
compare r2 100
r2 ← r2 + 1 == 100
xglobal ← r2
release ( mutex )
lock (mutex)
mutex locked
r1 ← xglobal
compare r1 100
release ( mutex )
stop
...
final xglobal = 100
Outside of pthreads, you might want to use your platform's compare-and-swap operation ( __sync_val_compare_and_swap
in gcc ) which takes an address an old value and a new value, and atomically sets the memory at the address to the new value if it was equal to the old value. This lets you write the logic as:
for ( int v = 0; v < 100; ) {
int x_prev = __sync_val_compare_and_swap ( &x, v, v + 1 );
// if the CAS succeeds, the value of x has been set to is x_prev + 1
// otherwise, try again from current last value
if ( x_prev == v )
v = x_prev + 1;
else
v = x_prev;
}
So if
initial xglobal = 98
initial v1 = 0
initial v2 = 0
cmp v1 100
x_prev1 ← CASV ( xglobal, v1, v1 + 1 ) = 98 ( set fails with x == 98 )
cmp v2 100
x_prev2 ← CASV ( xglobal, v1, v1 + 1 ) = 98 ( set fails with x == 98 )
v1 ← x_prev1 = 98 // x_prev != v
v2 ← x_prev2 = 98
cmp v2 100
x_prev2 ← CASV ( xglobal, v1, v1 + 1 ) = 98 ( set succeeds with x == 99 )
v2 ← x_prev2 + 1 = 99 // as x_prev == v
cmp v1 100
x_prev1 ← CASV ( xglobal, v1, v1 + 1 ) = 99 ( set fails with x == 99 )
v1 ← x_prev1 = 99 // as x_prev != v
cmp v1 100
x_prev1 ← CASV ( xglobal, v1, v1 + 1 ) = 99 ( set succeeds with x == 100)
v1 ← x_prev1 + 1 = 100 // as x_prev == v
cmp v2 100
x_prev2 ← CASV ( xglobal, v1, v1 + 1 ) = 100 ( set fails with x == 100 )
v2 ← x_prev2 = 100 // as x_prev != v
cmp v1 100
cmp v2 100
stop
stop
On each loop, xglobal will atomically be set to the value of r1 + 1 if and only if its previous value was r1; if not, r1 will be set to the value of xglobal tested during the CASV operation. This reduces the amount of time locks are held on most implementations ( though it still requires locking the memory bus for the duration of the CAS operation, only those operations will be serialised. As performing the CAS is expensive on multi-cores, it probably won't be much better for such a simple case as this. )