You speak of threads running simultaneously which actually might not be the case if you only have one core in your system. Let's assume that you have more than one.
In the case of multiple devices having access to main memory either in the form of CPUs or bus-mastering or DMA they must be synchronized. This is handled by the lock prefix (implicit for the instruction xchg). It accesses a physical wire on the system bus which essentially signals all devices present to stay away. It is, for example, part of the Win32 function EnterCriticalSection.
So in the case of two cores on the same chip accessing the same position the result would be undefined which may seem strange considering some synchronization should occur since they share the same L3 cache (if there is one). Seems logical, but it doesn't work that way. Why? Because a similar case occurs when you have the two cores on different chips (i e don't have a shared L3 cache). You can't expect them to be synchronized. Well you can but consider all the other devices having access to main memory. If you plan to synchronize between two CPU chips you can't stop there - you have to perform a full-blown synchronization that blocks out all devices with access and to ensure a successful synchronization all the other devices need time to recognize that a synchronization has been requested and that takes a long time, especially if a device has been granted access and is performing a bus-mastering operation which must be allowed to complete. The PCI bus will perform an operation every 0.125 us (8 MHz) and considering that your CPUs run at 400 times you're looking at A LOT of wait states. Then consider that several PCI clock cycles might be required.
You could argue that a medium type (memory bus only) lock should exist but this means an additional pin on every processor and additional logic in every chipset just to handle a case which is really a misunderstanding on the programmer's part. So it's not implemented.
To sum it up: a generic synchronization that would handle your situation would render your PC useless due to it always having to wait for the last device to check in and ok the synchronization. It is a better solution to let it be optional and only insert wait states when the developer has determined that it is absolutely necessary.
This was so much fun that I played a little with the example code and added spinlocks to see what would happen. The spinlock components were
// prototypes
char spinlock_failed (spinlock *);
void spinlock_leave (spinlock *);
// application code
while (spinlock_failed (&sl)) ++n;
++num;
spinlock_leave (&sl);
while (spinlock_failed (&sl)) ++n;
--num;
spinlock_leave (&sl);
spinlock_failed was constructed around the "xchg mem,eax" instruction. Once it failed (at not setting the spinlock <=> succeeded at setting it) spinlock_leave would just assign to it with "mov mem,0". The "++n" counts the total number of retries.
I changed the loop to 2.5 million (because with two threads and two spinlocks per loop I get 10 million spinlocks, nice and easy to round with) and timed the sequences with the "rdtsc" count on a dual-core Athlon II M300 @ 2GHz and this is what I found
- Running one thread without timing
(except for the main loop) and locks
(as in the original example) 33748884
<=> 16.9 ms => 13.5 cycles/loop.
- Running one thread i e no other core
trying took 210917969 cycles <=>
105.5 ms => 84,4 cycles/loop <=> 0.042 us/loop. The spinlocks required 112581340 cycles <=> 22.5 cycles per
spinlocked sequence. Still, the
slowest spinlock required 1334208
cycles: that's 667 us or only 1500
every second.
So, the additon of spinlocks unaffected by another CPU added several hundred percent to the total execution time. The final value in num was 0.
- Running two threads without spinlocks
took 171157957 cycles <=> 85.6 ms =>
68.5 cycles/loop. Num contained 10176.
- Two threads with spinlocks took
4099370103 <=> 2049 ms => 1640
cycles/loop <=> 0.82 us/loop. The
spinlocks required 3930091465 cycles
=> 786 cycles per spinlocked sequence. The slowest spinlock
required 27038623 cycles: thats 13.52
ms or only 74 every second. Num
contained 0.
Incidentally the 171157957 cycles for two threads without spinlocks compares very favorably to two threads with spinlocks where the spinlock time has been removed: 4099370103-3930091465 = 169278638 cycles.
For my sequence the spinlock competition caused 21-29 million retries per thread which comes out to 4.2-5.8 retries per spinlock or 5.2-6.8 tries per spinlock. Addition of spinlocks caused an execution time penalty of 1927% (1500/74-1). The slowest spinlock required 5-8% of all tries.