Atomic operations are low-level primitives. They behave essentially like mutexes implemented in hardware, except that only a very small set of operations can be performed atomically.
Consider the following equivalent pseudocode:
mutex global_mutex;
void InterlockedAdd(int& dest, int value) {
    scoped_lock lock(global_mutex);  // held until the function returns
    dest += value;
}
int InterlockedRead(int& src) {
    scoped_lock lock(global_mutex);
    return src;
}
void InterlockedWrite(int& dest, int value) {
    scoped_lock lock(global_mutex);
    dest = value;
}
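The CPU implements these functions as instructions, and they guarantee consistency between threads to varying degrees. The exact semantics depend on the CPU in question. On x86, the atomic read-modify-write instructions (those using the LOCK prefix) are sequentially consistent: they act as if they were issued one at a time, in some single order that every thread agrees on. Plain x86 loads and stores are slightly weaker, because a store may become visible to other cores after a later load has already executed. Sequential consistency has a cost, since other cores that touch the same memory must briefly wait.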
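As a rough illustration, here is how the three pseudocode functions might map onto C++11 `std::atomic`. This is a sketch, not the definition of the Interlocked functions. The helper name `count_concurrently` is made up for this example, and `std::memory_order_seq_cst` (also the default) is spelled out only to make the ordering visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each add 1 to a shared counter
// `iterations` times, then the final value is read back.
int count_concurrently(int threads, int iterations) {
    std::atomic<int> counter{0};
    counter.store(0, std::memory_order_seq_cst);  // plays the role of InterlockedWrite

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                // Plays the role of InterlockedAdd: one indivisible read-modify-write.
                counter.fetch_add(1, std::memory_order_seq_cst);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load(std::memory_order_seq_cst);  // plays the role of InterlockedRead
}
```

With a plain `int` and `counter += 1` the increments from different threads could overlap and be lost (and the program would have a data race); with `fetch_add` every increment is counted.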
As you may surmise, atomic operations can be implemented in terms of mutexes, and vice versa. Usually, though, the hardware provides the atomic operations, and the operating system (or a library) implements mutexes and other synchronization primitives on top of them. One reason is that some algorithms do not need a full mutex and can run "locklessly" (also called lock-free), meaning they rely only on atomic operations for inter-thread consistency.
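Both directions can be sketched in C++. The block below is a minimal illustration under stated assumptions, not how any real operating system does it: `SpinLock` is a toy mutex built from a single `std::atomic_flag` (a real mutex would also put waiting threads to sleep), and `atomic_store_max` is a small lock-free update built on a compare-and-swap loop. The names `SpinLock`, `atomic_store_max` and `sum_with_spinlock` are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy mutex built on top of one atomic flag.
class SpinLock {
public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous value.
        // Keep trying until this thread is the one that flipped it from clear to set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Lock-free update: raise `target` to at least `value` without taking any lock.
void atomic_store_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    // If another thread changed `target` in the meantime, compare_exchange_weak
    // fails, refreshes `seen` with the current value, and the loop re-checks.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
    }
}

// Hypothetical helper: protect a plain int with the SpinLock above.
int sum_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    int total = 0;  // not atomic; only touched while holding `lock`
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &total, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return total;
}
```

The spinlock serializes whole critical sections, while `atomic_store_max` never makes a thread wait on a lock holder: a thread only retries when another thread has made progress in between.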