LOCK prefix of Intel instruction. What is the point?

Question

I read the Intel manual and found that there is a LOCK prefix for instructions, which prevents processors from writing to the same memory location at the same time. I was quite excited about it and guessed it could be used as a hardware mutex, so I wrote a piece of code to try it out. The result was frustrating: LOCK does not support the MOV or LEA instructions. The manual says LOCK only supports ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. What is more, if the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated.

I wonder why there are so many limitations; so many restrictions make LOCK seem useless. I cannot use it to guarantee that a general write operation is free of dirty data or other problems caused by parallelism.

For example, I wrote `++(*p)` in C, where `p` is a pointer to shared memory. The corresponding assembly looks like this:

movl    28(%esp), %eax
movl    (%eax), %eax
leal    1(%eax), %edx
movl    28(%esp), %eax
movl    %edx, (%eax)

I added "lock" before "movl" and "leal", but the processor complains "Invalid Instruction". :-( I guess the only way to serialize the write operations is to use a software mutex, right?

A `movl` to an aligned address is always atomic, so lock would make no difference at all. — Gunther Piez, Jun 16 '12 at 17:44
If these "restrictions" did not exist, the additional uses would all be useless anyway. — harold, Jun 16 '12 at 18:40
BTW, there is a problem with your code: even if each instruction were carried out with a bus lock, the code would still not be thread safe, because the thread could be interrupted between any two instructions. You want the whole instruction block to be run by at most one thread at a time. You need a completely different approach for this: a mutex or similar. To do that, you basically surround the code with locking/unlocking code, which involves instructions like "lock cmpxchg" et al. – Gunther Piez, Jun 17 '12 at 08:41
@drhirsch - I don't think aligned movs are *guaranteed* atomic. It's certainly not atomic on the 8088 (for 16-bit) or 386SX (for 32-bit) operations. — Brian Knoblauch, Jun 18 '12 at 13:34
@BrianKnoblauch: They are guaranteed atomic on all processors since 486, to be more specific. I just assumed that we were talking about processors younger than 20 years. See the Intel System Programming Guide Chapter 8.1.1 "The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically: • Reading or writing a byte • Reading or writing a word aligned on a 16-bit boundary • Reading or writing a doubleword aligned on a 32-bit boundary" etc. — Gunther Piez, Jun 18 '12 at 13:47
If you want to use a `lock`ed instruction as just a memory barrier (on a CPU that doesn't support `MFENCE`, or where `MFENCE` is slower than a locked instruction), you can just `lock addl $0, (%esp)`, which is otherwise a no-op (except clobbering flags) and does its read-modify-write locked cycle on memory that's very likely hot in L1 cache and NOT on another core. Other than that, this question seems to entirely miss the point of the `lock` prefix. It's for atomic read-modify-writes. The full-memory-barrier is just a side-effect, but x86 didn't have a separate barrier insn until `mfence`. – Peter Cordes, Feb 03 '16 at 18:15
Possible duplicate of [Is x86 CMPXCHG atomic?](https://stackoverflow.com/questions/27837731/is-x86-cmpxchg-atomic) — Evan Carroll, Oct 19 '18 at 19:48
@PeterCordes should this be dupped hammered, seems we have a lot of these also here https://stackoverflow.com/questions/17020128/why-we-need-lock-prefix-before-cmpxchg — Evan Carroll, Oct 20 '18 at 20:08
@EvanCarroll: This isn't a duplicate: it seems to be asking what the point of `lock` is if it can't be used to make a `mov` atomic. (But it's totally mixed up about making the whole RMW atomic vs. making one `mov` atomic.) It's not asking about CMPXCHG; if anything the instruction they want is `lock add` or `lock xadd`. But yes, I duphammered your 2nd link, like I tried to years ago before I had the rep to do it alone. — Peter Cordes, Oct 20 '18 at 22:05

score 12 · Accepted Answer · answered Jun 16 '12 at 17:45

I certainly would not call lock useless. lock cmpxchg is the standard way to perform compare-and-swap, which is the basic building block of many synchronization algorithms.

Also, see fetch-and-add.
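As a rough illustration of the two primitives named above, here is a minimal sketch using C11 `<stdatomic.h>` and POSIX threads. The names `shared_counter`, `increment_worker`, `atomic_store_max`, and `run_counter_demo` are hypothetical and not from the original post. On x86, compilers typically lower `atomic_fetch_add` to a `lock`-prefixed `add` or `xadd`, and `atomic_compare_exchange_weak` to `lock cmpxchg`, but the exact instructions depend on the compiler and flags.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4
#define INCREMENTS_PER_THREAD 100000

/* Hypothetical shared counter, playing the role of *p in the question. */
static atomic_int shared_counter;

/* Fetch-and-add: an atomic version of ++(*p). */
static void *increment_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        atomic_fetch_add(&shared_counter, 1);
    }
    return NULL;
}

/* Compare-and-swap loop: atomically raise *target to at least value. */
static void atomic_store_max(atomic_int *target, int value) {
    int observed = atomic_load(target);
    while (observed < value &&
           !atomic_compare_exchange_weak(target, &observed, value)) {
        /* On failure, observed now holds the current value; retry. */
    }
}

/* Hypothetical driver: runs the workers and returns the final count. */
static int run_counter_demo(void) {
    pthread_t threads[NUM_THREADS];
    atomic_store(&shared_counter, 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, increment_worker, NULL);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&shared_counter);
}
```

Calling `run_counter_demo()` should return `NUM_THREADS * INCREMENTS_PER_THREAD`, because no increment is lost. With a plain `int` and `++`, which compiles to a separate load, add, and store like the assembly in the question, the result could come out lower.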

NPE

score 5 · Answer 2 · answered Jun 16 '12 at 17:45

The purpose of lock is to make operations atomic, not serialized. In this way the CPU cannot be preempted before the operation takes effect.

Ignacio Vazquez-Abrams

Thanks for your response. I am not very sure about this terminology. I guess atomic means the operation is done as a whole, without interruption, or else canceled. In this problem, even if "lock add" is done in an atomic manner, it does not mean another processor cannot access that memory location stealthily at the same time. So what "lock" does is prevent parallel access to the same memory location. I guess this is called serialized: making every thread access the memory one by one. – Sean Jun 16 '12 at 18:06
Atomic operations are primitives used *for* serialization, but they are not serialized themselves; serialization refers to multiple entities *performing the same operation* one at a time, whereas atomic operations perform an arbitrary operation in one discrete motion, undisturbed by others. – Ignacio Vazquez-Abrams Jun 16 '12 at 18:24
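To make the distinction concrete, here is a minimal sketch of how a single atomic primitive can serialize a block of ordinary, non-atomic code: a spinlock built on C11 `atomic_flag`. The names `spin_lock`, `spin_unlock`, `plain_counter`, and `run_spinlock_demo` are hypothetical. On x86, the test-and-set is typically compiled to `xchg`, which is implicitly locked when it has a memory operand. This only illustrates the idea; real code would normally use a proper mutex such as `pthread_mutex_t`.

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_THREADS 4
#define SPIN_ITERATIONS 50000

static atomic_flag spin = ATOMIC_FLAG_INIT;
static long plain_counter; /* ordinary variable, protected by the lock */

/* The atomic test-and-set is the only atomic step; it admits one thread. */
static void spin_lock(void) {
    while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire)) {
        /* Busy-wait until the current holder clears the flag. */
    }
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&spin, memory_order_release);
}

static void *locked_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < SPIN_ITERATIONS; i++) {
        spin_lock();
        /* Multi-instruction read-modify-write, serialized by the lock. */
        plain_counter = plain_counter + 1;
        spin_unlock();
    }
    return NULL;
}

/* Hypothetical driver: returns the final value of plain_counter. */
static long run_spinlock_demo(void) {
    pthread_t threads[SPIN_THREADS];
    plain_counter = 0;
    for (int i = 0; i < SPIN_THREADS; i++) {
        pthread_create(&threads[i], NULL, locked_worker, NULL);
    }
    for (int i = 0; i < SPIN_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return plain_counter;
}
```

The critical section here is the same load, add, store sequence as in the question. It becomes safe because only one thread at a time can be between `spin_lock()` and `spin_unlock()`, not because each instruction is locked.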

score 3 · Answer 3 · edited May 23 '17 at 12:32

The x86 processors are known for a hairy design with lots of features, lots of rules, and even more exceptions to all those rules. This is related to the long history of the family.

When compilers or people are using LOCK, they are always using it with all its limitations, often on data specially introduced to perform synchronization between threads, as opposed to application data that the algorithms eventually manipulate. One then adapts the thread synchronization protocols to what LOCK can do for them, rather than vice versa.

The general type of instruction you seem to be looking for is called a memory barrier. Indeed, x86 has several "modern" instructions from this family (MFENCE, LFENCE, SFENCE). They are a full fence, a load fence, and a store fence, respectively. However, their importance in the instruction set is limited to SSE, because Intel guarantees serialization of writes on the traditional part of the instruction set, and that is pretty much the reason why this aged architecture is quite an easy target for multithreaded programming.
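For comparison, here is a minimal sketch of barriers expressed portably with C11 `atomic_thread_fence`, using a simple message-passing pattern. The names `payload`, `ready`, `producer`, `consumer`, and `run_message_demo` are hypothetical. On x86, release and acquire fences usually compile to no instruction at all and only constrain the compiler, which fits the point about ordinary stores being ordered. A `memory_order_seq_cst` fence is typically emitted as `mfence` or a dummy `lock`-prefixed instruction.

```c
#include <pthread.h>
#include <stdatomic.h>

static int payload;      /* ordinary data being published */
static atomic_int ready; /* flag that signals the data is valid */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;
    /* Release fence: the payload write is ordered before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg) {
    int *result = arg;
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0) {
        /* Spin until the producer sets the flag. */
    }
    /* Acquire fence: the payload read is ordered after the flag load. */
    atomic_thread_fence(memory_order_acquire);
    *result = payload;
    return NULL;
}

/* Hypothetical driver: returns the value the consumer observed. */
static int run_message_demo(void) {
    pthread_t producer_thread, consumer_thread;
    int result = 0;
    payload = 0;
    atomic_store(&ready, 0);
    pthread_create(&consumer_thread, NULL, consumer, &result);
    pthread_create(&producer_thread, NULL, producer, NULL);
    pthread_join(producer_thread, NULL);
    pthread_join(consumer_thread, NULL);
    return result;
}
```

Under the C11 memory model, the consumer is guaranteed to see the payload written before the flag was set, so `run_message_demo()` should return 42.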
