Intel 64 and IA-32 | Atomic operations including acquire / release semantics

Question

According to the Intel 64 and IA-32 Architectures Software Developer's Manual, the LOCK signal prefix "ensures that the processor has exclusive use of any shared memory while the signal is asserted". This can take the form of a bus lock or a cache lock.

But - and that's the reason I'm asking this question - it isn't clear to me whether this prefix also acts as a memory barrier.

I'm developing with NASM in a multi-processor environment and need to implement atomic operations with optional acquire and/or release semantics.

So, do I need to use the MFENCE, SFENCE and LFENCE instructions or would this be redundant?

score 7 · Accepted Answer · edited Jun 20 '20 at 09:12

No, there is no need to use the MFENCE, SFENCE and LFENCE instructions in combination with the LOCK prefix.

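The MFENCE, SFENCE and LFENCE instructions order memory operations with respect to each other, and a LOCK-prefixed instruction already acts as a full barrier. The MOV instruction, for instance, can't be used with the LOCK prefix, so when the result of a plain store must be visible to all CPU cores before any later load executes, we need a fence instruction (MFENCE) or a locked instruction such as XCHG. The fences do not flush the CPU cache to RAM; the cache coherency protocol already makes completed stores visible to the other cores.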
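
As a rough illustration of the point above, here is a minimal sketch in C11 `<stdatomic.h>` rather than NASM (an assumption made so the example is portable). The names `users`, `ready`, `acquire_resource`, `release_resource`, `publish_ready` and `is_ready` are hypothetical. On x86-64, GCC and Clang typically compile each read-modify-write below to a single LOCK-prefixed instruction with no separate fence, while the sequentially consistent store becomes either `xchg` or `mov` followed by `mfence`, because `mov` cannot take a LOCK prefix.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared state, for illustration only. */
static atomic_int users = 0;  /* number of holders of some resource */
static atomic_int ready = 0;  /* flag published with a plain store  */

/* Atomic increment with acquire semantics; returns the previous count.
 * On x86-64 this is typically one `lock xadd`, with no LFENCE/MFENCE. */
int acquire_resource(void)
{
    return atomic_fetch_add_explicit(&users, 1, memory_order_acquire);
}

/* Atomic decrement with release semantics; returns the previous count. */
int release_resource(void)
{
    int prev = atomic_fetch_sub_explicit(&users, 1, memory_order_release);
    assert(prev > 0);  /* catches a release without a matching acquire */
    return prev;
}

/* Sequentially consistent store. This is the case where a plain `mov`
 * is not enough on x86: compilers emit `xchg` or `mov` + `mfence`. */
void publish_ready(void)
{
    atomic_store_explicit(&ready, 1, memory_order_seq_cst);
}

/* Acquire load; on x86-64 this is an ordinary `mov`. */
int is_ready(void)
{
    return atomic_load_explicit(&ready, memory_order_acquire);
}
```

In hand-written NASM the same idea would mean using `lock xadd` or `lock inc` on its own for the increment, and reserving `mfence` (or `xchg`) for plain stores that must be ordered before later loads.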

EDIT: more about locked atomic operations from Intel manual:

LOCKED ATOMIC OPERATIONS

The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag. The processor uses three interdependent mechanisms for carrying out locked atomic operations:

• Guaranteed atomic operations

• Bus locking, using the LOCK# signal and the LOCK instruction prefix

• Cache coherency protocols that insure that atomic operations can be carried out on cached data structures (cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors

These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location. The processor also supports bus locking for performing selected memory operations (such as a read-modify-write operation in a shared area of memory) that typically need to be handled atomically, but are not automatically handled this way. Because frequently used memory locations are often cached in a processor’s L1 or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock. Here the processor’s cache coherency protocols insure that other processors that are caching the same memory locations are managed properly while atomic operations are performed on cached memory locations.

Yes, I know. But I want to use (for example) an interlocked increment to signal that I'm acquiring a resource. That means I need to use 'lfence' prior to the increment. Why? Because I must be sure that every prior load operation has finished before I signal. — 0xbadf00d, Jan 27 '11 at 12:31
No, there is no need to do that; it is done automatically. Check my edit about locked atomic operations. — GJ., Jan 27 '11 at 12:59
But an atomic operation has nothing to do with prior or subsequent memory reads and writes?! Check this blog: http://blogs.msdn.com/b/kangsu/archive/2007/07/16/volatile-acquire-release-memory-fences-and-vc2005.aspx — 0xbadf00d, Jan 27 '11 at 13:07
Yes, of course. The cache coherency mechanism only automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area. **So you do not need lfence.** For more information, see chapter 8 of the Intel reference, MULTIPLE-PROCESSOR MANAGEMENT: http://www.intel.com/Assets/PDF/manual/253668.pdf — GJ., Jan 27 '11 at 13:33
To be more clear: you do not need lfence prior to the increment if you are using the lock prefix! — GJ., Jan 27 '11 at 13:41

score 5 · Answer 2 · answered May 31 '13 at 04:25

No. From the IA32 manuals (Volume 3A, Chapter 8.2: Memory Ordering):

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

Therefore, a fence instruction is not needed with locked instructions.
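
To make the quoted rule concrete, here is a minimal spinlock sketch in C11 `<stdatomic.h>` (used here instead of NASM as an assumption, for portability). The type `demo_spinlock` and the functions `spin_trylock`, `spin_lock` and `spin_unlock` are hypothetical names. On x86 the exchange is an `xchg` with a memory operand, which is implicitly locked, so it provides the acquire ordering without any fence; the release is a plain store, which x86 does not reorder with earlier loads or stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock: 0 = free, 1 = held. */
typedef struct { atomic_int state; } demo_spinlock;

/* Try to take the lock once. The atomic exchange is the locked
 * instruction; later reads and writes cannot move above it. */
bool spin_trylock(demo_spinlock *l)
{
    return atomic_exchange_explicit(&l->state, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is taken. */
void spin_lock(demo_spinlock *l)
{
    while (!spin_trylock(l)) {
        /* Wait on plain loads until the lock looks free, then retry. */
        while (atomic_load_explicit(&l->state, memory_order_relaxed) != 0)
            ;
    }
}

/* Release the lock with a release store (a plain `mov` on x86). */
void spin_unlock(demo_spinlock *l)
{
    assert(atomic_load_explicit(&l->state, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

A production lock would normally add a `pause` hint inside the inner loop and some form of backoff; both are left out to keep the sketch short.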

answered May 31 '13 at 04:25 by etherice

That is what I thought. But for whatever reason, today I am unable to get this very simple mutex implementation working, just as a test, because no combination of the lock prefix and/or {l,s}fence seems to prevent both threads from getting the same -1 value. Full source is at [ intel_lock1.c ] : https://drive.google.com/file/d/1je5lNcv7nzS802BJweM4NVUcYYIwfxQn/view?usp=sharing – JVD Jun 02 '18 at 15:18
I guess an mfence really is required, as well as the locked subl / xaddl. – JVD Jun 02 '18 at 15:19

score -2 · Answer 3 · answered Jun 02 '18 at 15:33

The problem still occurs when intel_lock1.c (available at the URL above) is compiled on Linux with GCC 5 or 7 without either of the args '-D_WITH_CLFLUSH_' or '-D_WITH_HLE_' (so that neither CLFLUSH* nor HLE XACQUIRE is used). The mutex_lock assembler now looks like:

# 74 "intel_lock1.c" 1
    LFENCE
    lock subl   $1, lck(%rip)
    rep nop
    SFENCE

So, I'm trying to replace {L,S}FENCE with MFENCE.

I still don't quite understand how two threads can end up with the same -1 value in *lck, though.
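
Without seeing intel_lock1.c it isn't possible to say what goes wrong there, so the following is only a general sketch, written in C11 `<stdatomic.h>` with hypothetical function names `decrement_and_get` and `decrement_then_load`. A `lock subl` updates memory atomically but does not hand the result back to the thread; if the value is then read with a separate load, that load may observe another thread's update, and no fence changes this. When each thread needs its own distinct result, the value has to come out of the same atomic operation, which on x86-64 is typically `lock xadd` (or a `lock cmpxchg` loop).

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counter standing in for `lck`. */
static atomic_int lck = 0;

/* Atomically subtract 1 and return the value after the decrement.
 * The old value is returned by the same atomic operation, so two
 * concurrent callers always get different results. */
int decrement_and_get(void)
{
    return atomic_fetch_sub_explicit(&lck, 1, memory_order_acq_rel) - 1;
}

/* Racy pattern, for contrast: an atomic decrement whose result is
 * discarded, followed by a separate load. Another thread may decrement
 * between the two steps, so two callers can read the same value. */
int decrement_then_load(void)
{
    atomic_fetch_sub_explicit(&lck, 1, memory_order_acq_rel);
    return atomic_load_explicit(&lck, memory_order_acquire);
}
```

In GCC inline assembly the first pattern corresponds to `lock xaddl` with the register operand declared as an in/out operand and a "memory" clobber, so that the compiler neither caches the old value nor moves other accesses across the instruction.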

answered Jun 02 '18 at 15:33 by JVD

Was this supposed to be an edit to your recent question? It's definitely not an answer to anything, and certainly not to this question. – Peter Cordes Jun 02 '18 at 15:41
See updated [ intel_lock1.c ] : https://drive.google.com/file/d/1je5lNcv7nzS802BJweM4NVUcYYIwfxQn/view?usp=sharing – JVD Jun 02 '18 at 16:03
It is towards an answer, at least to me. – JVD Jun 02 '18 at 16:03
The updated version can be compiled with -D_NO_FENCE_ to use no fences at all, or with -D_MFENCE_ to use mfence. Still, the problem occurs: both threads get -1. The specific question I was trying to ask is: 'why do two threads get the same value when both execute lock subl , $addr , 1 and the contents of *addr is 0?' Is there any combination of Intel atomic / locking primitives that enables this decrement / increment to be done such that two threads must always see each other's results and obtain different values? – JVD Jun 02 '18 at 16:09
Oops, sorry, my browser changed tabs on me and I posted these responses to the wrong question - I meant to update only: https://stackoverflow.com/questions/50657795/intel-64-and-ia32-atomic-operations-acquire-release-semantics-and-gcc-5 – JVD Jun 02 '18 at 16:20
