34

Consider an SSE instruction on an x86 CPU that performs a single memory access (a single read or a single write, not read+write). The instruction accesses 16 bytes (128 bits) of memory, and the accessed memory location is aligned to 16 bytes.

The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.

The question: Do there exist Intel/AMD/etc. x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16-byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine it: PDF document lookup, brute-force testing, a mathematical proof, or whatever other method you used.

This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html


Update:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.

// Compile with:
//   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.

#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>     // for abort()

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;

unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n1[mask]++;

                x = (v4si){0,0,0,0};
        }
        return NULL;
}

void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n2[mask]++;

                x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
}

int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
                abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (unsigned i=0; i<16; i++) {
                for (int j=3; j>=0; j--)
                        printf("%d", (i>>j)&1);

                printf("  %10u %10u", n1[i], n2[i]);
                if(i>0 && i<0x0f) {
                        if(n1[i] || n2[i])
                                printf("  Not a single memory access!");
                }

                printf("\n");
        }

        return 0;
}

The CPU I have in my notebook is a Core Duo (not Core2). This particular CPU fails the test: it implements 16-byte memory reads/writes with a granularity of 8 bytes. In the output below, each movemask bit is the sign bit of one 32-bit lane, so the nonzero counts for 0011 and 1100 mean a reader observed exactly one 8-byte half of the vector updated. The output is:

0000    96905702      10512
0001           0          0
0010           0          0
0011          22      12924  Not a single memory access!
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100     3092557       1175  Not a single memory access!
1101           0          0
1110           0          0
1111        1719   99975389
  • Really? And what do you think will happen when you install only 1 memory module in a Core2 motherboard? If you happen to have a Core2 CPU (or other "modern" x86-64 CPU), try installing only 1 memory module in your machine, actually run the test program I provided, and then please post your results. Thanks. –  Oct 05 '11 at 10:46
  • 5
    To clarify the text I wrote in the comment which starts with "Really? ...". The comment was a response to a previous message from a person who believed that my StackOverflow question is related to FSB or DRAM. In a way, it is somewhat related to FSB and DRAM, but the relationship of my question to FSB and DRAM is insignificant and does not play a major role. ... The person then deleted his/her own answer and comments, thus effectively erasing them from the historical record. **But, if you delete yourself from history, who will be able to remember you? Nobody.** –  Oct 05 '11 at 11:20
  • That's a really interesting discussion, thanks! It raises a question though: if 8 bytes is the maximum size of atomic memory accesses, does that mean that 80-bit extended floats are non-atomic? – Jens Jan 24 '14 at 01:51
  • 2
    @Jens: Yes, 80bit FPU loads are probably not atomic. The ISA doesn't guarantee it (across all implementations), and in practice they're probably not atomic on most recent CPUs. e.g. `FLD m80` on Intel Haswell is 4 total uops: 2 ALU and 2 load-port. So it might well be implemented internally as a 64b load and a 16b load. 80bit FP is *not* a performance priority. 80bit FP stores take 7 uops, with a throughput of one per 5 cycles. – Peter Cordes Oct 10 '15 at 02:50

7 Answers

43

In the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, which nowadays contains the specifications from the memory ordering white paper you mention, section 8.1.1 says:

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte.

  • Reading or writing a word aligned on a 16-bit boundary.

  • Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary.

  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:

  • MOVAPD, MOVAPS, and MOVDQA.
  • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
  • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

Each of the writes x = (v4si){0,0,0,0} and x = (v4si){-1,-1,-1,-1} is probably compiled into a single 16-byte MOVAPS. The address of x is 16-byte aligned. On an Intel processor that supports AVX, these writes are atomic. Otherwise, they are not guaranteed to be atomic.
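
For illustration, here is a minimal C sketch (my own code and naming, not from the manual) of gating on that guarantee at runtime. It assumes GCC or Clang, whose <cpuid.h> provides __get_cpuid and bit_AVX:

#include <emmintrin.h>  // _mm_load_si128 / _mm_store_si128 (MOVDQA)
#include <cpuid.h>      // __get_cpuid, bit_AVX (GCC/Clang)

// Nonzero iff CPUID.01H:ECX.AVX[bit 28] is set, i.e. the 16-byte
// atomicity guarantee quoted above applies.
static int have_avx(void) {
    unsigned a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & bit_AVX);
}

// p must be 16-byte aligned; each of these compiles to a single MOVDQA
// (VMOVDQA with -mavx), which is the guaranteed-atomic case.
static __m128i load16(const __m128i *p)       { return _mm_load_si128(p); }
static void    store16(__m128i *p, __m128i v) { _mm_store_si128(p, v); }

If have_avx() returns false, the same instructions still execute, but without the documented atomicity guarantee.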

On AMD processors, AMD64 Architecture Programmer's Manual, Section 3.9.1.3 states that

Single load or store operations (from instructions that do just a single load or store) are naturally atomic on any AMD64 processor as long as they do not cross an aligned 8-byte boundary. Accesses up to eight bytes in size which do cross such a boundary may be performed atomically using certain instructions with a lock prefix, such as XCHG, CMPXCHG or CMPXCHG8B, as long as all such accesses are done using the same technique. (Note that misaligned locked accesses may be subject to heavy performance penalties.) CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions).

AMD processors thus do not guarantee that AVX instructions provide 16-byte atomicity.

On Intel processors that don't support AVX, and on AMD processors, the CMPXCHG16B instruction with the LOCK prefix can be used. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
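
Here is a hedged sketch of that fallback (my own code, assuming x86-64 with GCC or Clang, compiled with -mcx16 so the __sync builtins emit LOCK CMPXCHG16B):

#include <cpuid.h>   // __get_cpuid, bit_CX16 (GCC/Clang)

typedef unsigned __int128 u128;  // objects must be 16-byte aligned

// Check the "CX16" feature bit: CPUID.01H:ECX[bit 13].
static int have_cx16(void) {
    unsigned a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & bit_CX16);
}

// Atomic 16-byte load: a CAS whose expected and desired values are equal
// never changes memory but always returns the old contents.
static u128 load16(u128 *p) {
    return __sync_val_compare_and_swap(p, (u128)0, (u128)0);
}

// Atomic 16-byte store: the usual CAS retry loop.
static void store16(u128 *p, u128 v) {
    u128 old = load16(p), seen;
    while ((seen = __sync_val_compare_and_swap(p, old, v)) != old)
        old = seen;
}

Note that this "load" is still a locked RMW: it requires writable memory and dirties the cache line, so concurrent readers serialize instead of scaling.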

EDIT: Test program results

(Test program modified to increase #iterations by a factor of 10)

On a Xeon X3450 (x86-64):

0000   999998139       1572
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        1861  999998428

On a Xeon 5150 (32-bit):

0000   999243100     283087
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111      756900  999716913

On an Opteron 2435 (x86-64):

0000   999995893       1901
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        4107  999998099

Note that the Intel Xeon X3450 and Xeon 5150 don't support AVX. The Opteron 2435 is an AMD processor.

Does this mean that Intel and/or AMD guarantee that 16-byte memory accesses are atomic on these machines? IMHO, it does not. The behavior is not documented as a guaranteed architectural property, so one cannot know whether 16-byte memory accesses on these particular processors really are atomic, or whether the test program merely fails to trigger the non-atomic cases for one reason or another. Relying on it is therefore dangerous.

EDIT 2: How to make the test program fail

Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool, specifying that each thread runs on a separate socket, I got:

0000   999998634       5990
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          1  Not a single memory access!
1101           0          0
1110           0          0
1111        1366  999994009

So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.

EDIT 3: ASM for the thread functions, on request of "GJ."

Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:


.globl thread2
        .type   thread2, @function
thread2:
.LFB537:
        .cfi_startproc
        movdqa  .LC3(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L11:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n2(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L11
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE537:
        .size   thread2, .-thread2
        .p2align 5,,31
.globl thread1
        .type   thread1, @function
thread1:
.LFB536:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L15:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n1(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L15
        xorl    %eax, %eax
        ret
        .cfi_endproc

and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:


.LC3:
        .long   -1
        .long   -1
        .long   -1
        .long   -1
        .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
        .section        .note.GNU-stack,"",@progbits

Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with -march=native, which makes GCC prefer MOVAPS; but it doesn't matter: if I use -march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.

janneb
  • No! Reading or writing up to 32 bytes is guaranteed by the hardware to be performed atomically if the translated memory access falls within one cache line. Check my answer. – GJ. Oct 07 '11 at 08:30
  • 1
    @GJ. No, you're wrong. The cache line size (64 bytes on most, if not all, x86 machines, FWIW) does not determine the atomicity of memory accesses. – janneb Oct 07 '11 at 08:33
  • True for RMW, but **not** for a single read xor write! If a cache-burst read/write is interrupted, an exception fault is raised. – GJ. Oct 07 '11 at 08:42
  • Check again the Intel manual from your link, section 8.1.1: **Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors.** So, in any other case it is atomic. – GJ. Oct 07 '11 at 08:50
  • 4
    @GJ.: I have no idea what you're trying to say, but anyway, if the processor implements a 16-byte store instruction internally as 2 8-byte stores in the store pipeline (as it's allowed to do per the architectural guarantees provided in the programming manual), it's perfectly possible for another processor to "steal" the cache line in between the two stores. Unlikely yes, but not impossible, as can be seen from the failed test I show in my answer. – janneb Oct 07 '11 at 08:51
  • @janneb: About the guarantee of atomicity of the memory access: Well, in my opinion, based on a **complexity argument**, given the results you provided it is possible to conclude that 16-byte aligned memory accesses via MOVAPS and MOVDQA are **always** atomic. If this isn't enough proof, you might want to replace the for-loop in the test program with a loop that takes 10+ minutes to execute, and let it run while you are doing other tasks on the machine. If you do this, please append the obtained results to your answer - I would like to know whether the complexity argument actually works. –  Oct 07 '11 at 09:11
  • @Atom: Actually, as my last test shows, at least on the Opteron 2435, MOVDQA is NOT atomic. – janneb Oct 07 '11 at 10:22
  • @janneb: Are you running the memory controller on the Opteron 2435 setup in ganged mode or in unganged mode? –  Oct 07 '11 at 10:56
  • @Atom: Unganged, I believe; It's usually the default these days. That being said, I think it's irrelevant for this question since the cache coherency hw will transfer the cache line back and forth without it hitting the memory controllers. – janneb Oct 07 '11 at 11:33
  • @janneb: Huh, I didn't see your second edit... But I still have doubts, because I have never experienced such a problem. Can you show how the debugger sees the thread procedure assembly? – GJ. Oct 07 '11 at 12:19
  • @janneb: Thanks, I have checked the code and it looks OK. I still can't reproduce the error on my PCs in any way. – GJ. Oct 07 '11 at 14:15
  • @janneb: I marked your answer as the correct one. I gave the bounty to GJ. –  Oct 12 '11 at 09:27
  • @janneb, could you please tell me how to do "via the "numactl" tool specifying that each thread runs on a separate socket" – Derek Zhang May 06 '19 at 12:37
  • @DerekZhang: This was done ~8 years ago, so I don't remember exactly. Basically, checking the NUMA topology of the system (see /proc/cpuinfo), then use numactl to allow use of 2 cores on different sockets. E.g. something like "numactl --physcpubind=N,M ./a.out" where N and M are indices of two cpu cores on different sockets. – janneb May 07 '19 at 05:51
  • 1
    https://rigtorp.se/isatomic/ has test results for more modern CPUs. Still a total lack of *documented* guarantees that anything beyond 8 bytes (or `lock cmpxchg16b`) is atomic, though. – Peter Cordes Jun 10 '20 at 17:09
  • 2
    @PeterCordes: I just found out that in the current version (Nov 2021, though it may have been introduced earlier but I haven't followed it) of the Intel manual volume 3A (linked from the answer), certain 16-byte memory operations are guaranteed to be atomic provided the CPU supports AVX. – janneb Jan 11 '22 at 07:01
  • 1
    @janneb: Yeah, I saw Hadi's edit! That's a game changer for compilers, and also about freaking time. (Also excellent that they tied it to an existing feature bit like AVX, so new builds can take advantage of the HW capability that's been there the whole time. Except on Pentium/Celeron budget CPUs; having the upper halves of their SIMD execution units fused off obviously has no impact on atomicity of 16-byte stuff, but they lose out anyway.) – Peter Cordes Jan 11 '22 at 09:02
  • 1
    That EDIT2 is horrific. Literally one-in-a-billion torn accesses, in a test designed specifically to provoke it. Have fun reproducing any bugs based on that one... – TLW Nov 27 '22 at 23:41
5

Update: in 2022, Intel retroactively documented that the AVX feature bit implies that aligned 128-bit loads/stores are atomic, at least on Intel CPUs. AMD could document the same thing, since in practice their CPUs with AVX support have, I think, avoided tearing at 8-byte boundaries. See @whatishappened's answer, and janneb's updated answer.

Pentium and Celeron versions of CPUs with AVX will also have the same atomicity in practice, but there is no documented way for software to detect it. The same presumably holds for Core 2 and Nehalem, and probably for some low-power Silvermont-family chips, which didn't get AVX until the Alder Lake E-cores.

So finally we can have cheapish atomic __int128 loads/stores on AVX CPUs in a well-documented way. (So C++ std::atomic is_lock_free() could return true on some machines. But not is_always_lock_free as a compile-time constant unless arch options make a binary that requires AVX. GCC previously used lock cmpxchg16b to implement load/store, but changed in GCC7 IIRC to not advertise that as "lock free" since it didn't have the read-side scaling you'd expect with proper support.)
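
As a sketch of what that buys (my own code, not GCC's actual libatomic source; it assumes GCC or Clang on x86-64, with -mcx16 for the fallback path), one can dispatch on the AVX feature bit, using a single aligned (V)MOVDQA when it is set and LOCK CMPXCHG16B otherwise:

#include <emmintrin.h>  // _mm_load_si128 etc. (SSE2 is baseline on x86-64)
#include <cpuid.h>      // __get_cpuid, bit_AVX (GCC/Clang)

typedef unsigned __int128 u128;

static int have_avx(void) {
    unsigned a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & bit_AVX);
}

// src must be 16-byte aligned (MOVDQA and CMPXCHG16B both require it).
static u128 atomic_load16(u128 *src) {
    if (have_avx()) {
        // One aligned 16-byte load: documented atomic when AVX is enumerated.
        __m128i v = _mm_load_si128((const __m128i *)src);
        u128 r;
        _mm_storeu_si128((__m128i *)&r, v);  // copy into a private local
        return r;
    }
    // Fallback: a CAS with equal expected/desired values returns the old
    // contents without changing memory, but is a locked RMW (needs -mcx16).
    return __sync_val_compare_and_swap(src, (u128)0, (u128)0);
}

Real code would cache the CPUID result instead of re-querying it on every load; this only shows the shape of the dispatch.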


Old partly-updated answer below

Erik Rigtorp has done some experimental testing on recent Intel and AMD CPUs to look for tearing. Results at https://rigtorp.se/isatomic/. Keep in mind there's no documentation or guarantee about this behaviour (beyond 128 bits, or on non-AVX CPUs), and IDK if it's possible for a custom many-socket machine using such CPUs to have less atomicity than the machines he tested on. But on current x86 CPUs (not K10), SIMD atomicity for aligned loads/stores simply scales with the width of the data path between the execution units and L1d cache.



The x86 ISA only guarantees atomicity for things up to 8B, so implementations are free to implement SSE / AVX support the way Pentium III / Pentium M / Core Duo do: internally, data is handled in 64-bit halves, and a 128-bit store is done as two 64-bit stores. The data path to/from cache is only 64b wide in the Yonah microarchitecture (Core Duo). (Source: Agner Fog's microarch doc.)

More recent implementations do have wider data paths internally, and handle 128b instructions as a single op. Core 2 Duo (conroe/merom) was the first Intel P6-descended microarch with 128b data paths. (IDK about P4, but fortunately it's old enough to be totally irrelevant.)

This is why the OP finds that 128b ops are not atomic on Intel Core Duo (Yonah), but other posters find that they are atomic on later Intel designs, starting with Core 2 (Merom).

The diagrams on this Realworldtech writeup about Merom vs. Yonah show the 128bit path between ALU and L1 data-cache in Merom (and P4), while the low-power Yonah has a 64bit data path. The data path between L1 and L2 cache is 256b in all 3 designs.

The next jump in data path width came with Intel's Haswell, featuring 256b (32B) AVX/AVX2 loads/stores, and a 64Byte path between L1 and L2 cache. I expect that 256b loads/stores are atomic in Haswell, Broadwell, and Skylake, but I don't have one to test.

Skylake-AVX512 has 512-bit data paths, so they're also naturally atomic at least in reading/writing L1d cache. The ring bus (client chips) transfers in 32-byte chunks, but Intel guarantees no tearing between 32B halves, since they guarantee atomicity for 8-byte load/store at any misalignment as long as it doesn't cross a cache-line boundary.

Zen 4 handles 512-bit ops as two halves, so probably not 512-bit atomicity.


As janneb points out in his excellent experimental answer, the cache-coherency protocol between sockets in a multi-core system might be narrower than what you get within a shared-last-level-cache CPU. There is no architectural requirement on atomicity for wide loads/stores, so designers are free to make them atomic within a socket but non-atomic across sockets if that's convenient. IDK how wide the inter-socket logical data path is for AMD's Bulldozer-family, or for Intel. (I say "logical", because even if the data is transferred in smaller chunks, it might not modify a cache line until it's fully received.)


Finding similar articles about AMD CPUs should allow drawing reasonable conclusions about whether 128b ops are atomic or not. Just checking instruction tables is some help:

K8 decodes movaps reg, [mem] to 2 m-ops, while K10 and bulldozer-family decode it to 1 m-op. AMD's low-power bobcat decodes it to 2 ops, while jaguar decodes 128b movaps to 1 m-op. (It supports AVX1 similar to bulldozer-family CPUs: 256b insns (even ALU ops) are split into two 128b ops. Intel SnB only splits 256b loads/stores, while having full-width ALUs.)

janneb's Opteron 2435 is a 6-core Istanbul CPU, which is part of the K10 family, so this single-m-op -> atomic conclusion appears accurate within a single socket.

Intel Silvermont does 128b loads/stores with a single uop, and a throughput of one per clock. This is the same as for integer loads/stores, so it's quite probably atomic.

Peter Cordes
  • I've seen your [comment](https://stackoverflow.com/questions/76139440/is-always-lock-free-gives-true-but-is-lock-free-gives-false-on-macos#comment134277367_76139440) stating some guarantees are retroactively documented. Care to update the answer? – Alex Guteniev Aug 03 '23 at 20:11
  • @AlexGuteniev: Janneb's answer on this question is already updated with that info. But I guess I should correct this answer's statement about an explicit lack of vendor documentation; thanks for pointing that out. Updated. – Peter Cordes Aug 03 '23 at 20:42
5

The "AMD Architecture Programmer's Manual Volume 1: Application Programming" says in section 3.9.1: "CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions)."

However, there is no such comment about SSE instructions. In fact, there is a comment in 4.8.3 that the LOCK prefix "causes an invalid-opcode exception when used with 128-bit media instructions". It therefore seems pretty conclusive to me that the AMD processors do NOT guarantee atomic 128-bit accesses for SSE instructions, and the only way to do an atomic 128-bit access is to use CMPXCHG16B.

The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1" says in 8.1.1 "An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses." This is pretty conclusive that 128-bit SSE instructions are not guaranteed atomic by the ISA. Volume 2A of the Intel docs says of CMPXCHG16B: "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically."

Further, CPU manufacturers haven't published written guarantees of atomic 128b SSE operations for specific CPU models where that is the case.

Anthony Williams
  • CMPXCHG16B instruction means: IF MEM[ADDR]==X THEN MEM[ADDR]:=Y. This implies that you already need to know the value X that is stored in memory at ADDR **before** writing Y. My question assumes that the 128-bit data type at ADDR uses all of the 128 bits - thus any of the (1<<128) possible values can be stored there. So, it is impossible to know whether MEM[ADDR] equals to X or not. (Sidenote: there is a universal method of how to encode a 16-byte value using more than 16 bytes (for example: 24 bytes) so that the 16 bytes can be read/written safely. But that is a different question.) –  Oct 07 '11 at 16:40
  • 5
    You can use `CMPXCHG16B` to read a value. First load any value into `RDX:RAX`, and load the **same** values into `RCX:RBX`. All registers zero would do. Then do `CMPXCHG16B [addr]`. If the values match then the same value is stored back. If they don't match then `RDX:RAX` is updated to the actual value. Either way `RDX:RAX` holds the original stored value, and the memory is unchanged. – Anthony Williams Oct 07 '11 at 21:18
  • About your sentence "This is pretty conclusive that 128-bit SSE instructions are never atomic": There is no evidence to support this sentence on particular CPUs. The answers to my question so far support the claim that on Core2 Quad Q6600, Core2 Duo P8400, Pentium4 hyper-threading, Xeon X3450, Xeon 5150, and 1-socket Opteron 2435, the memory accesses are **always** atomic. Maybe the test program should be modified so that the data goes through the memory chip more often, or the test program should run for a longer period of time (1 hour) while running/doing other tasks on the machine. –  Oct 08 '11 at 13:01
  • 1
    Sorry. I meant never **guaranteed** to be atomic. Particular CPUs may happen to make them atomic, and they may appear atomic under particular test conditions, but there is no guarantee. – Anthony Williams Oct 08 '11 at 18:03
  • I think it's quite likely that some CPU designs in practice *do* have atomic 128b loads/stores, because their wide data paths never split the operation separate parts. The docs never say anything about this, and there isn't a CPUID bit for it, because they don't want people writing code that depends on it. (future low-power CPUs can always implement SIMD with narrower data paths.) But just because no Intel or AMD manual has the guarantee written down doesn't mean there isn't one, e.g. for Sandybridge. And probably 256b is atomic on Haswell. See my answer. Made the edit so I don't have to -1 – Peter Cordes Oct 10 '15 at 04:16
  • @AnthonyWilliams, the problem of using CMPXCHG16B to read a value is that CMPXCHG16B on read-only memory results in segfault. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94649 – zwhconst Oct 13 '21 at 07:28
3

There is actually a warning in the Intel Architecture Manual Vol. 3A, Section 8.1.1 (May 2011), under the list of guaranteed atomic operations:

An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due to a page-table entry that is marked "not present"). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page.

Thus SSE instructions are not guaranteed to be atomic, even if the underlying implementation does use a single memory access (this is one reason why memory fencing was introduced).

Combine that with this statement from the Intel Optimization Manual, Section 13.3 (April 2011)

AVX and FMA instructions do not introduce any new guaranteed atomic memory operations.

and the fact that none of the SIMD load or store operations guarantee atomicity, we can come to the conclusion that Intel doesn't support any form of guaranteed-atomic SIMD (yet).

As an extra bit: if an access is split across a cache line or page boundary (possible with instructions like MOVDQU, which permit unaligned access), the following processors do not guarantee that the access is atomic, although some provide bus control signals that let an external memory subsystem make it atomic (again from the Intel Architecture Manual):

Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic.

Necrolis
  • The question explicitly states that all the 16-byte memory acceses are aligned to 16 bytes. Thus none of them, by construction, crosses cache lines or page boundaries. –  Oct 07 '11 at 09:15
  • About the wording "may be implemented using multiple memory accesses" in the Intel Architecture Manual: My question is whether there exist **concrete physical** CPUs which happen to be implemented so that the memory accesses are always atomic. –  Oct 07 '11 at 09:19
  • @Necrolis: Yes, and **movdqa** doesn't perform any SIMD operation; it is only a memory-to-register or register-to-memory move. – GJ. Oct 07 '11 at 10:03
  • @Atom: That wording basically means: "there are none following this in our current product range, but we are free to add it in *future*", ie: there is no current processor that officially does this (from Intel), and tests at this level are unreliable as unofficial proof. The cache split/page boundary stuff was just some extra misc info, mainly for things like `movdqu`. – Necrolis Oct 07 '11 at 10:52
  • @GJ: its still an SSE2 instruction, so it falls under the umbrella of Streaming SIMD (especially since the loads and stores are of *multiple packed values*). and btw, it also works on register to register moves :P – Necrolis Oct 07 '11 at 10:53
  • @Necrolis: Yes, this is true for loads and stores of multiple packed values, but not for a single 16-byte read or 16-byte write. – GJ. Oct 07 '11 at 12:01
2

It looks like AMD will also specify in the next revision of their manual that aligned 16b loads and stores are atomic on their x86 processors which support AVX. (Source)

Apologies for late response!

We would update the AMD APM manuals in the next revision.

For all AMD architectures,

Processors that support AVX extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

which means all 128b instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.

Can we extend this patch to AMD processors as well. If not, I will plan to submit the patch for stage-1!

With this, the patch making libatomic use vmovdqa in its implementation of __atomic_load_16 and __atomic_store_16, not only on Intel processors with AVX but also on AMD processors with AVX, has landed on the master branch.
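
For reference, a minimal sketch (my own, not part of the patch) of how C code reaches those libatomic entry points; it assumes GCC or Clang and linking with -latomic:

typedef unsigned __int128 u128;

static u128 shared __attribute__((aligned(16)));

u128 read_shared(void) {
    u128 v;
    // For 16-byte types this calls libatomic's __atomic_load_16, which the
    // patch implements with VMOVDQA when the CPU supports AVX.
    __atomic_load(&shared, &v, __ATOMIC_SEQ_CST);
    return v;
}

void write_shared(u128 v) {
    __atomic_store(&shared, &v, __ATOMIC_SEQ_CST);  // __atomic_store_16
}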

whatishappened
-1

EDIT: In the last two days I have made several tests on my three PCs and I couldn't reproduce any memory error, so I can't say anything more precise. Maybe this memory error also depends on the OS.

EDIT: I'm programming in Delphi and not in C, but I understand C. So I have translated the code; here are the thread procedures, where the main part is written in assembler:

procedure TThread1.Execute;
var
  n             :cardinal;
const
  ConstAll0     :array[0..3] of integer =(0,0,0,0);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n1 + eax *4]
      movdqu    xmm0, dqword [ConstAll0]
      movdqa    dqword [x], xmm0
    end;
end;

{ TThread2 }

procedure TThread2.Execute;
var
  n             :cardinal;
const
  ConstAll1     :array[0..3] of integer =(-1,-1,-1,-1);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n2 + eax *4]
      movdqu    xmm0, dqword [ConstAll1]
      movdqa    dqword [x], xmm0
    end;
end;

Result: no mistake on my quad core PC and no mistake on my dual core PC as expected!

  1. PC with Intel Pentium4 CPU
  2. PC with Intel Core2 Quad CPU Q6600
  3. PC with Intel Core2 Duo CPU P8400

Can you show how the debugger sees your thread procedure code? Please...

GJ.
  • FWIW, when I ran the tests for my answer, I checked that GCC 4.1 and 4.4 (x86 and x86-64) use movdqa when storing to x. – janneb Oct 07 '11 at 06:19
  • @janneb: OK, I have translated the main thread part of the code to mnemonics and tested it on two PCs. Result: no mistake! – GJ. Oct 07 '11 at 07:17
  • @GJ: Thanks. Could you please update your answer to include the kind(s) of CPU(s) you tested - just like "janneb" did. –  Oct 07 '11 at 08:51
  • @GJ: About the Pentium4 CPU you tested: Does it have multiple cores and/or hyper-threading? –  Oct 07 '11 at 16:47
  • @Atom: The first one has hyper-threading. The other two are working with 4 and 2 cores. OS: the first two run Windows XP and the last one Windows Vista. – GJ. Oct 07 '11 at 18:53
  • @GJ: I marked janneb's answer as the correct one. The bounty goes to you. –  Oct 12 '11 at 09:26
  • https://rigtorp.se/isatomic/ has test results for more modern CPUs. – Peter Cordes Jun 10 '20 at 17:06
-1

A lot of answers have been posted so far, so a lot of information is already available (and, as a side effect, a lot of confusion too). I would like to cite facts from the Intel manual regarding hardware-guaranteed atomic operations...

In Intel's latest processors of the Nehalem and Sandy Bridge families, reading or writing a quadword aligned to a 64-bit boundary is guaranteed to be atomic.

Even unaligned 2-, 4-, or 8-byte reads or writes are guaranteed to be atomic, provided they target cached memory and fit within a cache line.

Having said that, the test posted in this question passes on a Sandy Bridge based Intel i5 processor.

Nitin Kunal
  • 1
    This question is specifically about 16-byte reads/writes; a quadword is 8 bytes. In any case, thanks for running the test program. –  Oct 10 '11 at 11:52