
I'm trying to implement an atomic copy of multiple data elements between CPUs. I packed the elements into a single cache line so I could manipulate them atomically, and wrote the following code.

In this code (compiled with -O3), I aligned a global struct to a single cache line, and on one CPU I set the elements and then issue a store barrier, intending to make the whole line globally visible to the other CPU.

On the other CPU, I use a load barrier to access the cache line atomically. My expectation was that the reader (consumer) CPU would bring the whole cache line into its own cache hierarchy (L1, L2, etc.), and since I don't issue another load barrier until the next read, all the elements of the data would be consistent with each other. But it does not work as expected: I can't keep the cache line atomic in this code. The writer CPU seems to be putting elements into the cache line piece by piece. How is that possible?

#include <emmintrin.h>
#include <pthread.h>
#include "common.h"

#define CACHE_LINE_SIZE             64

struct levels {
    uint32_t x1;
    uint32_t x2;
    uint32_t x3;
    uint32_t x4;
    uint32_t x5;
    uint32_t x6;
    uint32_t x7;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct levels g_shared;

void *worker_loop(void *param)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(15, &cpuset);

    pthread_t thread = pthread_self();

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    struct levels shared;
    while (1) {

        _mm_lfence();
        shared = g_shared;

        if (shared.x1 != shared.x7) {
            printf("%u %u %u %u %u %u %u\n",
                    shared.x1, shared.x2, shared.x3, shared.x4, shared.x5, shared.x6, shared.x7);
            exit(EXIT_FAILURE);
        }
    }

    return NULL;
}

int main(int argc, char *argv[])
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(16, &cpuset);

    pthread_t thread = pthread_self();

    memset(&g_shared, 0, sizeof(g_shared));

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    pthread_t worker;
    int istatus = pthread_create(&worker, NULL, worker_loop, NULL);
    fatal_elog_if(istatus != 0);

    uint32_t val = 0;
    while (1) {
        g_shared.x1 = val;
        g_shared.x2 = val;
        g_shared.x3 = val;
        g_shared.x4 = val;
        g_shared.x5 = val;
        g_shared.x6 = val;
        g_shared.x7 = val;

        _mm_sfence();
        // _mm_clflush(&g_shared);

        val++;
    }

    return EXIT_SUCCESS;
}

The output looks like this:

3782063 3782063 3782062 3782062 3782062 3782062 3782062

UPDATE 1

I updated the code as below using AVX512, but the problem is still there.

#include <emmintrin.h>
#include <pthread.h>
#include "common.h"
#include <immintrin.h>

#define CACHE_LINE_SIZE             64

/**
 * Copy 64 bytes from one location to another,
 * locations should not overlap.
 */
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_load_si512((const void *)src);
        _mm512_store_si512((void *)dst, zmm0);
}

struct levels {
    uint32_t x1;
    uint32_t x2;
    uint32_t x3;
    uint32_t x4;
    uint32_t x5;
    uint32_t x6;
    uint32_t x7;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct levels g_shared;

void *worker_loop(void *param)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(15, &cpuset);

    pthread_t thread = pthread_self();

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    struct levels shared;
    while (1) {
        mov64((uint8_t *)&shared, (uint8_t *)&g_shared);
        // shared = g_shared;

        if (shared.x1 != shared.x7) {
            printf("%u %u %u %u %u %u %u\n",
                    shared.x1, shared.x2, shared.x3, shared.x4, shared.x5, shared.x6, shared.x7);
            exit(EXIT_FAILURE);
        } else {
            printf("%u %u\n", shared.x1, shared.x7);
        }
    }

    return NULL;
}

int main(int argc, char *argv[])
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(16, &cpuset);

    pthread_t thread = pthread_self();

    memset(&g_shared, 0, sizeof(g_shared));

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    pthread_t worker;
    int istatus = pthread_create(&worker, NULL, worker_loop, NULL);
    fatal_elog_if(istatus != 0);

    uint32_t val = 0;
    while (1) {
        g_shared.x1 = val;
        g_shared.x2 = val;
        g_shared.x3 = val;
        g_shared.x4 = val;
        g_shared.x5 = val;
        g_shared.x6 = val;
        g_shared.x7 = val;

        _mm_sfence();
        // _mm_clflush(&g_shared);

        val++;
    }

    return EXIT_SUCCESS;
}
avatli
  • *I used an load barrier to access the cacheline atomically* No, barriers do not *create* atomicity. They only order your own operations, not stop operations from other threads from appearing between two of our own. Non-atomicity happens when another thread's store becomes visible between two of our loads. – Peter Cordes Jul 12 '19 at 12:22
  • Then how can I access the cacheline atomically? – avatli Jul 12 '19 at 12:25
  • You can't without faking it via locks. Or with AVX512 on a CPU which happens to implement it in a way that makes aligned 64-byte loads/stores atomic even though there's no on-paper guarantee. – Peter Cordes Jul 12 '19 at 12:27
  • What hardware are you testing on? Are you emulating AVX512 with SDE or something? Or do you really have a Skylake-AVX512? Oh wait a minute, **you're writing the cache line non-atomically, with 7x `uint32_t` assignments**. That's not even a full-line write, so the compiler couldn't auto-vectorize it to a ZMM store. If you want to try AVX512, you need to use `mov64` for both read and write. I think you can probably use a masked store and still have it be atomic for the 60 bytes you do write, though. – Peter Cordes Jul 16 '19 at 01:17
  • @PeterCordes, I have a Skylake and I know I'm writing the cache line with 7x uint32_t, but since I compiled it with -O3, the other CPU should not have seen the changes until sfence runs, yet it does. I don't know how, but the CPU sometimes speculatively fetches the line even without memory barriers. Thanks a lot; I was able to make it cache-line atomic using a union of __m512i and struct levels with _mm512_store_si512/_mm512_load_si512. – avatli Jul 17 '19 at 06:10
  • No, `sfence` just makes later stores wait for earlier stores. But the CPU already does that anyway for non-NT stores. `sfence` is 100% useless here. All you need is `atomic_thread_fence(mo_release)` or `asm("" ::: "memory")` to force the stores to happen at all instead of being optimized away until after the (infinite) loop because reading from another thread is data-race UB so can be assumed to not happen. Like my answer says, barriers do not create atomicity. The CPU tries to commit retired stores from the store buffer into L1d cache as quickly as possible (but in program order). – Peter Cordes Jul 17 '19 at 06:25
  • [When should I use _mm_sfence, _mm_lfence, and _mm_mfence](https://stackoverflow.com/a/50780314) – Peter Cordes Jul 17 '19 at 06:26
  • And BTW, no, 7x `uint32_t` is not a *full* cache line. You're 1 `uint32_t` short of 64 bytes. – Peter Cordes Jul 17 '19 at 06:27

1 Answer


*I used a load barrier to access the cacheline atomically*

No, barriers do not create atomicity. They only order your own operations; they don't stop operations from other threads from appearing between two of your own.

Non-atomicity happens when another thread's store becomes visible between two of your loads. lfence does nothing to stop that.

lfence here is pointless; it just makes the CPU running this thread stall until it drains its ROB/RS before executing the loads. (lfence serializes execution, but has no effect on memory ordering unless you're using NT loads from WC memory e.g. video RAM).


Your options are:

  • Recognize that this is an X-Y problem and do something that doesn't require 64-byte atomic loads/stores, e.g. atomically update a pointer to non-atomic data. The general case of that is RCU, or perhaps a lock-free queue using a circular buffer.
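
A minimal sketch of that pointer-publishing idea in C11 (not the code from the question; the names `published`, `publish`, and `snapshot` are invented here, and the writer allocates a fresh snapshot each time, leaving safe reclamation of the old one, which is the part RCU actually solves, out of scope):

#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct levels { uint32_t x1, x2, x3, x4, x5, x6, x7; };

/* Readers only ever dereference a pointer they loaded atomically. */
static _Atomic(struct levels *) published;

/* Writer: fill a private copy, then make it visible with one atomic pointer swap. */
void publish(uint32_t val)
{
    struct levels *next = malloc(sizeof *next);
    next->x1 = next->x2 = next->x3 = next->x4 =
    next->x5 = next->x6 = next->x7 = val;

    struct levels *old = atomic_exchange_explicit(&published, next,
                                                  memory_order_acq_rel);
    /* A reader may still hold `old`, so it can't simply be freed here;
     * deferring that reclamation is exactly what RCU / hazard pointers do.
     * It is leaked in this sketch to keep it short. */
    (void)old;
}

/* Reader: one acquire load yields a pointer to an immutable snapshot.
 * Assumes publish() ran at least once before any reader calls this. */
struct levels snapshot(void)
{
    return *atomic_load_explicit(&published, memory_order_acquire);
}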

Or

  • Use a software lock to get logical atomicity (like _Atomic struct levels g_shared; with C11) for threads that agree to cooperate by respecting the lock.

    A SeqLock might be a good choice for this data if it's read more often than it changes, or especially with a single writer and multiple readers. Readers retry when tearing may have been possible; check a sequence number before/after the read, using sufficient memory-ordering. See Implementing 64 bit atomic counter with 32 bit atomics for a C++11 implementation; C11 is easier because C allows assignment from a volatile struct to a non-volatile temporary.
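
A rough single-writer SeqLock sketch in C11, following the pattern described above (the payload is a volatile struct so the copies aren't optimized away, per the note about C11; the names `seqlock_write`/`seqlock_read` are illustrative):

#include <stdatomic.h>
#include <stdint.h>

struct levels { uint32_t x1, x2, x3, x4, x5, x6, x7; };

static volatile struct levels g_data;   /* payload; may tear, detected via g_seq */
static _Atomic uint32_t g_seq;          /* even = stable, odd = write in progress */

/* Single writer: make the count odd, write the payload, make it even again. */
void seqlock_write(const struct levels *src)
{
    uint32_t s = atomic_load_explicit(&g_seq, memory_order_relaxed);
    atomic_store_explicit(&g_seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* odd count before payload stores */
    g_data = *src;
    atomic_store_explicit(&g_seq, s + 2, memory_order_release); /* payload before even count */
}

/* Readers: retry whenever a write overlapped the copy. */
struct levels seqlock_read(void)
{
    struct levels tmp;
    uint32_t s0, s1;
    do {
        s0  = atomic_load_explicit(&g_seq, memory_order_acquire);
        tmp = g_data;                                /* possibly torn copy */
        atomic_thread_fence(memory_order_acquire);   /* copy before the re-check */
        s1  = atomic_load_explicit(&g_seq, memory_order_relaxed);
    } while (s0 != s1 || (s0 & 1));
    return tmp;
}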

Or hardware-supported 64-byte atomicity:

  • Intel transactional memory (TSX) available on some CPUs. This would even let you do an atomic RMW on it, or atomically read from one location and write to another. But more complex transactions are more likely to abort. Putting 4x 16-byte or 2x 32-byte loads into a transaction should hopefully not abort very often even under contention. The same goes for grouping stores into a separate transaction. (Hopefully the compiler is smart enough to end the transaction with the loaded data still in registers, so it doesn't have to be atomically stored to a local on the stack, too.)

    There are GNU C/C++ extensions for transactional memory. https://gcc.gnu.org/wiki/TransactionalMemory
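
For instance, a hedged sketch with the RTM intrinsics from `immintrin.h` (compile with `-mrtm`, and in real code check the CPUID `rtm` feature bit and provide a lock fallback; none of this is from the question's code):

#include <immintrin.h>
#include <stdint.h>

struct levels { uint32_t x1, x2, x3, x4, x5, x6, x7; };
extern struct levels g_shared;   /* the shared line from the question */

/* Try to snapshot the whole struct inside one hardware transaction.
 * Returns 0 on success; nonzero means the transaction aborted and the
 * caller should retry or fall back to a real lock. */
int txn_read(struct levels *out)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *out = g_shared;         /* any concurrent write to this line aborts us */
        _xend();
        return 0;
    }
    return 1;
}

/* The writer can wrap its 7 stores the same way so the whole update
 * commits (or aborts) as one unit. */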

  • AVX512 (allowing a full-cache-line load or store) on a CPU which happens to implement it in a way that makes aligned 64-byte loads/stores atomic. There's no on-paper guarantee that anything wider than an 8-byte load/store is ever atomic on x86, except for lock cmpxchg16b and movdir64b.

    In practice we're fairly sure that modern Intel CPUs like Skylake transfer whole cache-lines atomically between cores, unlike AMD. And we know that on Intel (not AMD) a vector load or store that doesn't cross a cache-line boundary does make a single access to L1d cache, transferring all the bits in the same clock cycle. So an aligned vmovaps zmm, [mem] on Skylake-avx512 should in practice be atomic, unless you have an exotic chipset that glues many sockets together in a way that creates tearing. (Multi-socket K10 vs. single-socket K10 is a good cautionary tale: Why is integer assignment on a naturally aligned variable atomic on x86?)
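
That is essentially what the question's author reports doing in the comments above: alias the struct with a `__m512i` in a union so both writer and reader touch the line with single 64-byte vector instructions. A sketch under those assumptions (`-mavx512f`, 64-byte alignment, and only the de-facto, not documented, atomicity):

#include <immintrin.h>
#include <stdint.h>

struct levels { uint32_t x1, x2, x3, x4, x5, x6, x7; };

/* Pad the 28-byte struct out to a full, aligned 64-byte line. */
union line64 {
    struct levels l;
    __m512i       raw;
} __attribute__((aligned(64)));

static union line64 g_line;

/* Writer: build the new contents locally, then store the whole line at once. */
static inline void line_write(const struct levels *src)
{
    union line64 tmp = { .l = *src };
    _mm512_store_si512((void *)&g_line.raw,
                       _mm512_load_si512((const void *)&tmp.raw));
}

/* Reader: one 64-byte load into a local copy, then use the fields from it.
 * In a polling loop the compiler must be kept from hoisting the load out
 * (e.g. with atomic_thread_fence or a "memory" clobber), since g_line is
 * neither volatile nor _Atomic here. */
static inline struct levels line_read(void)
{
    union line64 tmp;
    _mm512_store_si512((void *)&tmp.raw,
                       _mm512_load_si512((const void *)&g_line.raw));
    return tmp.l;
}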

  • MOVDIR64B - only atomic for the store part, and only supported on Intel Tremont (next-gen Goldmont successor). This still doesn't give you a way to do a 64-byte atomic load. Also it's a cache-bypassing store so not good for inter-core communication latency. I think the use-case is generating a full-size PCIe transaction.

See also SSE instructions: which CPUs can do atomic 16B memory operations? re: lack of atomicity guarantees for SIMD load/store. CPU vendors have for some reason not chosen to provide any written guarantees or ways to detect when SIMD loads/stores will be atomic, even though testing has shown that they are on many systems (when you don't cross a cache-line boundary.)

Peter Cordes
  • Another option, esoteric and not on all Intel hardware, would be transactional memory. https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions – Arch D. Robison Jul 13 '19 at 03:39
    Another option would be to atomically update pointer/s to the data (e.g. producer atomically reads a `nextWriteAddress` pointer, writes data to the pointer, then atomically sets `nextReadAddress`; and consumer atomically reads a `nextReadAddress`, reads data from the pointer, then atomically sets a `nextWriteAddress`; so that producer and consumer access different cache lines at the same time and you only need a minimum of 2 copies of the structure). – Brendan Jul 13 '19 at 05:23
  • Another option would be to implement "retry if read wrong". E.g. have a version number at the start and end of the structure, use barriers to make sure the order of writes is "start version, data, end version", use barriers to make sure the order of reads is "start version, data, end version", then make the reader do `do { read(); } while(start version != end version);` – Brendan Jul 13 '19 at 05:29
  • @Brendan: right yeah, I was only thinking about the direct question, not the ultimate XY problem. Good point, RCU or a lock-free queue might work. – Peter Cordes Jul 13 '19 at 05:30
  • @Brendan: "retry if read wrong" is a SeqLock: I did mention that. – Peter Cordes Jul 13 '19 at 05:30
  • @PeterCordes: Sorry - I missed that. :-) – Brendan Jul 13 '19 at 05:38
  • @PeterCordes: I updated the question with new code using AVX512, but it is not working; it seems **not** to be doing a full-cache-line load or store. – avatli Jul 15 '19 at 17:32