0

Would like to write 256bit of data on one core and read it on another one. So there will be only one process to write and can be multiple readers.

Was thinking to implement it using AVX. The reads and writes should be atomic since they are only 1 instruction (vmovdqa) and if aligned by cache line cache coherency would move the data atomically between cores.

Looked at the generated assembly but can see 2 writes and 2 reads. Why is not there just one? Would this solution for atomic read/write work given the assumptions?

#include <immintrin.h>
#include <cstdint>

struct Data {
     int64_t a[4];
};

struct DataHolder {

    void set_data(Data* in) { 
        _mm256_store_si256(reinterpret_cast<__m256i *>(&data_), *reinterpret_cast<__m256i *>(in));
    }

    void get_data(Data* out) { 
        _mm256_store_si256(reinterpret_cast<__m256i *>(out), *reinterpret_cast<__m256i *>(&data_));
    }

    alignas(64) Data data_;
    char padding [64 - sizeof(Data)];
};

int main() {
    Data a, b;
    DataHolder ab;
    ab.set_data(&a);
    ab.get_data(&b);
}


DataHolder::set_data(Data*):
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        mov     QWORD PTR [rsp-72], rdi
        mov     QWORD PTR [rsp-80], rsi
        mov     rax, QWORD PTR [rsp-80]
        vmovdqa ymm0, YMMWORD PTR [rax]
        mov     rax, QWORD PTR [rsp-72]
        mov     QWORD PTR [rsp-8], rax
        vmovdqa YMMWORD PTR [rsp-64], ymm0
        mov     rax, QWORD PTR [rsp-8]
        vmovdqa ymm0, YMMWORD PTR [rsp-64]
        vmovdqa YMMWORD PTR [rax], ymm0
        nop
        nop
        leave
        ret
DataHolder::get_data(Data*):
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        mov     QWORD PTR [rsp-72], rdi
        mov     QWORD PTR [rsp-80], rsi
        mov     rax, QWORD PTR [rsp-72]
        vmovdqa ymm0, YMMWORD PTR [rax]
        mov     rax, QWORD PTR [rsp-80]
        mov     QWORD PTR [rsp-8], rax
        vmovdqa YMMWORD PTR [rsp-64], ymm0
        mov     rax, QWORD PTR [rsp-8]
        vmovdqa ymm0, YMMWORD PTR [rsp-64]
        vmovdqa YMMWORD PTR [rax], ymm0
        nop
        nop
        leave
        ret
main:
        push    rbp
        mov     rbp, rsp
        and     rsp, -64
        add     rsp, -128
        lea     rdx, [rsp+96]
        mov     rax, rsp
        mov     rsi, rdx
        mov     rdi, rax
        call    DataHolder::set_data(Data*)
        lea     rdx, [rsp+64]
        mov     rax, rsp
        mov     rsi, rdx
        mov     rdi, rax
        call    DataHolder::get_data(Data*)
        mov     eax, 0
        leave
        ret
Peter Kohn
  • 51
  • 4
  • @MichaelChourdakis x86-64 gcc 9.2 – Peter Kohn May 21 '20 at 13:35
  • In the `get_data` I think you want the read to be atomic not the write. As for the bad assembly, you do not seem to have optimization enabled (as evidenced by the frame pointer among other things). – Jester May 21 '20 at 13:51
  • yes, in get_data atomic read, and in set_data atomic write. yes, it is not optimized but I was thinking even if it is not optimized it would generate just one instruction – Peter Kohn May 21 '20 at 14:12
  • 3
    The compiler is set to be too braindamaged. The extra loads/stores are to/from the stack, so they only touch private (non-shared) memory (unless in addition to insanity with assumptions about atomicity you also access objects on other threads' stacks). Note, that being a single instruction has **nothing** to do with atomicity. AVX2 256-bit load/stores are **not at all guaranteed to be atomic**. – EOF May 21 '20 at 14:22
  • @EOF you mean they are not atomic in a sense how they write to a cache line? A partial of the 256 can be written and shared with other cores? – Peter Kohn May 21 '20 at 14:32
  • 1
    You can find the answer in `Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide`, `8.1.1 Guaranteed Atomic Operations`. – EOF May 21 '20 at 14:36
  • 1
    Related: https://stackoverflow.com/questions/46012574/per-element-atomicity-of-vector-load-store-and-gather-scatter and https://stackoverflow.com/questions/30948832/largest-data-type-which-can-be-fetch-anded-atomically – chtz May 21 '20 at 14:41
  • @EOF I see, it is about writing to memory and SSE cannot be used with lock. So the larges that can be done atomically is 16 bytes. Correct? – Peter Kohn May 21 '20 at 14:43
  • 1
    You can only do 16-byte atomic accesses with `cmpxchg16b`, not with SSE/AVX-instructions, AFAIR. – EOF May 21 '20 at 14:45
  • ok, thank you guys – Peter Kohn May 21 '20 at 14:50
  • 1
    *The reads and writes should be atomic since they are only 1 instruction (vmovdqa)* very much not true. AMD before Zen2 decodes YMM loads/stores to 2 separate uops. Intel Sandybridge (before Haswell) runs YMM loads / stores as 1 uop but taking 2 cycles in the load or store ports for the 16-byte halves. – Peter Cordes May 21 '20 at 16:25
  • 1
    That said, in practice on Intel Haswell and later CPUs at least, aligned YMM loads/stores are widely believed to be atomic in practice, and so are the cache-line transfers between cores. But vendors still haven't documented any guarantees at all about that fact so it's not portably usable / no way to detect it. See [SSE instructions: which CPUs can do atomic 16B memory operations?](https://stackoverflow.com/a/7647825) for how subtle the corner-case gotchas can be for 16-byte SSE: Some K10 systems tear at 8-byte boundaries only between sockets of a multi-socket server. – Peter Cordes May 21 '20 at 16:28
  • @PeterCordes ok, so Intel did not say that Haswell and later are atomic? Not sure how to make the decision. I am using the latest Intel processors. – Peter Kohn May 21 '20 at 16:47
  • Yeah, there are no guarantees from vendors about atomicity of anything wider than 16 bytes, and even 16 bytes is only with a slow `lock cmpxcgh16b`. If you have some way of sanity checking or detecting possible future problems, I'd say go for it if it's code running on your own servers, if the performance gain is worth the effort to keep an eye on this possible gotcha in the future. (Unlikely that future Intel CPUs will break this atomicity, but possibly AMD could be different...) – Peter Cordes May 21 '20 at 17:45
  • Also note that `char padding [64 - sizeof(Data)];` is pointless; `alignas(64)` on the Data member makes sure the size is a multiple of 64 so that an array of `DataHolder` objects would have each of them satisfying the alignment requirement. `sizeof(DataHolder)` is at least 64 because it has an `alignas(64)` member. – Peter Cordes May 21 '20 at 18:41

0 Answers0