I have a struct consisting of seven 256-bit vector values (six __m256 and one __m256i), which is stored 32-byte aligned in memory.

    typedef struct
    {
        __m256 xl, xh;
        __m256 yl, yh;
        __m256 zl, zh;
        __m256i co;
    } bloxset8_t;

I achieve the 32-byte alignment by using the posix_memalign() function for dynamically allocated data, or using the (aligned(32)) attribute for statically allocated data.
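For reference, the two allocation paths described above might look like the sketch below. The names `bloxset8_alloc` and `global_set` are hypothetical and not part of my real code; it assumes an x86 target and may need `-mavx` depending on the compiler.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;

/* Statically allocated data: the attribute requests 32-byte alignment. */
bloxset8_t global_set __attribute__((aligned(32)));

/* Hypothetical helper: allocates `count` structs on a 32-byte boundary.
 * Returns NULL on failure; release the result with free(). */
static bloxset8_t *bloxset8_alloc(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, count * sizeof(bloxset8_t)) != 0)
        return NULL;
    /* Sanity check that the allocator honored the requested alignment. */
    assert(((uintptr_t)p % 32) == 0);
    return (bloxset8_t *)p;
}
```

Since `sizeof(bloxset8_t)` is 224 bytes, a multiple of 32, every element of such an array stays 32-byte aligned.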

The alignment is fine, but when I pass two pointers to such structs as the destination and source of memcpy(), the compiler decides to use __memcpy_avx_unaligned() for the copy.

How can I force clang to use the aligned avx memcpy function instead, which I assume is the faster variant?

OS: Ubuntu 16.04.3 LTS, Clang: 3.8.0-2ubuntu4.

UPDATE
__memcpy_avx_unaligned() is invoked only when copying two or more structs. When copying just one, clang emits 14 vmovups instructions.

Bram
  • Untested, but worth a try: I think I've seen this done before by adding an `assert()` before the `memcpy` that asserts that the address is 32-byte aligned. Some compilers can take these hints and use them for optimization. – Jason R Nov 10 '17 at 22:08
  • I could not reproduce this with Clang 3.9 (I get a bunch of `vmovaps`), unfortunately I can't try 3.8 – harold Nov 10 '17 at 22:17
  • @harold __memcpy_avx_unaligned() is used if you copy two or more structs in one go. One struct is indeed done with move instructions, which in my case are unaligned: `vmovups` (and it uses 14 of them). – Bram Nov 10 '17 at 22:28
  • I think for static / automatic storage, you're already fine for alignment. `__m256` implies 32B alignment already. But yes, you should use `aligned_alloc` or `posix_memalign` for dynamic allocation. – Peter Cordes Nov 10 '17 at 23:14

1 Answer

__memcpy_avx_unaligned is just an internal glibc function name. It does not mean that there is a faster __memcpy_avx_aligned function. The name just conveys a hint to the glibc developers about how this memcpy variant is implemented.

The other question is whether it would be faster for the C compiler to emit an inline expansion of memcpy, using seven 32-byte AVX loads and seven stores per struct. The code for that would be larger than the memcpy call, but it might still be faster overall. It may be possible to help the compiler do this with the __builtin_assume_aligned builtin.
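A minimal sketch of that idea, assuming GCC or Clang on x86 (possibly with `-mavx`): the helper name `bloxset8_copy` is made up for illustration, and the `assert` only checks the alignment promise in debug builds. Whether the compiler then inlines the copy or still calls memcpy depends on the compiler version, the optimization flags, and whether the size is known at compile time, so it is worth inspecting the generated assembly.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;

/* Hypothetical helper: copies `count` structs while promising the compiler
 * that both pointers are 32-byte aligned. Passing a misaligned pointer
 * breaks that promise and is undefined behavior. */
static void bloxset8_copy(bloxset8_t *dst, const bloxset8_t *src, size_t count)
{
    /* Debug-only check of the promise made below. */
    assert(((uintptr_t)dst % 32) == 0);
    assert(((uintptr_t)src % 32) == 0);

    /* The builtin returns its argument, annotated as 32-byte aligned. */
    bloxset8_t *d = __builtin_assume_aligned(dst, 32);
    const bloxset8_t *s = __builtin_assume_aligned(src, 32);

    memcpy(d, s, count * sizeof *d);
}
```

The hint only gives the optimizer extra information; the copy produces the same result with or without it.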

Florian Weimer
  • Near duplicate of [perf report shows this function "\_\_memset\_avx2\_unaligned\_erms" has overhead. does this mean memory is unaligned?](https://stackoverflow.com/q/51614543), or at least related. I went into more detail there about how that specific glibc memset strategy works. – Peter Cordes Jul 31 '18 at 23:01