
If I have some class with a field like __m256i* loaded_v, and a method like:

void load() {
    loaded_v = &_mm256_load_si256(reinterpret_cast<const __m256i*>(vector));
}

For how long will loaded_v be a valid pointer? Since there are a limited number of registers, I would imagine that eventually loaded_v will refer to a different value, or some other weird behavior will happen. However, I would like to reduce the number of loads I do.

I'm writing a packed bit array class, and I would like to use AVX intrinsics to increase performance. However, it is inefficient to load my array of bits every time I do some operation (and, or, xor, etc.). Therefore, I would like to be able to explicitly call load() before performing some batch of operations. However, I don't understand how exactly AVX registers are handled. Could anyone help me out, or point me to some documentation for this issue?
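To make this concrete, here is a rough sketch of the interface I have in mind (all names are placeholders and the details are omitted); the idea is to call load() once and then run a batch of operations without reloading:

#include <immintrin.h>
#include <cstdint>

class PackedBitArray {
public:
    void load();                           // load the bits into an AVX register, somehow
    void and_with(const PackedBitArray&);  // batch operations that should reuse the loaded value
    void or_with(const PackedBitArray&);
    void xor_with(const PackedBitArray&);

private:
    alignas(32) std::uint64_t vector[4]{}; // 256 bits of packed storage
    __m256i* loaded_v = nullptr;           // the pointer this question is about
};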

multitaskPro
  • This code doesn't even compile, you are taking the address of an rvalue. – Marc Glisse Nov 07 '21 at 07:58
  • @MarcGlisse is the only option to store the entire __m256i value as a field in my class then? – multitaskPro Nov 07 '21 at 08:01
  • `__m256i` is not fundamentally different from `int` or `float`, the compiler still has to do register allocation for them. Load intrinsics are basically just there to communicate aligned or not, and as a cast. `__m256i` isn't register-only or anything like that, and accessing an `__m256i` from an array or struct will normally involve a load instruction unless the array optimizes away. [Understanding how the instrinsic functions for SSE use memory](https://stackoverflow.com/q/29147932) – Peter Cordes Nov 07 '21 at 16:42
  • Better near-duplicate: [SIMD Intrinsics and Persistent Variables/State](https://stackoverflow.com/q/48407505) – Peter Cordes Nov 07 '21 at 16:44

1 Answer


An optimizing compiler will use registers automatically.

It may put a __m256i variable into memory or into a register, or may use a register in one part of your code and spill it in another. This can happen not only with a standalone automatic-storage (stack) variable, but also with a member of a class, especially if the class instance is itself an automatic-storage variable.
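For instance, in a sketch like the following (hypothetical names, not your class), the compiler is free to keep the __m256i member in a ymm register across the whole loop, or to spill it, as it sees fit:

#include <immintrin.h>
#include <cstddef>

struct Accumulator {
    __m256i sum = _mm256_setzero_si256();

    void add(const void* p) {
        // Whether `sum` lives in a ymm register or in memory here is the compiler's decision.
        sum = _mm256_add_epi64(sum, _mm256_loadu_si256(static_cast<const __m256i*>(p)));
    }
};

__m256i sum_blocks(const unsigned char* data, std::size_t n_blocks) {
    Accumulator acc;                        // automatic storage: typically register-allocated
    for (std::size_t i = 0; i < n_blocks; ++i)
        acc.add(data + 32 * i);
    return acc.sum;
}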

When it is kept in a register, a __m256i variable corresponds to one of the ymm registers (one of 16 in x86-64, one of 8 in 32-bit compilation, or one of 32 in x86-64 with AVX-512); there is no need to refer to it indirectly.

The _mm256_load_si256 intrinsic doesn't necessarily compile to vmovdqa. For example, this code:

#include <immintrin.h>

__m256i f(__m256i a, const void* p)
{
    __m256i  b = _mm256_load_si256(reinterpret_cast<const __m256i*>(p));
    return _mm256_xor_si256(a, b);
}

compiles to the following (https://godbolt.org/z/ve67YPn4T):

        vpxor   ymm0, ymm0, YMMWORD PTR [rdx]
        ret     0

C and C++ are high-level languages; the intrinsics should be seen as a way to convey the semantics to the compiler, not as instruction mnemonics.

You should load the value into a variable:

__m256i loaded_v;
loaded_v = _mm256_load_si256(reinterpret_cast<const __m256i*>(vector));

or a temporary:

_mm256_whatever_operation(_mm256_load_si256(reinterpret_cast<const __m256i*>(vector)), other_operand);

And you should follow the usual C or C++ rules.

If you repeatedly load an indirect value through a pointer, it may be helpful to cache it in a variable, so that the compiler can see the value does not change between uses and treat that as an optimization opportunity. Of course, the compiler may miss this opportunity anyway, or find it even without the cached variable (possibly with the help of the strict aliasing rule).
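A minimal sketch of that idea for the bit-array case (invented names; using unaligned loads for simplicity): the two inputs are loaded once into locals and reused for every operation, so the compiler can keep them in ymm registers for the whole function.

#include <immintrin.h>
#include <cstdint>

void combine(std::uint64_t* dst, const std::uint64_t* src_a, const std::uint64_t* src_b)
{
    // One load per input; afterwards only the cached values are used.
    __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src_a));
    __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src_b));

    __m256i and_bits = _mm256_and_si256(a, b);   // reuse a and b without reloading
    __m256i xor_bits = _mm256_xor_si256(a, b);
    __m256i result   = _mm256_or_si256(and_bits, xor_bits);

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), result);
}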

Alex Guteniev
  • Thank you, this was very interesting and helpful. I'm still a bit worried about trusting the compiler to do such optimizations automatically, but I will check if these optimizations are happening first, before trying to handle such things explicitly – multitaskPro Nov 07 '21 at 08:43