Proper use of _mm256_maskload_ps for loading less than 8 floats into __m256

Question

I am having trouble wrapping my mind around which bits need to be set for masking using _mm256_maskload_ps.

The documentation states that the mask is the "integer value calculated based on the most-significant-bit of each doubleword of a mask register".

Parsing this out, I think that there are four 64-bit integers. I want to mask 8 values, so I can think of this as eight 32-bit integers (this is where my understanding gets shaky), each of which has its MSB reserved for the sign, 1 being negative and 0 being positive. So I could set -1 for "please load this" and 0 for "don't load this" for eight 32-bit integers and my mask should be correct. However, we actually have four 64-bit integers, so maybe I have to pack them?

Essentially I'm looking for a way to describe a mask such that the first 1, 2, 3, ..., 8 elements are set when I call _mm256_maskload_ps.

Note: What's interesting is that when my mask is {-1, 0, 0, 0} the first 2 elements get set. When my mask is {0xFFFFFFFF, 0, 0, 0} only the first element gets set.

#include <iostream>
#include <immintrin.h>
#include <string>

using namespace std;

int main()
{
  float a[3] {1,2,3};
  float b[3] {11, 22, 33};

  auto disp = [](float *arr) {
    cout << "[";
    string sep;
    for (size_t i = 0; i < 3; i++)
    {
      cout << sep << arr[i];
      sep = ", ";
    }
    cout << "]";
    cout << endl;
  };
  disp(a);
  disp(b);

  __m256 _a, _b;
  __m256i _load_mask = {-1, 0, 0, 0};


  _a = _mm256_maskload_ps(a, _load_mask);
  _b = _mm256_maskload_ps(b, _load_mask);
  _a = _mm256_add_ps(_a, _b);


  float c[8];
  _mm256_storeu_ps(c, _a);
  disp(c);

  return 0;
}

Displays

[1, 2, 3]
[11, 22, 33]
[12, 24, 0]

when compiled with

!clang++ -mavx -Wall -Wextra -std=c++17 -stdlib=libc++ -ggdb % -o $(basename -s .cpp %)

on my mac, where % is the filename

score 2 · Accepted Answer · answered Aug 29 '21 at 02:25

A doubleword is 32 bits, not 64: word = 16, doubleword = 32, quadword = 64. The first two elements get selected because -1 has all 64 bits set, so when the maskload treats it as two 32-bit values instead of one 64-bit value, the highest bit of both elements is set. 0xFFFFFFFF, on the other hand, has the least significant 32 bits set and the most significant 32 bits clear. Since x86 is little-endian, the least significant bits come first, which is why you end up with the first element selected but not the second.
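The split described above can be checked without any SIMD at all. The following is a minimal scalar sketch, assuming a little-endian target such as x86; `doubleword_selected` is a hypothetical helper written for illustration, not an intrinsic or part of any library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not an intrinsic): reports whether doubleword `i`
// (0..7, in memory order) of four 64-bit values has its most significant
// bit set. That is the bit _mm256_maskload_ps inspects for float element i.
bool doubleword_selected(const std::int64_t (&quads)[4], int i) {
    assert(i >= 0 && i < 8);
    std::uint32_t dwords[8];
    static_assert(sizeof(dwords) == sizeof(quads), "both views cover 256 bits");
    // Reinterpret the same 32 bytes as eight 32-bit values. On a
    // little-endian machine the low half of each 64-bit value comes first.
    std::memcpy(dwords, quads, sizeof(dwords));
    return (dwords[i] >> 31) != 0;
}
```

With `{-1, 0, 0, 0}` the helper reports doublewords 0 and 1 as selected; with `{0xFFFFFFFF, 0, 0, 0}` only doubleword 0 is, which matches the behaviour in the question.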

The documentation in the intrinsics guide is much better here.

Note that on GCC/clang, __m256i is implemented using vector extensions. MSVC, however, does not support vector extensions, so your code won't work there. Also, both GCC and clang define it as a vector of 64-bit values even though the same __m256i type is used for all integer vectors, so you'll probably want to use _mm256_set_epi32, _mm256_setr_epi32 or _mm256_load_si256 to create your _load_mask anyway.
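As a rough sketch of that suggestion, the question's example could build its mask with _mm256_setr_epi32, which takes one 32-bit value per float in memory order. The helper name `add_first_three` and the `AVX_TARGET` macro are assumptions made for this illustration; the macro only exists so the snippet also compiles on GCC/clang without -mavx.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: on GCC/clang this attribute lets the function use AVX even
// when the file is not compiled with -mavx; other compilers get an empty macro.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper: adds the first three floats of `a` and `b` and writes
// all eight result lanes to `out`. Lanes 3..7 are zero because masked-out
// elements are loaded as zero.
AVX_TARGET
void add_first_three(const float* a, const float* b, float* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    // One 32-bit mask element per float, listed in memory order.
    const __m256i mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, 0, 0, 0);
    const __m256 va = _mm256_maskload_ps(a, mask);
    const __m256 vb = _mm256_maskload_ps(b, mask);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}
```

Because the mask is spelled out as eight 32-bit elements, it does not depend on how a particular compiler defines __m256i.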

Oh, names starting with an underscore are reserved in both C and C++. Don't do that. You can use a trailing underscore if you really need to convey that it's an internal variable or something, but I don't really see a reason to do that in the code you've posted above.

"Oh, names starting with an underscore are reserved" - wrong, read that link again. When none of the other rules apply, "starts with an underscore" is only reserved in the global namespace, but the only symbols OP is introducing are in local scope. — o11c, Aug 29 '21 at 02:36
@o11c: For style reasons I'd still recommend against leading underscores for var names, *especially* when using Intel intrinsics so lots of type and function names start with `_`. If you have an array `a`, `va` is a decent name for a SIMD vector loaded from it, if there isn't a more semantically meaningful short name you can come up with. — Peter Cordes, Aug 29 '21 at 15:19

score 1 · Answer 2 · edited Aug 29 '21 at 15:27

The integers stored in your __m256i type are 64 bit integers. When you use -1, that sets all 64 bits to 1 (i.e. the first two 32-bit integers in _load_mask). Using 0xFFFFFFFF will only set 32 bits, resulting in the first integer having the MSB set while the second (and the other six) won't.

You shouldn't be initializing one of the YMM registers that way. (This is non-portable, as other compilers use unions for __m256i and other SSE/AVX types, and that aggregate initialization will initialize the first member of the union, which in MSVC's headers is an array of 8-bit integers.)

You should use the appropriate intrinsic for it, in this case:

static const int32_t mask_bits[8] = { -1, -1, 0, 0, 0, 0, 0, 0};
const __m256i load_mask = _mm256_loadu_si256((const __m256i*)mask_bits);

If you have AVX-512 support (AVX512F and AVX512VL), you can use _mm256_loadu_epi32 to avoid the cast.

See this answer for an explanation.
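To cover the question's "first 1, 2, 3, ..., 8 elements" case with one code path, a common variation is to load the mask from a sliding window over a constant array. This is a sketch of that idea and is not taken from the answers above; `load_first_n`, `window` and `AVX_TARGET` are hypothetical names, and the macro is only there so the snippet compiles on GCC/clang without -mavx.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: on GCC/clang this attribute lets the function use AVX even
// when the file is not compiled with -mavx; other compilers get an empty macro.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper: loads the first n floats (0 <= n <= 8) from `src`
// and writes eight lanes to `out`, zero-filling lanes n..7.
AVX_TARGET
void load_first_n(const float* src, int n, float* out) {
    assert(n >= 0 && n <= 8);
    // Eight all-ones elements followed by eight zeros. Starting the
    // 8-element window at index 8 - n leaves exactly n leading elements
    // with their most significant bit set.
    static const std::int32_t window[16] = {-1, -1, -1, -1, -1, -1, -1, -1,
                                             0,  0,  0,  0,  0,  0,  0,  0};
    const __m256i mask = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(window + 8 - n));
    // Masked-out elements are not read from memory and come back as zero.
    _mm256_storeu_ps(out, _mm256_maskload_ps(src, mask));
}
```

For the question's three-element arrays, `load_first_n(a, 3, c)` would fill c[0] through c[2] and zero the remaining lanes without reading past the end of `a`.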
