I have a loop that loads two `float*` arrays into `__m256` vectors and processes them. Following this loop, I have code that loads the balance of the values (the 0..7 leftover floats) into the vectors and then processes them, so the function imposes no alignment or size requirement on its input arrays.

Here is the code that loads the balance of the data into the vectors:

size_t constexpr            FLOATS_IN_M128              = sizeof(__m128) / sizeof(float);
size_t constexpr            FLOATS_IN_M256              = FLOATS_IN_M128 * 2;

...

assert(bal < FLOATS_IN_M256);

// zero-filled staging buffer: first half receives p_q_n, second half p_val_n
float ary[FLOATS_IN_M256 * 2];
auto v256f_q = _mm256_setzero_ps();
_mm256_storeu_ps(ary, v256f_q);
_mm256_storeu_ps(&ary[FLOATS_IN_M256], v256f_q);
float *dest = ary;
size_t offset{};

// copy the 0..7 leftover elements into the staging buffer
while (bal--)
{
    dest[offset] = p_q_n[pos];
    dest[offset + FLOATS_IN_M256] = p_val_n[pos];
    offset++;
    pos++;
}

// the two vectors that will be processed
v256f_q = _mm256_loadu_ps(ary);
v256f_val = _mm256_loadu_ps(&ary[FLOATS_IN_M256]);

When I use Compiler Explorer set to "x86-64 clang 16.0.0 -march=x86-64-v3 -O3", the compiler vectorizes and unrolls the loop when the `assert(bal < FLOATS_IN_M256);` line is present. However, `assert()` compiles to nothing in RELEASE mode (when NDEBUG is defined), so there the loop won't be vectorized and unrolled.

To test, I defined NDEBUG and the loop is no longer vectorized and unrolled.

I have tried adding the following in the appropriate places, but they don't work:

#pragma clang loop vectorize(enable)
#pragma unroll
#undef NDEBUG

The compiler should be able to see from the code before the snippet above that `bal < 8`, but it doesn't. How can I tell it this assertion is true when not in a DEBUG build?

  • It vectorizes only when you tell it the iteration count is smaller than one vector? More often I've wanted to make sure the compiler *didn't* try to vectorize "cleanup" loops for the leftover 0..n-1 elements, at least not with full-width vectors. (Different ways of writing the cleanup loop iteration count can help, e.g. deriving it from a modulo or bitmask like `len & -8` or something rather than `len - full_vector_iterations*8`) – Peter Cordes Jun 02 '23 at 19:10
  • Also, what final result is this getting, exactly? The last `bal` elements of an input zero-extended into full vectors? This looks like it would create a store-forwarding stall if compiled as-written, and be slower than an unaligned load and mask or just letting it overlap, or `vmaskmovps` if you need fault-suppression. [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933). Can you link on https://godbolt.org/ what asm you want clang to turn this into? – Peter Cordes Jun 02 '23 at 19:11
  • Yes, it vectorizes when I tell it the count is less than a vector; otherwise, it compiles as a loop. In the end, I went a different route: __m128 for >= 4 elements and a switch with fall-through for the potential 3 remaining elements. By far the shortest code and no loop. – IamIC Jun 04 '23 at 12:41
  • @PeterCordes Thank you for hinting at `_mm_maskload_ps`. I'd need to test if that could be faster for <= 3 elements (considering the need to create the mask) vs. my fallthrough switch statement. – IamIC Jun 04 '23 at 13:03
  • 1
    It's confusing to have a helper function called `mm256_movemask_ps` that's the inverse (mask to vector) of the intrinsic `_mm256_movemask_ps`. A better name like `bitmask_to_m256` would have been much more quickly readable. Also, I hope you noticed that [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) demonstrates a more efficient way to turn a shift-count into a vector mask with just a load of a 32-byte window from a 64-byte aligned array of constants. – Peter Cordes Jun 04 '23 at 14:18
  • Anyway yeah, a small switch is a good way to get a compiler to fully unroll and be sure it's non-looping. But the reload causes a store-forwarding stall. It's too bad SSE/AVX don't have very efficient ways to turn a count into a vector mask or shift/shuffle, like a variable-count integer `vpslldq` (byte shift), but you can do that with a load of an unaligned window from an array of constants. – Peter Cordes Jun 04 '23 at 14:22
  • Can't I simply use `_mm256_maskload_ps()` to load the remaining elements? I already coded this and it works, but as I understand it, the array must be aligned for this to work. To be very specific, if I have a `float*` pointing to 11 elements, I'm using `_mm256_loadu_ps()` for the first 8 and `_mm256_maskload_ps()` with mask 0b111 for the last 3. Is this acceptable? – IamIC Jun 04 '23 at 14:29
  • Re. `mm256_movemask_ps` I agree. I modeled that name after the AVX512 intrinsic, which is even more confusing. But I'm taking your suggestion. – IamIC Jun 04 '23 at 14:30
  • 1
    The AVX-512 intrinsic for `vpmovm2d ymm, k` is `_mm256_movm_epi32(__mmask8)`. I could see inventing a `mm256_movm_ps` version, but not `movemask`. But like I said, for helper functions that aren't real intrinsics for single instructions, I prefer *not* following the `_mm256_...` naming scheme. If readers have to check everything that looks like an intrinsic to see if it actually is one, that requires a lot more attention than just reading enough to differentiate existing intrinsics. – Peter Cordes Jun 04 '23 at 14:42
  • Yes, you can use `_mm256_maskload_ps`. Masked loads are decently efficient on all CPUs. You do still need to generate a vector mask from that integer bitmask, though, or from the integer count directly as an offset for a sliding window from an array of `{-1,-1,-1,..., 0,0,0,0, ...}`. – Peter Cordes Jun 04 '23 at 14:47
  • Other ways are possible, like `_mm256_loadu_ps` to load the last 8 floats of the array, and either re-process the overlapping 5, or mask them to zero (after generating a vector mask somehow), or a vector shuffle that shuffles the 3 elements you want to the bottom of the array. Doing a final possibly-overlapping final vector is most efficient for non-reductions where processing an element twice just means an overlapping store into separate output, which works fine as long as the total input size was at least 1 full vector. Else you need a maskstore which is very slow on AMD. – Peter Cordes Jun 04 '23 at 14:49
  • Will `_mm256_maskload_ps` care about an unaligned source? It definitely is faster than any other option, including the switch. Regarding `_mm256_loadu_ps`, if it loads past the array's end, surely that could trigger a fault? – IamIC Jun 04 '23 at 14:51
  • One of the major points of `_mm256_maskload_ps` doing fault suppression for masked elements is to allow unaligned loads that might cross into a new page. There's no alignment-required version of it. https://www.felixcloutier.com/x86/vmaskmov . Re: `_mm256_loadu_ps` - exactly, that's why you calculate a load address so the last element of `_mm256_loadu_ps` is the last element of the whole array, not going past it. If the array size isn't a multiple of 8, you'd overlap some *earlier* elements you've already loaded, instead of going outside the array. (Unless total size < 8 as I said.) – Peter Cordes Jun 04 '23 at 15:03
  • Then it seems using `_mm256_loadu_ps` with `_mm256_maskload_ps` where needed is wise. Regarding your question about whether I'd seen the more efficient way of converting a count to a mask, I followed the link and implemented `bitmap2vecmask()` but the result was different (I understood that the input changes to a count rather than the shifted bit mask). Did I use the new function you intended? – IamIC Jun 04 '23 at 15:09
  • From [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675) I guess you mean, using a variable-count shift to get inputs to the MSB of each dword? No, that takes a mask, it's answering the general question of mask to vector. For a count, you want either a load like in my VMASKMOVPS answer, or something like `_mm256_cmpgt_epi32( set1_epi32(count), setr_epi32(0, 1, 2, 3, 4, 5, 6, 7))` so e.g. the low element is set when `count > 0`, the second element is set when `count > 1` etc. – Peter Cordes Jun 04 '23 at 15:17
  • Mmm that looks much like what I have (which is from one of your answers a few years ago); bits-to-mask: v256i_mask = _mm256_setr_epi32(1L << 0, 1L << 1, 1L << 2, 1L << 3, 1L << 4, 1L << 5, 1L << 6, 1L << 7); v256i_x = _mm256_set1_epi32(k); v256i_x = _mm256_and_si256(v256i_x, v256i_mask); return _mm256_cmpeq_epi32(v256i_x, v256i_mask); – IamIC Jun 04 '23 at 15:23
  • Yes, if you want to waste a few instructions generating an integer mask like `0b00000111` from a shift count, then yeah you could use that version if you need all the bits set or clear in each element, not just the MSB for blends. Using the count directly saves some integer instructions, and after broadcasting an integer is just `vpcmpgtd` instead of `vpand` / `vpcmpeqd` – Peter Cordes Jun 04 '23 at 15:33
  • I get you. I can only deduce you are referring to the `bitmap2vecmask()` method on the page you referenced, right? – IamIC Jun 04 '23 at 15:44
  • But why would you want to do that in your code? I thought your only mask came from `(1ULL << length) - 1`? – Peter Cordes Jun 04 '23 at 15:44
  • I changed the inverse mask move to your suggested MSB-only code. I changed the call sight to pass the number of bits, not the shifted bits. The code stops working. inline __m256 mm256_cvtmask_ps(int const n) noexcept(true) { const __m256i vshift_count = _mm256_set_epi32(24, 25, 26, 27, 28, 29, 30, 31); __m256i bcast = _mm256_set1_epi32(n); __m256i shifted = _mm256_sllv_epi32(bcast, vshift_count); return _mm256_castsi256_ps(shifted); } – IamIC Jun 04 '23 at 15:50
  • An input of 2 on that function produces `0.00000000, -0.00000000, 2.00000000, 1.08420217e-19, 2.52435490e-29, 3.85185989e-34, ...` (as floats). I would expect `-0.0, -0.0, 0.0, 0.0...` My code requires the mask to reflect `(1 << length) - 1`. – IamIC Jun 04 '23 at 15:54
  • *I changed the call sight [site] to pass the number of bits, not the shifted bits* - You keep getting this backwards. All the functions in *[is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675)* take masks, not counts. That's why I suggested using something based on `cmpgt_epi32` instead because I thought you had a count. Look at how those functions work. `sllv_epi32` with that constant puts each of the low 8 bits of the input integer as the MSB of a vector element. It takes a mask, not a count. – Peter Cordes Jun 04 '23 at 15:57
  • If your `p_modes` is actually an arbitrary mask sometimes, not always set like `p_modes &= (1ULL << length) - 1;`, then yeah you need the `sllv` version. Otherwise you can use the `cmpgt` version with a count. `sllv` output isn't always +0.0 or -0.0, the point is that other bits can be arbitrary garbage from the rest of the mask, not wasting another instruction zeroing them or setting them the same as the MSB, so only look at the sign if you choose to display them as floats32. `blendv_ps` *only* looks at the MSB, i.e. the sign bit of a float. – Peter Cordes Jun 04 '23 at 16:01
  • I tried it both ways initially and again now. I passed in length = 2 (mask 0b11) and I get the first 2 MSBs set, as expected. So it works. But my code no longer produces the correct results. I'll have to explore more to see which part expects more than the MSB. – IamIC Jun 04 '23 at 16:05
  • Oh, I was misreading `p_modes &= (1ULL << length) - 1;` That's an `&=` not just `=`. I don't know what kind of mask inputs your caller passes for `p_modes`. If those aren't ultimately just based on a count, then you do need the `sllv` way or equivalent. (It's possible I got the vector constant backwards or something, or that your code uses bit-reversed masks.) – Peter Cordes Jun 04 '23 at 16:29
  • 1
    I figured out my mistake. Your function (well, both are from you) works fine. My code had a different place where I relied on all the bits being 1/0. For blends, I can use the faster one, as you state. So, in summary, I deleted the switch and replaced it with maskmove and use the faster inverse mask move where I can. – IamIC Jun 04 '23 at 16:33
  • `p_modes &= (1ULL << length) - 1;` is a protection against improper high bits. If those are inadvertently set, the function's result will be nonsense. Each bit represents a first-stage rank mode. The bit field length should match count. – IamIC Jun 04 '23 at 16:36
  • @PeterCordes I just realized that `_mm256_sllv_epi16` is an AVX512 intrinsic, but my CPU is AVX2. Oddly, it compiles under VS despite a _not found_ error with CLANG. So I don't know what the compiler is doing to make the code execute. – IamIC Jun 06 '23 at 23:01
  • I checked back to the answer you linked, and you have `vpsllvw` under "Skylake." Then I checked Agner Fog's instruction tables, and the op runs on Skylake-X and other AVX-512 CPUs only, as expected. So I am really curious as to what is happening in VS with CLANG when I do not have AVX512 enabled in the compiler and my CPU can't run that code. And yet I was able to run in debug and get the correct results. – IamIC Jun 06 '23 at 23:16
  • 1
    You're doing 32-bit chunks to match up with 32-bit floats, like my linked answer which uses `vpsllvd` (dword = 32-bit, word = 16-bit). The intrinsic for that is `_mm256_sllv_epi32`. The reason my answer recommended it for Skylake (not Haswell where AVX2 was new) is performance, not compatibility. It's a single uop on Skylake and later Intel. AVX2 has 32 and 64-bit granularity of variable-count shifts, AVX-512 is only required for 16-bit element size. Look at the CPUID feature flag column in https://www.felixcloutier.com/x86/vpsllvw:vpsllvd:vpsllvq – Peter Cordes Jun 06 '23 at 23:25
  • 1
    That's MSVC's "fault". It doesn't require you to enable instruction sets before using them. A [different design philosophy than GCC / clang](https://stackoverflow.com/a/55748439/224132) concerning ISA options, and connected to the fact that it can't optimize intrinsics into different instructions but GCC and clang can. https://godbolt.org/z/3EnbddbKT shows GCC and clang `-march=x86-64-v3` both compiling `_mm256_sllv_epi32` just fine. See also [What exactly do the gcc compiler switches (-mavx -mavx2 -mavx512f) do?](https://stackoverflow.com/q/71229343) – Peter Cordes Jun 06 '23 at 23:44
  • I find MSVC useful only for DEBUG (faster compile than CLANG). But I use CLANG for RELEASE code. MSVC's output tends to look terrible next to CLANG, yet it is inexplicably faster for some functions, considerably so in certain cases (vectorized code). Overall, it's slower, though, but only slightly. So I'd say it can implement impressive optimizations in certain cases. I have read it claims to be able to do the optimization you say it can't. – IamIC Jun 07 '23 at 00:04
  • 1
    Interesting, I know I've seen cases where MSVC didn't constant-propagate or optimize through some intrinsics, maybe more complicated ones like shuffles. But for simple pure-vertical ones like addition and shift, yes it can do some optimizations. https://godbolt.org/z/xdx5MnKoe shows clang vs. GCC vs. MSVC code-gen. Clang is the most aggressive at rewriting intrinsics, just like scalar `+` isn't always an `add` instruction, it can turn `slli` / `srli` shifts into masking off the low and/or high bits of each element with a `vpand` instruction, but GCC and MSVC miss that optimization. – Peter Cordes Jun 07 '23 at 00:26
  • 1
    But that example also shows MSVC being more literal, like `_mm256_sllv_epi32(v, set1(3))` compiling to use a vector constant instead of an immediate `3` for all elements. (GCC also misses that optimization; it's hopefully not high value the constants are hidden behind templates and helper functions, so maybe the same code can sometimes inline into callers where they're all the same.) But GCC (and clang) can turn a `_mm_mullo_epi32(v, set1(8))` into a shift by 3, much cheaper, which MSVC misses. – Peter Cordes Jun 07 '23 at 00:29
  • All these optimizations were to the same ISA feature level, all AVX2. In clang, if you enable AVX2, it can potentially optimize multiple AVX1 shuffles into one AVX2 shuffle. Clang is the only compiler with a shuffle optimizer that tracks vector elements through shuffles to see what's actually happening, and re-invents its own shuffles. Often that's good, but not if you were trying to micro-optimize for a specific uarch and clang's choice isn't what's optimal. This often lets clang optimize away the zero-extension in `_mm_set_ps(float)` if you later only use the low element (e.g. to merge). – Peter Cordes Jun 07 '23 at 00:32
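
Putting the pieces from this comment thread together, here is a minimal sketch of the two mask-building approaches discussed above (count-based `vpcmpgtd` and bitmask-based `vpsllvd`) feeding `_mm256_maskload_ps` for the leftover elements. The helper names and the `load_tail` wrapper are made up for illustration, not taken from the posted code; they assume AVX2 and a remainder count in the 0..8 range.

#include <immintrin.h>
#include <cstddef>

// Count -> per-lane mask: lane i is all-ones when count > i, so every bit of
// each selected element is set (usable beyond just blends).
inline __m256i mask_from_count(int count) noexcept
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    return _mm256_cmpgt_epi32(_mm256_set1_epi32(count), lane);
}

// Bitmask -> per-lane mask: bit i of the integer becomes the MSB of lane i.
// Only the sign bit is meaningful, which is all blendv/maskload/maskstore read.
inline __m256i mask_from_bitmask(int bits) noexcept
{
    const __m256i shifts = _mm256_setr_epi32(31, 30, 29, 28, 27, 26, 25, 24);
    return _mm256_sllv_epi32(_mm256_set1_epi32(bits), shifts);
}

// Tail load: 0..7 leftover floats, with fault suppression for masked-off lanes,
// so it is safe even when a full 32-byte load would cross into an unmapped page.
inline __m256 load_tail(float const *p, std::size_t remaining) noexcept
{
    return _mm256_maskload_ps(p, mask_from_count(static_cast<int>(remaining)));
}

Compared with the zero-filled staging buffer in the question, this should avoid the store-forwarding stall Peter Cordes mentions above, at the cost of building one vector mask.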

1 Answer

You can use `__builtin_assume` to give the compiler constraint information that is not explicitly in the code. It is a Clang builtin; GCC doesn't provide `__builtin_assume`, but you can get the same effect there with `if (!(cond)) __builtin_unreachable();` (or the C++23 `[[assume(cond)]]` attribute).

In the posted code, just replace the `assert` with `__builtin_assume(bal < FLOATS_IN_M256);`.
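
For example, a minimal sketch (the `ASSUME` macro name is just an illustration, not from the posted code) that keeps the runtime check in DEBUG builds and turns it into an optimizer hint when NDEBUG is defined:

#include <cassert>

// Debug builds keep the runtime check; release builds (NDEBUG defined) turn the
// same condition into a hint the optimizer can rely on.
#ifdef NDEBUG
    #define ASSUME(cond) __builtin_assume(cond)
#else
    #define ASSUME(cond) assert(cond)
#endif

// in the remainder-handling code from the question:
ASSUME(bal < FLOATS_IN_M256);

This should give the release build the same `bal < 8` knowledge that the active `assert` gives the debug build.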
