I have an array of floats and an array of booleans, where all of the floats with corresponding true values in the boolean array need to be summed together. I thought about using _mm256_maskload_pd to load each vector of floats before adding it to an accumulator, then doing a horizontal sum at the end. However, I'm not sure how to turn the boolean array into the __m256i mask type this operation requires.

I'm very new to working with SIMD/AVX, so I'm not sure if I'm going off in an entirely wrong direction. Using AVX-512 is fine too, but I didn't find enough instructions that seemed useful for this.

My goal (without SIMD) is this (Rust code, but for answers I'm more interested in the intrinsics, so C/C++ is fine):

let mask: &[bool] = ...;
let floats: &[f64] = ...;

let mut sum = 0.0;
for (&val, &cond) in floats.iter().zip(mask.iter()) {
    if cond {
        sum += val;
    }
}
  • AVX512 makes it trivial; masking is a core part of AVX-512 and can use a bit-vector directly. AVX has to expand the mask to a vector for use with `maskload`, or `vandpd` or whatever. (`_mm256_maskload_pd` is probably a good option) – Peter Cordes Jul 29 '20 at 21:35
  • Which intrinsics would you use for "masking it directly"? And I guess the expanding the mask to a vector part is what I was confused about. – Zachary Matson Jul 29 '20 at 21:43
  • I think this is a duplicate of [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675) for the AVX/AVX2 part. – Peter Cordes Jul 29 '20 at 21:45
  • With AVX512, you read `__mmask8` values from the `mask` array for use with [`_mm512_maskz_loadu_pd( __mmask8 k, void * s)`](https://www.felixcloutier.com/x86/movupd), i.e. chunks of 8 bits in 1 byte. Or a merge-masking `vaddpd` after a plain load, which might help the compiler fold the memory operand. Or do you have an actual unpacked array of `bool` with one `bool` per byte? If so, don't do that for AVX512! That's easier to unpack with AVX2, but less efficient for AVX-512. – Peter Cordes Jul 29 '20 at 21:48 (a sketch of this AVX-512 approach follows after these comments)
  • I do have an actual unpacked array of one `bool` per byte but I didn't know that about AVX512, I guess I should change that! – Zachary Matson Jul 29 '20 at 21:50
  • And yes, you'd use a vector accumulator and hsum at the end. Or for a larger array, use multiple accumulators to hide the latency bottleneck. e.g. for large arrays that are hot in L1d cache, 8 vector accumulators can theoretically keep up with `vaddpd zmm0{k1}, zmm0, [rdi]` 4c latency, 0.5c throughput to do 2 vector loads+adds per clock. (Although in practice you'll bottleneck on port 5 on moving data into mask registers :/) – Peter Cordes Jul 29 '20 at 21:50
  • Yup, creating packed masks is very easy with x86 SIMD (`movmskps`/pd and `pmovmskb` have existed since SSE1), and with AVX-512 using them becomes efficient as well. – Peter Cordes Jul 29 '20 at 21:51
  • I will take a look at all of that. Thanks for all the help/suggestions! – Zachary Matson Jul 29 '20 at 21:51
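
The comments above outline an AVX-512 route; below is a minimal sketch of it, assuming the mask is already packed into bits (one byte of mask per 8 doubles, least significant bit first) and that `count` is a multiple of 8. `conditionalSumAvx512` is just an illustrative name; remainder handling and the multiple accumulators mentioned above are omitted for brevity.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Sum source[ i ] for every i whose bit is set in the packed mask.
// Requires AVX-512F; compile with e.g. -mavx512f.
double conditionalSumAvx512( const double* source, const uint8_t* maskBytes, size_t count )
{
    __m512d acc = _mm512_setzero_pd();
    for( size_t i = 0; i < count; i += 8 )
    {
        // 8 mask bits cover one vector of 8 doubles
        const __mmask8 k = maskBytes[ i / 8 ];
        // Zero-masking load: lanes with a 0 bit become 0.0, so a plain add is enough
        const __m512d vals = _mm512_maskz_loadu_pd( k, source + i );
        acc = _mm512_add_pd( acc, vals );
        // Alternative: merge-masked add after a plain load, which may let the compiler fold the memory operand
        // acc = _mm512_mask_add_pd( acc, k, acc, _mm512_loadu_pd( source + i ) );
    }
    return _mm512_reduce_add_pd( acc );
}

A single accumulator keeps the sketch short; as the comments note, several independent accumulators are needed to hide the vaddpd latency on large inputs.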

1 Answer

Here’s an AVX2 version in C++17, untested.

#include <immintrin.h>
#include <stdint.h>
#include <array>
#include <assert.h>
using Accumulators = std::array<__m256d, 4>;

// Conditionally accumulate 4 lanes, with source data from a register
template<int i>
inline void conditionalSum4( Accumulators& acc, __m256d vals, __m256i broadcasted )
{
    // Unless this is the first register in the batch, shift the mask values by a multiple of 4 bits
    if constexpr( i > 0 )
        broadcasted = _mm256_srli_epi64( broadcasted, i * 4 );
    // Bits 0-3 in 64-bit lanes of `broadcasted` correspond to the values being processed

    // Compute a mask from the lowest 4 bits of `broadcasted`; each lane tests a different bit
    const __m256i bits = _mm256_setr_epi64x( 1, 2, 4, 8 );
    __m256i mask = _mm256_and_si256( broadcasted, bits );
    mask = _mm256_cmpeq_epi64( mask, bits );    // Expand the bits into full qwords

    // Bitwise AND to zero out the masked-out lanes: integer zero == double 0.0
    // BTW, if your mask has 1 for values you want to ignore, use _mm256_andnot_pd instead
    vals = _mm256_and_pd( vals, _mm256_castsi256_pd( mask ) );

    // Accumulate the result
    acc[ i ] = _mm256_add_pd( acc[ i ], vals );
}

// Conditionally accumulate 4 lanes, with source data from memory
template<int i>
inline void conditionalSum4( Accumulators& acc, const double* source, __m256i broadcasted )
{
    constexpr int offset = i * 4;
    const __m256d vals = _mm256_loadu_pd( source + offset );
    conditionalSum4<i>( acc, vals, broadcasted );
}

// Conditionally accumulate lanes from memory, for the last potentially incomplete vector
template<int i>
inline void conditionalSumPartial( Accumulators& acc, const double* source, __m256i broadcasted, size_t count )
{
    constexpr int offset = i * 4;
    __m256d vals;
    __m128d low, high;
    switch( count - offset )
    {
    case 1:
        // Load a scalar, zero out other 3 lanes
        vals = _mm256_setr_pd( source[ offset ], 0, 0, 0 );
        break;
    case 2:
        // Load 2 lanes
        low = _mm_loadu_pd( source + offset );
        high = _mm_setzero_pd();
        vals = _mm256_setr_m128d( low, high );
        break;
    case 3:
        // Load 3 lanes
        low = _mm_loadu_pd( source + offset );
        high = _mm_load_sd( source + offset + 2 );
        vals = _mm256_setr_m128d( low, high );
        break;
    case 4:
        // The count of values was multiple of 4, load the complete vector
        vals = _mm256_loadu_pd( source + offset );
        break;
    default:
        assert( false );
        return;
    }
    conditionalSum4<i>( acc, vals, broadcasted );
}

// The main function implementing the algorithm.
// The maskBytes argument is a densely packed mask with 1 bit per double; its size must be ( count + 7 ) / 8 bytes.
// Each byte of the mask packs 8 boolean values; the first value of each byte is in the least significant bit.
double conditionalSum( const double* source, const uint8_t* maskBytes, size_t count )
{
    // Zero-initialize all 4 accumulators
    std::array<__m256d, 4> acc;
    acc[ 0 ] = acc[ 1 ] = acc[ 2 ] = acc[ 3 ] = _mm256_setzero_pd();

    // The main loop consumes 16 scalars, and 16 bits of the mask, per iteration
    for( ; count >= 16; source += 16, maskBytes += 2, count -= 16 )
    {
        // Broadcast 16 bits of the mask from memory into AVX register
        // Technically, the C++ standard says casting pointers like that is undefined behaviour.
        // Works fine in practice; alternatives exist, but they compile into more than 1 instruction.
        const __m256i broadcasted = _mm256_set1_epi16( *( (const short*)maskBytes ) );

        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSum4<1>( acc, source, broadcasted );
        conditionalSum4<2>( acc, source, broadcasted );
        conditionalSum4<3>( acc, source, broadcasted );
    }

    // Now the hard part, dealing with the remainder
    // The switch argument is count of vectors in the remainder, including incomplete ones.
    switch( ( count + 3 ) / 4 )
    {
    case 0:
        // Best case performance-wise: the count of values was a multiple of 16
        break;
    case 1:
    {
        // Note we load a single byte from the mask instead of 2 bytes. Same for case 2.
        const __m256i broadcasted = _mm256_set1_epi8( (char)*maskBytes );
        conditionalSumPartial<0>( acc, source, broadcasted, count );
        break;
    }
    case 2:
    {
        const __m256i broadcasted = _mm256_set1_epi8( (char)*maskBytes );
        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSumPartial<1>( acc, source, broadcasted, count );
        break;
    }
    case 3:
    {
        const __m256i broadcasted = _mm256_set1_epi16( *( (const short*)maskBytes ) );
        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSum4<1>( acc, source, broadcasted );
        conditionalSumPartial<2>( acc, source, broadcasted, count );
        break;
    }
    case 4:
    {
        const __m256i broadcasted = _mm256_set1_epi16( *( (const short*)maskBytes ) );
        conditionalSum4<0>( acc, source, broadcasted );
        conditionalSum4<1>( acc, source, broadcasted );
        conditionalSum4<2>( acc, source, broadcasted );
        conditionalSumPartial<3>( acc, source, broadcasted, count );
        break;
    }
    }

    // At last, compute sum of the 16 accumulated values
    const __m256d r01 = _mm256_add_pd( acc[ 0 ], acc[ 1 ] );
    const __m256d r23 = _mm256_add_pd( acc[ 2 ], acc[ 3 ] );
    const __m256d sum4 = _mm256_add_pd( r01, r23 );
    const __m128d sum2 = _mm_add_pd( _mm256_castpd256_pd128( sum4 ), _mm256_extractf128_pd( sum4, 1 ) );
    const __m128d sum1 = _mm_add_sd( sum2, _mm_shuffle_pd( sum2, sum2, 0b11 ) );
    return _mm_cvtsd_f64( sum1 );
}

A couple of interesting points.

I unrolled the loop by 16 values and used 4 independent accumulators. This increases throughput because of pipelining, and the loop exit check runs less often, i.e. more instructions are spent computing useful stuff. It also reduces how often the mask values are broadcast: moving data from scalar to vector registers has some latency. Note I only load from the mask once per 16 elements, reusing the vector by bit-shifting. It also improves precision: when you add a small float value to a large float value, precision is lost, and 16 scalar accumulators help with that.

Correctly handling the remainder, without moving data from registers to memory and back, is complicated; it needs partial loads and such.

If you move the integer value from a template argument into a normal integer argument, the code will likely stop compiling; the compiler will say something like “expected constant expression”. The reason is that many SIMD instructions, including _mm256_srli_epi64, encode constants as part of the instruction, so the compiler needs to know these values at compile time. Another thing: the array index needs to be constexpr, or the compiler will evict the array from 4 registers into RAM so it can do pointer math when you access the values. The accumulators need to stay in registers all the time, or performance will drop by a huge factor; even the L1D cache is much slower than registers.

Here’s the output of gcc. The assembly seems reasonable: the compiler has successfully inlined everything, and the only memory access in the main loop is for the source values. The main loop is below the .L3 label there.
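
Since the question starts from an unpacked array with one bool per byte, here is one possible way to produce the packed bit mask this function expects; the packMask helper below is only an illustrative scalar sketch, not part of the algorithm above.

#include <vector>

// Pack one-bool-per-byte into one-bit-per-double, least significant bit first,
// matching the layout conditionalSum() documents for maskBytes.
static std::vector<uint8_t> packMask( const bool* mask, size_t count )
{
    std::vector<uint8_t> packed( ( count + 7 ) / 8, 0 );
    for( size_t i = 0; i < count; i++ )
        if( mask[ i ] )
            packed[ i / 8 ] |= uint8_t( 1u << ( i % 8 ) );
    return packed;
}

// Usage:
// const std::vector<uint8_t> packed = packMask( boolArray, count );
// const double sum = conditionalSum( floats, packed.data(), count );

A scalar packing loop is usually cheap next to the sums themselves; the comments mention movmskpd / pmovmskb if you want to vectorize this step as well.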

Soonts
  • `_mm256_blendv_pd` costs 2 uops, and puts the blend operation on the critical path (loop carried dep chain). Normally better to just maskload or something so you add `0.0` to elements where the mask is false. At worst, blend with `0.0` before the add, so it's off the critical path, and part of preparing an input vector from memory, but I think a `_mm256_maskload_pd` (`vmaskmovpd`) is more efficient on most CPUs. Only 2 uops total including the load on Skylake, 1 on Zen2. – Peter Cordes Jul 30 '20 at 13:53 (a sketch of the maskload route follows after this thread)
  • @PeterCordes vmaskmovpd latency is huge on AMD, 50 cycles. Vector blend’s 2 uops is probably faster than equivalent multiple instructions, you gonna need two, vpcmpeqq and vpand. – Soonts Jul 30 '20 at 13:59
  • https://www.uops.info/html-lat/ZEN2/VMASKMOVPD_YMM_YMM_M256-Measurements.html shows that on Zen2 it's <= 9 cycles. You're right about Zen1, though; it's very bad there, like 24 uops and up to 17 cycle latency; I hadn't enabled that column on https://www.uops.info/table.html?search=maskmovpd&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_avx=on until just now. But anyway, if you want to avoid maskload, then you can still blend with zeros off the critical path. – Peter Cordes Jul 30 '20 at 14:03
  • Also, instead of `_mm256_sllv_epi64` + `blendv`, you might just VPAND / VPCMPEQQ to generate a mask, and use it with VANDPD. (And still right-shift the vector to line up with a fixed 0, 1<<1, 1<<2, 1<<3 mask vector). Or maybe bit-reverse 64-bit chunks of mask once outside the loop so can hoist an `sllv` 0,1,2,3. Or just hoist an srlv 0,1,2,3 and use `srli` + `slli` 63 inside the loop? Better on Haswell where `sllv` is multiple uops. – Peter Cordes Jul 30 '20 at 14:10
  • You could gain a bit of ILP that probably won't matter (but might help with catching up from a cache-miss on the mask array): Make the loop-carried `broadcasted` dep chain for `_mm256_srli_epi64` shift 3 different ways (4, 8, 12), updating the original only with the last one. So the last group of 4 bits is only 64/16 = 4 shifts away instead of 64/4 = 16 shifts. OTOH there's more than 1 cycle worth of ALU work to do after each shift result is ready, so it probably won't be the bottleneck ever. – Peter Cordes Jul 30 '20 at 14:48
  • And you'd have to pull the shifting out of the template. But then you could pull the array indexing, too, and make it not templated, just using the low 4 bits of an input mask vector. So then all the mask reload and shift logic could be in one place. – Peter Cordes Jul 30 '20 at 14:48
  • Oh nvm, you're only loading 16 total, not 64, that's why it's fine to only shift by a total of 3*4 = 12 bits, not 16. That's hard for a compiler to do efficiently; the narrowest zero-extending SIMD load is 32 bit `movd`. Also, `*( (const uint16_t*)maskBytes )` violates strict-aliasing UB. You could typedef a 16-bit `may_alias` type... Or really I'd recommend doing a 32-bit broadcast so the compiler can use `vpbroadcastd` with a memory source, grabbing high bits even though you don't care about them. (Then might as well use all 32, with an outer loop or something.) – Peter Cordes Jul 30 '20 at 14:55
  • @PeterCordes I think I’m already doing that? `broadcasted = _mm256_srli_epi64( broadcasted, i * 4 );` code writes to a temporary variable that’s discarded when control leaves the function. It’s passed by value not by reference. The compiler actually did what I wanted, see lines 44-45 in assembly, it’s shifting by 8 and 12 bits in sequential instructions. – Soonts Jul 30 '20 at 14:56
  • Oh yes, right it is `i*4` shift count. I had confirmation bias from one wrong assumption early on (that you were loading 64 bits of mask with an efficient 64-bit broadcast instead of this inefficient 16-bit thing), and didn't scroll right to check the arg declaration in the SO code block. I just made assumptions about the caller based on mostly looking at the template function. Anyway, notice the nasty `movzx` into `r9d`, then `vmovq` into xmm0, then `vpbroadcastq`. Instead of one broadcast-load from memory which is a single uop handled by a load port (for dword and larger; vpbroadcastw=2) – Peter Cordes Jul 30 '20 at 14:58
  • @PeterCordes Indeed, updated once more. BTW broadcasting bytes and words is 1 µop on AMD. I have Zen 2 in my main PC here, so I have a bias :-) – Soonts Jul 30 '20 at 15:17
  • `*( (const short*)maskBytes )` is even worse than `uint16_t`: now you have to sign-extend. (On some AMD CPUs, `movzx` is a pure load but `movsx` needs an ALU execution unit as well as a load.) And does nothing to avoid the strict-aliasing UB. I think you changed that for consistency with the partial-vector stuff, but those should probably all use `uint16_t` or `uint8_t`. – Peter Cordes Jul 30 '20 at 15:21
  • To avoid UB, you could use `memcpy`, or a GNU C `__attribute__((may_alias))` typedef version of `uint16_t`. [How to implement "\_mm\_storeu\_epi64" without aliasing problems?](https://stackoverflow.com/posts/comments/108065664) / and [this](//stackoverflow.com/a/57676035). Or you could use a 32-bit load intrinsic to load straight into a vector reg and give compilers a chance to optimize into an efficient broadcast-load. (Although then you could over-read the mask array, which could be dangerous if it ends at a page boundary. Unless you just unroll more to use all 32 bits.) – Peter Cordes Jul 30 '20 at 15:24
  • @PeterCordes No, it’s fine. Look at the disassembly, compiles into `vpbroadcastw ymm0, WORD PTR [rsi]` – Soonts Jul 30 '20 at 15:24
  • I don’t care too much about what C++ language designers consider UB. The code only builds on AMD64 anyway, no reasons to worry what gonna happen on other platforms. – Soonts Jul 30 '20 at 15:26
  • Yes, when not inlined of course it compiles fine; it has to work for the case where `const uint8_t* maskBytes` actually points to `uint16_t` objects. But unless all other code in your program stores to it that way, you'll have UB. Undefined Behaviour doesn't mean "guaranteed to fail", it means "anything can happen" including working like you wanted it to with *some* surrounding code, but breaking with others. The failure mode could be the compiler assuming that a different-sized store to the array is independent of loads in this loop. (`char` being a may_alias type makes that less likely...) – Peter Cordes Jul 30 '20 at 15:27
  • It's definitely possible to construct cases where strict-aliasing UB makes GCC for AMD64 compile a program differently from what you expect. (e.g. [Why compilers no longer optimize this UB with strict aliasing](https://stackoverflow.com/q/34527729) shows some older versions of GCC apply it "blindly"). The most efficient solution would be a dword load intrinsic and using all 32 bits. Another solution would be to punt the problem to the caller by declaring the arg as `const uint16_t*`. The OP actually wants Rust, and I don't know what Rust has for type-punning. – Peter Cordes Jul 30 '20 at 15:31
  • @PeterCordes With dword loads you’ll have the same UB when casting pointers from bytes. With larger pointers on input the API becomes less intuitive IMO. Logically, the pointer is to a sequence of bits. A byte is the minimum addressable unit of data; with uint16_t you're gonna need to specify byte order and padding in some API documentation. – Soonts Jul 30 '20 at 15:46
  • Not if you use a dword load *intrinsic* that is aliasing-safe, like `_mm_loadu_si32` or `_mm_load_ss`. Despite taking a `float*`, `_mm_load_ss` treats it as a special may-alias float. See my comment in [How to implement "\_mm\_storeu\_epi64" without aliasing problems?](https://stackoverflow.com/posts/comments/108065664) - https://godbolt.org/z/sdKENn is a work-in-progress mess, but contains some experiments that show the effect of making aliasing not UB (blocking dead store elimination, because it's not dead if a load might be reloading it). – Peter Cordes Jul 30 '20 at 15:59
  • The situation around strict-aliasing-safe loads with Intel intrinsics is a mess, though, as far as such intrinsics all being portable across all compilers. Hopefully Rust does a better job with a fresh start. – Peter Cordes Jul 30 '20 at 16:02
  • @PeterCordes Using these intrinsics will fix UB at the cost of unneeded instructions. The current code in my answer, with `_mm256_set1` intrinsics, does broadcasted loads, 1 instruction which on AMD decodes into 1 µop. As for Rust I don’t have high expectations, I’m not sure it can possibly do any better than llvm underneath it. I don’t think it can abstract away these SIMD shenanigans without also losing performance of the compiled code. – Soonts Jul 30 '20 at 16:20
  • Oh, I didn't notice you'd changed to `_mm256_set1_epi16` instead of epi64x. Yes, then you can broadcast from memory, and is probably worth the minor risk, especially on AMD. (Worth a comment, though.) And yes, most compilers aren't good enough to optimize `_mm_loadu_si32` + broadcast into a memory-source `vpbroadcastd`, so the poor design of Intel intrinsics (lacking in narrow memory-source intrinsics) is a problem even for dword, and nearly unfixable for word (16-bit) broadcasts if you want safety. – Peter Cordes Jul 30 '20 at 16:26
  • Rust could portably expose aliasing-safe narrow broadcasts and loads, and then LLVM's back-end could optimize them the same way it does current when you manage to jump through GNU C hurdles to communicate that from C source using GNU extensions. A better intrinsics API that included portable aliasing-safe loads would fix basically everything in C, too, avoiding the need to do clunky stuff that doesn't get optimized away. – Peter Cordes Jul 30 '20 at 16:27
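
For completeness, a minimal sketch of the maskload route from the first comment in this thread, assuming the unpacked one-bool-per-byte array from the question (and sizeof(bool) == 1); maskedLoad4 is a hypothetical helper, not code from the answer.

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Load 4 doubles, zeroing the lanes whose bool is false.
// vmaskmovpd only looks at the sign bit of each 64-bit mask lane.
static inline __m256d maskedLoad4( const double* src, const bool* flags )
{
    uint32_t four;
    memcpy( &four, flags, 4 );  // aliasing-safe 4-byte load, per the discussion above
    const __m128i bytes = _mm_cvtsi32_si128( (int)four );
    const __m256i wide = _mm256_cvtepu8_epi64( bytes );  // 4 bytes -> 4 qwords
    const __m256i mask = _mm256_cmpgt_epi64( wide, _mm256_setzero_si256() );  // nonzero -> all-ones
    return _mm256_maskload_pd( src, mask );
}

// Usage inside a summing loop (remainder handling omitted):
// acc = _mm256_add_pd( acc, maskedLoad4( source + i, flags + i ) );

Whether this beats unpacking a packed bit mask depends on the CPU; as discussed above, vmaskmovpd is cheap on recent Intel and Zen 2 but painful on Zen 1.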