-1

Given a mask and a value, the mask covers the value if all bits from the value fall into the mask.

For example:

mask:  0b011010
value: 0b010010

true

or

mask:  0b011010
value: 0b010110

false

For int arr[arr_size], I need to calculate how many elements of the array are covered by the given mask.

My code:

int count = 0;

for (int index = 0; index < arr_size; index++)
{
    // note: always true because of operator precedence 
    if (arr[index] | mask == mask)
        count++;
}

or (slower)

int count = 0;

for (int index = 0; index < arr_size; index++)
{
    // note: also broken because of operator precedence, but not always true
    if (arr[index] & mask == arr[index])
        count++;
}

My program very often needs to calculate the number of such array elements.

Can you tell me if there is any way to speed up such calculations? For example using SSE, AVX instructions.


P.S.

My code is turned into 5 instructions by the compiler (with optimizations enabled), but maybe you should use group instructions and this will give an additional speed gain


P.P.S. minimal code:

constexpt block_size = 16;
int arr[] = {rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), rand(), }; // random values for example
int mask = rand(); 

int count = 0;

for (__int64 cycles = 0; cycles < 0xFFFFFFFF; cycles ++)
{
    for (int index = 0; index < block_size; index ++)
    {
        if (arr[index] | mask == mask)
            count++;
    }
}
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Zhihar
  • 1,306
  • 1
  • 22
  • 45
  • 4
    Are you sure `arr[index] & mask == arr[index]` will do what you want? (see [c++ - bitwise AND with if statement - Stack Overflow](https://stackoverflow.com/questions/11074739/bitwise-and-with-if-statement)) – MikeCAT Jan 31 '21 at 15:38
  • 2
    How about [`std::count_if`](https://en.cppreference.com/w/cpp/algorithm/count)? – Some programmer dude Jan 31 '21 at 15:40
  • 1
    You might try `count += arr[index] | mask == mask;` but after compiler optimisation I guess same result will be obtained – Damien Jan 31 '21 at 15:51
  • @Damien, does not increase the speed, the compiler already minimizes into such code (even into a slightly more efficient code) – Zhihar Jan 31 '21 at 15:56
  • @Some programmer dude, checked, does not increase the speed – Zhihar Jan 31 '21 at 15:56
  • @Ted Lyngmo,I use `__popcnt` in other situations, but in this situation it only reduces the speed of computation – Zhihar Jan 31 '21 at 16:00
  • `if (arr[index] | mask == mask)` can be optimized to `if(true)` (assuming accessing `arr[index]` does not cause side effects). Before benchmarking/optimizing make sure your code actually does what you want it to do. – chtz Jan 31 '21 at 16:04
  • @MikeCAT, I checked ANDN, it does not give an increase in speed, you still need to use group operations on several – Zhihar Jan 31 '21 at 16:05
  • @chtz, my code really does exactly what I need it to do :) checked. Of course, it may not be optimal, that's why I ask this question - how to calculate how many elements triggers the mask – Zhihar Jan 31 '21 at 16:06
  • If you can use C++17 (or later) you could use `std::count_if` with a suitable [execution policy](https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t). – Some programmer dude Jan 31 '21 at 17:27
  • 1
    @Zhihar Make a [mre] We clearly don't have the full picture. – Ted Lyngmo Jan 31 '21 at 18:10
  • @Ted Lyngmo, added minimal example – Zhihar Jan 31 '21 at 18:17
  • Make something that compiles. Also, this code will probably be optimized to a no-op since you dont use or store `count`. – Ted Lyngmo Jan 31 '21 at 18:24
  • Also, do you really want to count the value 0 as a match? – Ted Lyngmo Jan 31 '21 at 18:30
  • Pretty similar to how you'd do [How to count character occurrences using SIMD](https://stackoverflow.com/q/54541129), except you only have 4x 64-bit elements to hsum instead of 32x 8. (And you end up with vectors of 64-bit counters so early overflow isn't a worry like with 8-bit elements). `_mm256_cmpeq_epi64` / `_mm256_sub_epi64` after your choice of bitwise booleans, preferably `v[i] | mask == mask` so array data is only needed once, and can be a memory source operand for `vpor`. – Peter Cordes Jan 31 '21 at 20:47
  • 2
    @Zhihar Even without `-Wall` clang warns about a tautological expression: https://godbolt.org/z/oWeb63. Also, a [mre] shall include everything necessary to compile the code (like `#include`s and function signatures). – chtz Jan 31 '21 at 22:10
  • Also note that clang auto-vectorizes nicely (exactly as I described) after fixing the operator precedence with `(v[i] | mask) == mask`. https://godbolt.org/z/Ta6jrf shows SSE2. Use `-march=haswell` for AVX2, allowing unaligned memory-source operands like `vpor ymm5, ymm0, ymmword ptr [rax + arr+4096]`. (It counts RAX up towards zero to save instructions). – Peter Cordes Feb 01 '21 at 22:42

1 Answers1

2

AVX1 is useless for this, it only does floating point stuff and you have integers there. Here’s AVX2 port of your code.

#include <immintrin.h>
#include <assert.h>

// Compute horizontal sum of all int32 lanes of the vector
inline int horizontalSum( const __m256i v )
{
    // Add 8 scalars into 4, by extracting lower/higher pieces from the source vector
    const __m128i r4 = _mm_add_epi32( _mm256_castsi256_si128( v ),
        _mm256_extracti128_si256( v, 1 ) );
    // Add 4 scalars into 2, using shift by 8 bytes
    const __m128i r2 = _mm_add_epi32( r4, _mm_srli_si128( r4, 8 ) );
    // Add 2 lowest scalars, using shift by 4 bytes
    const __m128i r1 = _mm_add_epi32( r2, _mm_srli_si128( r2, 4 ) );
    // Move the lowest int32 into scalar register
    return _mm_cvtsi128_si32( r1 );
}

int countMaskedAvx2( const int* src, size_t length, int mask )
{
    assert( length <= INT_MAX );

    const int* const srcEndAligned = src + ( length / 8 ) * 8;
    const int* const srcEnd = src + length;
    // Broadcast mask into all 8 lanes of a vector register
    const __m256i maskVector = _mm256_set1_epi32( mask );
    // Initialize counters with zero vector
    __m256i counters = _mm256_setzero_si256();

    // Handle vast majority of data with AVX2 code
    while( src < srcEndAligned )
    {
        // Load 8 scalars from the array
        __m256i v = _mm256_loadu_si256( ( const __m256i * )src );
        src += 8;
        // val | mask
        v = _mm256_or_si256( v, maskVector );
        // ( val | mask ) == mask
        v = _mm256_cmpeq_epi32( v, maskVector );
        // Increment the counters.
        // The result of that comparison instruction is 0 for false, or 0xFFFFFFFF for true.
        // Conveniently, 0xFFFFFFFF bit pattern equals to -1 signed integer.
        counters = _mm256_sub_epi32( counters, v );
    }

    // Compute horizontal sum of the counters vector
    int counter = horizontalSum( counters );

    // Handle the final 1-7 scalars of the source array
    while( src < srcEnd )
    {
        const int v = mask | ( *src );
        src++;
        const int add = ( v == mask ) ? 1 : 0;
        counter += add;
    }
    return counter;
}

If you don’t have AVX2, the same approach is doable with SSE2. You can rely on the support of that, all 64-bit x86 processors are required to support at least SSE1 and SSE2. SSE2 version will run twice as many instructions compared to AVX2, but still it gonna use 128-bit wide memory loads, and 4-wide arithmetic instructions. Even with SSE2, I would expect measurable performance improvement compared to your original code.

Update: for optimal performance of the above code, make sure your input vector is aligned by 32 bytes (AVX2) or 16 bytes (SSE2). If you have C++/11 or newer, alignas(32) is often good enough.

Soonts
  • 20,079
  • 9
  • 57
  • 130
  • all works!!! thanks, but only in Debug, not Release, why? old and new code gives identical results in Debug only – Zhihar Feb 01 '21 at 21:57
  • AVX isn't actually useless, it allows an unaligned load to fold into a memory source operand for `vpor`, saving a front-end uop vs. legacy SSE. Like `vpor xmm5, xmm0, [rax + arr+4096]` from clang `-march=sandybridge` (https://godbolt.org/z/dfrzqb) auto-vectorizing the simple C loop (with fixed operator precedence). – Peter Cordes Feb 01 '21 at 22:45
  • There's not much point manually vectorizing if you can use clang or gcc (although GCC doesn't unroll). @Zhihar. But if you have MSVC, it does a much worse job auto-vectorizing, even if you use `count += (a[i] | mask) == mask` instead of the if: https://godbolt.org/z/ebj761 it still does a manual blend with AND/ANDN/OR to select a 0 or 1 to add, instead of subtracting -1 or 0. – Peter Cordes Feb 01 '21 at 22:47
  • @Zhihar: Nothing will ever be faster than `arr[i] | mask == mask` because that evaluates to `true` at compile time. https://godbolt.org/z/z1Ydb6 shows MSVC compiles the whole function to `return arr_size`. At least in C mode; IDK why it's not the same in C++: https://godbolt.org/z/MsET1E, but clang is. (`(arr[index] & mask == arr[index])` isn't always true so has to be evaluated, but also gives wrong answers). I edited your question to note the bugs in your code blocks. – Peter Cordes Feb 01 '21 at 23:02