
I am trying to speed up a bitwise OR operation over very long binary vectors stored as 32-bit integers.

In this example we can assume that nwords, the number of words, is a multiple of both 4 and 8, so there is no loop remainder. These binary vectors can contain many thousands of bits.

Moreover, all three bit vectors are allocated with _aligned_malloc(), aligned to 16 and 32 bytes for SSE2 and AVX2, respectively.
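
For reference, the allocation looks roughly like this (a sketch, not my exact code; allocBitVector is a hypothetical helper, and I'm assuming MSVC's _aligned_malloc from <malloc.h>):

#include <malloc.h>  /* MSVC: _aligned_malloc / _aligned_free */

/* Hypothetical helper: 32-byte alignment satisfies both
   SSE2 (16 bytes) and AVX2 (32 bytes). */
unsigned int *allocBitVector(int nwords)
{
    return (unsigned int *)_aligned_malloc(nwords * sizeof(unsigned int), 32);
}

Buffers allocated this way must be released with _aligned_free(), not free().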

To my surprise, the following scalar, SSE2, and AVX2 versions execute in exactly the same amount of time on my i7 CPU. I didn't see the expected 4x and 8x speed-ups from the SSE2 and AVX2 registers.

My Visual Studio version is 15.1.

Scalar code:

void vectorOr_Scalar(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    /* OR one 32-bit word per iteration. */
    for (unsigned int *end = ptr1 + nwords; ptr1 < end; ptr1++, ptr2++, out++)
        *out = *ptr1 | *ptr2;
}

SSE2 code:

#include <emmintrin.h>  /* SSE2 intrinsics */

void vectorOr_SSE2(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    /* OR four 32-bit words per iteration; pointers must be 16-byte aligned. */
    for (int i = 0; i < nwords; i += 4, ptr1 += 4, ptr2 += 4, out += 4)
    {
        __m128i v1 = _mm_load_si128((const __m128i *)ptr1);
        __m128i v2 = _mm_load_si128((const __m128i *)ptr2);
        _mm_store_si128((__m128i *)out, _mm_or_si128(v1, v2));
    }
}

AVX2 code:

#include <immintrin.h>  /* AVX2 intrinsics */

void vectorOr_AVX2(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    /* OR eight 32-bit words per iteration; pointers must be 32-byte aligned. */
    for (int i = 0; i < nwords; i += 8, ptr1 += 8, ptr2 += 8, out += 8)
    {
        __m256i v1 = _mm256_load_si256((const __m256i *)ptr1);
        __m256i v2 = _mm256_load_si256((const __m256i *)ptr2);
        _mm256_store_si256((__m256i *)out, _mm256_or_si256(v1, v2));
    }
}

Perhaps this application is not a good fit for vectorization, given the limited number of register operations between the loads and stores?
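
For what it's worth, the timings come from a simple wall-clock harness along these lines (a sketch, not my exact benchmark; NWORDS, REPS, and the allocBitVector helper above are illustrative):

#include <stdio.h>
#include <time.h>

#define NWORDS 65536   /* multiple of both 4 and 8, so no loop remainder */
#define REPS   100000

int main(void)
{
    unsigned int *p1 = allocBitVector(NWORDS);
    unsigned int *p2 = allocBitVector(NWORDS);
    unsigned int *o  = allocBitVector(NWORDS);
    for (int i = 0; i < NWORDS; i++) { p1[i] = i; p2[i] = ~i; }

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        vectorOr_AVX2(p1, p2, o, NWORDS);   /* swap in the variant under test */
    printf("%.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}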

Liotro78
  • Compilers are pretty clever these days; are you sure they're not generating similar code for your scalar case? – Mark Ransom Nov 13 '19 at 15:38
  • 0. Check that you have AVX2 enabled together with optimizations (otherwise the compiler will call vector replacement functions, without the hardware boost). 1. Could you try OpenMP? Just add `#pragma omp parallel for simd` – Victor Gubin Nov 13 '19 at 15:39
  • As @MarkRansom said, your compiler likely figured out by itself how to vectorize your function. Look for `vorps` or `vpor` instructions in the hot loop: https://godbolt.org/z/MR2x8V – chtz Nov 13 '19 at 15:41
  • Hello guys, yes it seems like Mark said. Thanks a lot! – Liotro78 Nov 13 '19 at 15:46
  • I looked for a duplicate; I'm sure I've seen Q&As where the answer was that the scalar version was simple enough to auto-vectorize, but I didn't manage to find any. – Peter Cordes Nov 13 '19 at 17:07
  • The keyword is [tag:auto-vectorization] – phuclv Nov 13 '19 at 20:52
  • some possible duplicates: [Why my SSE code is slower than native C++ code?](https://stackoverflow.com/q/54581729/995714), [Why AVX dot product slower than native C++ code](https://stackoverflow.com/q/46502748/995714), [SIMD code runs slower than scalar code](https://stackoverflow.com/q/4394930/995714) – phuclv Nov 17 '19 at 04:58

1 Answer


The reason you don't observe a performance difference between the loop that processes one unsigned int at a time and the SIMD loops that process 4 or 8 at a time is that the compiler generates SIMD code for you, and unrolls the loop as well; see the generated assembly.
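
One quick way to check (a sketch; the file name is hypothetical and the flags are just examples — MSVC auto-vectorizes at /O2, while GCC needs -O3 or -ftree-vectorize):

/* or.c -- compile and inspect the asm for vpor / vmovdqu:
     MSVC: cl /O2 /arch:AVX2 /FA or.c    (writes or.asm)
     GCC:  gcc -O3 -mavx2 -S or.c        (writes or.s)   */
void vectorOr_Scalar(unsigned int *ptr1, unsigned int *ptr2,
                     unsigned int *out, int nwords)
{
    for (int i = 0; i < nwords; i++)
        out[i] = ptr1[i] | ptr2[i];   /* plain C; the optimizer emits SIMD */
}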

Maxim Egorushkin