How to optimize array iteration in C for intel processors

Question

Tthe function difference may compute the amount of unequal elements of two char arrays of the length n:

size_t difference(size_t n, const char a[n], const char b[n]) {
    size_t res = 0;
    for (size_t i = 0; i < n; i++)
        res += a[i] != b[i];
    return res;
}

How can I make the iteration faster regarding the memory allocation of the arrays and the caches? Is there a way to use SSE intel intrinsics?

EDIT:

This is a standalone program which may use as few cycles as possible and use no compiler directives/pragmas.

In C++ you could try `std::count_if` with an execution policy of `par_unseq` . — wcochran, Jun 10 '23 at 14:06
BTW Clang and GCC vectorize this, in different ways, but [both not great](https://godbolt.org/z/haveWMKT7). So yes, it is useful to do this manually. — harold, Jun 10 '23 at 14:34
Similar to [How to count character occurrences using SIMD](https://stackoverflow.com/q/54541129/555045). Also based on compare, subtract, `_mm_sad_epu8` (but not on every iteration of the inner loop), etc. In your case you would load characters from two sources, and since you want to count non-matches, subtract from `n` in the end. — harold, Jun 10 '23 at 14:57
Too bad you say you can't use pragmas... OpenMP's SIMD mode would be useful. — Shawn, Jun 10 '23 at 15:46
@Shawn how do you make it do something useful? I tried it like [this](https://godbolt.org/z/obzYhvsaK) but that didn't do anything. — harold, Jun 10 '23 at 16:07
Like that, yes. I don't know what you mean by "didn't do anything", clearly both compilers vectorize the loop. clang better than gcc. — Shawn, Jun 10 '23 at 16:20
@Shawn it's vectorized the same way as without the pragma, apart from some unimportant details — harold, Jun 10 '23 at 16:31

score 0 · Answer 1 · answered Jun 10 '23 at 13:57

0

Make your n-byte array as blocks of 16 signed or unsigned 8-bit integers and use __m128i _mm_cmpeq_epi8(__m128i a, __m128i b);

When each byte of a is not equal to the corresponding byte of b, you will get a 0 in your result register. In the end of each block analysis, count number of zeros and add theme up to your res value.

Don't forget to include emmintrin.h header file.

Note: you may get a big performance boost, when your n is large.

answered Jun 10 '23 at 13:57

Iman Abdollahzadeh

279
2
12

1

You should not count the zeros and add them at the end of each block analysis. That uses needless cross-element operations… – Eric Postpischil Jun 10 '23 at 14:36
… Instead: The compare instructions produce all 0 bits (comparison false) or all 1 bits (true) in each element, and all 1 bits is equivalent to −1. So we can simply do an element-wise add for at least 255 blocks. That will accumulate a count of the number of true results. After doing many blocks, we can accumulate those into wider elements and repeat and eventually combine into one total or we can accumulate them into one total, skipping the wider-element step. The number of false results can be computed from the number of true results and the number of elements compared. – Eric Postpischil Jun 10 '23 at 14:36
Could also use `_mm_movemask_epi8()` and a popcount to get the number of matches. – Shawn Jun 10 '23 at 15:34
@EricPostpischil: [Can counting byte matches between two strings be optimized using SIMD?](https://stackoverflow.com/q/15598997) shows an example. Counting mismatches is just `len - count_matches()`, so you just count the thing that's most efficient inside the loop: subtracting `0` or `-1` to count matches. [How to count character occurrences using SIMD](https://stackoverflow.com/q/54541129) has an AVX2 version. (Comparing against `set1_epi8(key)` instead of loading from two arrays, otherwise the same problem.) – Peter Cordes Jun 10 '23 at 15:55
@PeterCordes FYI there is [cross-posted version of this question](https://codereview.stackexchange.com/q/285452/36018), you can weigh in there too – harold Jun 11 '23 at 00:46

How to optimize array iteration in C for intel processors

1 Answers1