1

I've been doing a task on an online judge: implement `int sum(const int* array, unsigned int len)` so that it returns the sum of the array. `len` can be up to 200,000, the function can be called up to 200,000 times, and my program has to finish in under 0.9 s.

Currently, my code looks like this:

#include <immintrin.h>
#include <stdio.h>

int sum(const int* array, unsigned int len) {
    register int i = 8, s = 0;
    __m256i sm = _mm256_loadu_si256((void *)(array));
    for (; i+8 < len; i += 8) {
        const __m256i x = _mm256_loadu_si256((void *)(array+i));
        sm = _mm256_add_epi32(sm, x);
    }
    sm = _mm256_hadd_epi32(sm, sm);
    sm = _mm256_hadd_epi32(sm, sm);
    s = _mm256_extract_epi32(sm, 0);
    s += _mm256_extract_epi32(sm, 4);
    for(; i < len; ++i) s += array[i];
    return s;
}

However, this code does not pass: the judge reports Time Limit Exceeded.

Could anyone point out which instructions are expensive time-wise, and how to speed up my code?

Ryan Zhang
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/226768/discussion-on-question-by-ryan-jon-zhang-using-intel-intrinsics-to-quickly-find). – Samuel Liew Jan 03 '21 at 11:35

2 Answers

2

Doing a quick check, it looks like most reasonably recent processors provide two load ports and two ports for addition, so at least in theory you should get a decent gain by unrolling two iterations of the loop into two separate accumulators; a single accumulator is a serial dependency chain, so it can't use both add ports. (Though if the data is very large, it will probably come down fairly quickly to main-memory bandwidth.)

As with any AVX operation, you want to ensure the data you're working with is properly aligned. Older processors will fault if the data is misaligned; newer ones will handle it, but with a fairly serious speed penalty.
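
If you do control the allocation (you don't on the judge, where the array is handed to you), here is a minimal sketch of keeping a buffer 32-byte aligned so the aligned-load form is legal; the buffer, the size `N`, and the setup are made up for illustration, assuming a C11 compiler:

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustration only: N and buf are not part of the original problem. */
#define N 1024u

int main(void) {
    /* C11 aligned_alloc gives 32-byte alignment to match __m256i;
       the byte count (4096 here) must be a multiple of the alignment. */
    int *buf = aligned_alloc(32, N * sizeof(int));
    if (!buf) return 1;
    for (unsigned int i = 0; i < N; ++i) buf[i] = 1;

    /* Aligned form: requires a 32-byte-aligned address; a misaligned one may fault. */
    __m256i a = _mm256_load_si256((const __m256i *)buf);
    /* Unaligned form: always legal, but may be slower when a load splits a cache line. */
    __m256i b = _mm256_loadu_si256((const __m256i *)(buf + 1));

    printf("%d %d\n", _mm256_extract_epi32(a, 0), _mm256_extract_epi32(b, 0));
    free(buf);
    return 0;
}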

Jerry Coffin
0

Implementing @JerryCoffin's suggestions:


#include <immintrin.h>
#include <stdio.h>

int sum(const int* array, unsigned int len) {
    if(len < 60) {
        // Small inputs: a plain scalar loop beats the vector path's startup overhead.
        int s = 0;
        for(int i = 0; i < len; ++i) s += array[i];
        return s;
    }
    register int i = 0, s = 0;
    // Two independent accumulators so the two loads and two adds per
    // iteration can run in parallel (Jerry Coffin's unrolling suggestion).
    __m256i sm = _mm256_loadu_si256((void *)(array+i));
    __m256i sm2 = _mm256_loadu_si256((void *)(array+i+8));
    i += 16;
    for (; i+16 < len; i += 16) {
        const __m256i x = _mm256_loadu_si256((void *)(array+i));
        sm = _mm256_add_epi32(sm, x);
        const __m256i y = _mm256_loadu_si256((void *)(array+i+8));
        sm2 = _mm256_add_epi32(sm2, y);
    }
    // Combine the two accumulators, then reduce the 8 lanes to a scalar.
    sm = _mm256_add_epi32(sm, sm2);
    sm = _mm256_hadd_epi32(sm, sm);
    sm = _mm256_hadd_epi32(sm, sm);
    s += _mm256_extract_epi32(sm, 0);
    s += _mm256_extract_epi32(sm, 4);
    for(; i < len; ++i) s += array[i];  // scalar tail for the leftover elements
    return s;
}

Interestingly, because the function is called so many times, consuming ints one at a time until the pointer is 32-byte aligned actually takes more time than just using `loadu`.
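
For reference, a minimal sketch of that kind of alignment prologue (sum scalars until the pointer reaches a 32-byte boundary, then use aligned vector loads); the function name is made up for illustration and the exact code tried may have differed:

#include <immintrin.h>
#include <stdint.h>

static int sum_with_prologue(const int *array, unsigned int len) {
    unsigned int i = 0;
    int s = 0;
    /* Consume elements one at a time until the address is 32-byte aligned. */
    while (i < len && ((uintptr_t)(array + i) & 31u) != 0)
        s += array[i++];
    __m256i acc = _mm256_setzero_si256();
    for (; i + 8 <= len; i += 8)                 /* _mm256_load_si256 is now safe */
        acc = _mm256_add_epi32(acc, _mm256_load_si256((const __m256i *)(array + i)));
    for (; i < len; ++i) s += array[i];          /* scalar tail */
    acc = _mm256_hadd_epi32(acc, acc);           /* same reduction as above */
    acc = _mm256_hadd_epi32(acc, acc);
    return s + _mm256_extract_epi32(acc, 0) + _mm256_extract_epi32(acc, 4);
}

With calls this frequent, the extra per-call work in that prologue is presumably what outweighs any gain from the aligned loads.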

Ryan Zhang
  • *because the function is called so many times* - you mean with small counts, so startup overhead would be high? Yeah, especially if data is often aligned correctly, leaving it up to HW is usually good. (Although for something where re-processing the same bytes twice is cheap, like copy-and-add into non-overlapping output, an unaligned and possibly-overlapping first vector can be good). When data isn't hot in L1d cache, AVX2-capable CPUs often absorb misalignment costs fast enough to keep up with L3 or RAM (only a couple % slower), and even mostly with L2. – Peter Cordes Jan 03 '21 at 10:50
  • If small-array performance matters, use a better horizontal sum ([Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2](https://stackoverflow.com/q/60108658)), and consider handling the tail with an unaligned vector load that you mask to avoid double-counting. (Maybe also or instead with 0..3 `__m128i` vectors to reduce the amount of scalar cleanup needed to 0..3 scalars, down from 0..15, if odd sizes are normal). Branch prediction of the cleanup code could be problematic, though; IDK if you can profile the online-judge's workload. – Peter Cordes Jan 03 '21 at 10:54
  • If you are L2 bandwidth bound you might want to check out [this](https://github.com/travisdowns/uarch-bench/wiki/How-much-bandwidth-does-the-L2-have-to-give,-anyway%3F) – Noah Jan 28 '21 at 05:51
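
To make the horizontal-sum suggestion in the comments above concrete, here is a minimal sketch of the usual AVX2 lane-split reduction (the kind described in the linked question): fold the high 128-bit lane onto the low one, then shuffle within the 128-bit lane. The function name is made up for illustration.

#include <immintrin.h>

/* A cheaper 8 x int32 horizontal sum than two vphaddd plus two cross-lane extracts. */
static int hsum_epi32_avx2(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);          /* low 128 bits (no instruction emitted) */
    __m128i hi = _mm256_extracti128_si256(v, 1);     /* high 128 bits */
    __m128i x  = _mm_add_epi32(lo, hi);              /* 4 partial sums */
    x = _mm_add_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2)));  /* 2 partial sums */
    x = _mm_add_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));  /* total in every element */
    return _mm_cvtsi128_si32(x);                     /* take element 0 */
}

In the answer above this would replace the two `_mm256_hadd_epi32` calls and the two extracts with a single `s += hsum_epi32_avx2(sm);`.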