I've been doing a task on an online judge: implement int sum(const int* array, unsigned int len)
so that it returns the array of the sum. len
can be 200,000 and this function can be called 200,000 times; and my program has to execute in under 0.9s.
Currently, my code looks like this:
#include <immintrin.h>
#include <stdio.h>
int sum(const int* array, unsigned int len) {
register int i = 8, s = 0;
__m256i sm = _mm256_loadu_si256((void *)(array));
for (; i+8 < len; i += 8) {
const __m256i x = _mm256_loadu_si256((void *)(array+i));
sm = _mm256_add_epi32(sm, x);
}
sm = _mm256_hadd_epi32(sm, sm);
sm = _mm256_hadd_epi32(sm, sm);
s = _mm256_extract_epi32(sm, 0);
s += _mm256_extract_epi32(sm, 4);
for(; i < len; ++i) s += array[i];
return s;
}
However, this code does not pass as the judge reports Time limit exceeded
.
Could anyone point out which instructions are expensive time-wise, and how to speed up my code?