The main improvement possible for your technique is a sliding window. Integer addition has an inverse, subtraction. This makes it possible to drop elements out of the sum and add new ones, instead of re-computing from scratch. (Terryfkjc's answer demonstrates this).
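To make that concrete, here's a minimal scalar sketch of the whole technique (O(n) instead of recomputing every window; max_window_sum and the exact signature are illustrative names, not from the question):

int max_window_sum(const int *array, int n, int width) {
    int sum = 0;
    for (int i = 0; i < width; i++)       // first window: summed from scratch
        sum += array[i];
    int best = sum;
    for (int i = width; i < n; i++) {     // slide: add the new element, drop the old
        sum += array[i] - array[i - width];
        if (sum > best) best = sum;
    }
    return best;
}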
As Terryfkjc commented on his answer, there's thread-level parallelism to be had: you can have different threads check different parts of the array. If the size of the sliding window is more than half the size of the array, then summing the first n elements should be threaded. Dividing the work between threads is most tricky when n is about half the size of the array. Much bigger, and it can't slide as far. Much smaller, and most of the work is just the sliding.
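As a rough illustration of that chunking (an OpenMP sketch, not from the original answers: each thread takes a block of window start positions and reads up to width-1 elements past its block, reusing the scalar max_window_sum sketch above; the chunk size 4096 is an arbitrary placeholder):

#include <limits.h>
#include <omp.h>

int max_window_sum_mt(const int *array, int n, int width) {
    int nstarts = n - width + 1;    // number of window start positions (assumes n >= width)
    int best = INT_MIN;
    #pragma omp parallel for reduction(max:best)
    for (int t = 0; t < nstarts; t += 4096) {
        int count = (nstarts - t < 4096) ? (nstarts - t) : 4096;
        // windows starting in [t, t+count) only touch elements [t, t+count-1+width)
        int local = max_window_sum(array + t, count - 1 + width, width);
        if (local > best) best = local;
    }
    return best;
}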
If you're targeting a CPU with vector instructions, we can vectorize the initial sum of the first w (window size / width) elements easily enough with SIMD. For example, using x86 SSE (C/C++ intrinsics):
#include <immintrin.h>
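// assumes `array` (pointer to int) and `width` (window size in elements)
// are defined in the surrounding code, as in the question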
const __m128i *array_vec = (__m128i*)array;
__m128i sums = _mm_loadu_si128(array_vec); // or start with _mm_setzero_si128()
// careful to get the loop start/end and increment correct,
// whether you count by vectors (i++) or by elements (i+=4)
// You may need a scalar cleanup at the end if window width isn't a multiple of the vector width
for (int i=1; i < width/4 ; i+=1) {
sums = _mm_add_epi32(sums, _mm_loadu_si128(array_vec+i));
// if you know the input is aligned, and you're going to compile without AVX/AVX2, then do:
// sums = _mm_add_epi32(sums, array_vec[i]);
}
__m128i hisums = _mm_bsrli_si128(sums, 8); // get the high 2 elements into the low 2 elements of a separate vector
sums = _mm_add_epi32(sums, hisums); // one step of horizontal merging.
hisums = _mm_bsrli_si128(sums, 4); // actually, pshufd would be faster for this, since it doesn't need a mov to preserve sums.
sums = _mm_add_epi32(sums, hisums);
int window_sum = _mm_cvtsi128_si32(sums); // get the low 32bit element
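As the comment above hints, the shuffles can be done with pshufd (_mm_shuffle_epi32), which writes its result to a separate register instead of modifying one in place; a sketch of that variant, replacing the shuffle/add sequence above:

__m128i hi = _mm_shuffle_epi32(sums, _MM_SHUFFLE(1,0,3,2)); // swap the two 64-bit halves
sums = _mm_add_epi32(sums, hi);
hi = _mm_shuffle_epi32(sums, _MM_SHUFFLE(2,3,0,1));         // swap adjacent 32-bit elements
sums = _mm_add_epi32(sums, hi);
int window_sum = _mm_cvtsi128_si32(sums);                   // same result, one less mov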
I don't think it's possible to vectorize the sliding-window part. We can't usefully slide 4 separate windows, because every window (vector element) needs to see every array element.
However, if we're interested in 4/8/16 different window sizes (preferably consecutive), then we can do that with vectors. So, at each loop iteration, we have a vector holding the current sliding-window sum for each width. With SSE4.1 pmaxsd (for signed 32bit ints), we can do
window_max = _mm_max_epi32(window_max, window_sums);
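One detail this leaves implicit is initialization: each lane tracks a window of a different width, so the lanes need different starting sums. Going by the add/subtract indices in the update below (lane j adds a[i+j] and subtracts a[i-width-1]), lane j has to start as the sum over array[i0-width-1 .. i0-1+j], where i0 is the first loop iteration; a hedged sketch (i0 is an illustrative name, and this needs i0 >= width+1):

int s = 0;
for (int k = i0 - width - 1; k < i0; k++)
    s += array[k];                            // left part shared by all 4 lanes
__m128i window_sums = _mm_setr_epi32(s,
                                     s + array[i0],
                                     s + array[i0] + array[i0+1],
                                     s + array[i0] + array[i0+1] + array[i0+2]);
__m128i window_max = window_sums;             // running per-lane maxima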
The sliding window operation becomes: add the vector { a[i], a[i+1], a[i+2], a[i+3] }, while removing the same array element from all 4 vector elements. (Alternatively, broadcast the element to add, and subtract a vector of different elements.)
__m128i add_elements = _mm_loadu_si128((__m128i*)&array[i]); // load i, i+1, etc.
__m128i sub_element = _mm_cvtsi32_si128(array[i-width-1]);
sub_element = _mm_shuffle_epi32(sub_element, 0); // broadcast the low element of the vector to the other positions.
// AVX2: use _mm_broadcastd_epi32
// AVX512 has a broadcast mode for memory references that can be used with most instructions. IDK how to use it with intrinsics.
__m128i diff = _mm_sub_epi32(add_elements, sub_element);
window_sums = _mm_add_epi32(window_sums, diff);
Getting the loop start/end conditions right, and accounting for the last few elements correctly, is always the challenge with vectors.
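For completeness, turning the per-lane maxima into one scalar at the end uses the same shuffle pattern as the horizontal sum earlier (a sketch, assuming window_max was accumulated with pmaxsd as above):

__m128i m = _mm_max_epi32(window_max, _mm_shuffle_epi32(window_max, _MM_SHUFFLE(1,0,3,2)));
m = _mm_max_epi32(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(2,3,0,1)));
int best_sum = _mm_cvtsi128_si32(m);          // max over all 4 window widths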
See https://stackoverflow.com/tags/x86/info for more about how to use vectors on x86.