You didn't specify any particular architecture, so I'll keep this mostly architecture-neutral, with the algorithm described in English. But it requires a SIMD ISA that can efficiently branch on SIMD compare results to check a usually-true condition, like x86 but not really ARM NEON. This won't work well on NEON because it doesn't have a movemask equivalent, and SIMD -> integer transfers cause stalls on many ARM microarchitectures.
The normal case while looping over the array is that an element, or a whole SIMD vector of elements, is not a new min and not a new max-diff candidate. We can quickly fly through those elements, only slowing down to get the details right when there's a new min. This is like a SIMD strlen or SIMD memcmp, except that instead of stopping at the first search hit, we just go scalar for one block and then resume.
For each vector v[0..7] of the input array (assuming 8 int16_t elements per 16-byte vector, but that's arbitrary):
SIMD compare vmin > v[0..7], and check for any elements being true (e.g. x86 _mm_cmpgt_epi16 / if(_mm_movemask_epi8(cmp) != 0)). If there's a new min somewhere, we have a special case: the old min applies to some elements, but the new min applies to others. And it's possible there are multiple new-min updates within the vector, and new-diff candidates at any of those points.
So handle this vector with scalar code (updating a scalar diff which doesn't need to be in sync with the vector diffmax, because we don't need the position).
Broadcast the final min to vmin when you're done. Or do a SIMD horizontal min so out-of-order execution of later SIMD iterations can get started without waiting for a vmin from scalar. This should work well if the scalar code is branchless, so there are no mispredicts in the scalar code that cause later vector work to be thrown out.
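For the re-broadcast, a SIMD horizontal min plus an in-register broadcast keeps everything in vector registers. A minimal sketch for signed 16-bit lanes, SSE2 only (the helper name is mine, not from the answer above):

#include <immintrin.h>

// Reduce 8 signed 16-bit lanes to their minimum and broadcast it to every lane,
// e.g. to rebuild vmin as min(old vmin, this block) without a scalar round trip.
static inline __m128i broadcast_hmin_epi16(__m128i v)
{
    v = _mm_min_epi16(v, _mm_srli_si128(v, 8));   // fold the high half into the low half
    v = _mm_min_epi16(v, _mm_srli_si128(v, 4));   // fold lanes 2-3 into lanes 0-1
    v = _mm_min_epi16(v, _mm_srli_si128(v, 2));   // fold lane 1 into lane 0
    v = _mm_shufflelo_epi16(v, 0);                // lane 0 -> lanes 0-3
    return _mm_shuffle_epi32(v, 0);               // dword 0 -> all dwords: min in all 8 lanes
}

Usage would be something like vmin = broadcast_hmin_epi16(_mm_min_epi16(vmin, v)); right after handling the special-case block. The zeros that _mm_srli_si128 shifts in never reach lane 0, so the reduction result in lane 0 is still correct even though the upper lanes get clobbered along the way.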
As an alternative, a SIMD prefix-sum type of thing (actually prefix-min) could produce a vmin where every element is the min up to that point (see parallel prefix (cumulative) sum with SSE). You could always do this to avoid any branching, but if new-min candidates are rare then it's expensive. Still, it could be viable on ARM NEON where branching is hard.
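For reference, the prefix-min scan itself is just log2(8) = 3 shift-and-min steps, as long as you shift in the identity for min (all-ones for unsigned) rather than zero. A hedged sketch, assuming unsigned 16-bit elements, SSSE3 (_mm_alignr_epi8) and SSE4.1 (_mm_min_epu16); the helper and parameter names are illustrative:

#include <immintrin.h>

// After this, lane k holds min(v[0..k]), folded with a running min carried
// in (broadcast) from all earlier vectors.
static inline __m128i prefix_min_u16(__m128i v, __m128i running_min)
{
    const __m128i ones = _mm_set1_epi16(-1);             // 0xFFFF = identity for unsigned min
    // _mm_alignr_epi8(v, ones, 16 - 2*n) shifts v left by n lanes, shifting in 0xFFFF
    v = _mm_min_epu16(v, _mm_alignr_epi8(v, ones, 14));  // combine with 1 lane to the left
    v = _mm_min_epu16(v, _mm_alignr_epi8(v, ones, 12));  // ... 2 lanes to the left
    v = _mm_min_epu16(v, _mm_alignr_epi8(v, ones, 8));   // ... 4 lanes to the left
    return _mm_min_epu16(v, running_min);                // fold in the min of earlier vectors
}

The diff update then has no branch at all: diff[k] = v[k] - prefix_min[k] per lane, with the new running min taken from the last lane of the result, at the cost of these extra shuffles every iteration.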
If there's no new min, SIMD packed max: diffmax[0..7] = max(diffmax[0..7], v[0..7] - vmin). (Use saturating subtraction so you don't get wrap-around to a large unsigned difference, if you're using unsigned max to handle the full range.)
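Putting those two cases together, the body of the loop could look something like this sketch. It assumes signed int16_t elements and SSE2 plus _mm_max_epu16 from SSE4.1; process_block and its parameter names are hypothetical, not from any existing code. The scalar fallback is written as ternaries so the compiler can keep it branchless (cmov):

#include <immintrin.h>
#include <cstdint>

// Process one block of 8 int16_t elements.
// vmin = running min broadcast to all lanes, vdiffmax = per-lane max differences,
// smin/sdiff = scalar state, only updated on the rare new-min path.
static inline void process_block(const int16_t *p, __m128i &vmin, __m128i &vdiffmax,
                                 int &smin, int &sdiff)
{
    __m128i v = _mm_loadu_si128((const __m128i*)p);
    __m128i newmin = _mm_cmpgt_epi16(vmin, v);        // lanes where this block has a new min
    if (_mm_movemask_epi8(newmin) != 0) {
        // rare path: redo this block in scalar, branchlessly
        for (int k = 0; k < 8; k++) {
            int x = p[k];
            smin  = (x < smin) ? x : smin;
            sdiff = (x - smin > sdiff) ? (x - smin) : sdiff;
        }
        vmin = _mm_set1_epi16((int16_t)smin);         // re-broadcast the updated min
    } else {
        // common path: every lane >= vmin, so v - vmin is a correct unsigned 16-bit diff
        __m128i diff = _mm_sub_epi16(v, vmin);
        vdiffmax = _mm_max_epu16(vdiffmax, diff);     // unsigned max, SSE4.1
    }
}

Initialize smin to INT16_MAX and vmin to its broadcast, sdiff to 0 and vdiffmax to zero. The rare path doesn't touch vdiffmax and the common path doesn't touch sdiff, which is fine because only the final max matters, not its position.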
At the end of the loop, do a SIMD horizontal max of the diffmax vector. Notice that since we don't need the position of the max difference, we don't need to update all elements inside the loop when one finds a new candidate. We don't even need to keep the scalar special-case diffmax and the SIMD vdiffmax in sync with each other; just check at the end and take the max of the scalar and SIMD max diffs.
A SIMD horizontal min/max is basically the same as a horizontal sum, except you use packed-min/max instead of packed-add. For x86, see Fastest way to do horizontal float vector sum on x86.
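For the 8 x 16-bit diffs here, that reduction plus the final combine with the scalar fallback could look like this sketch, using the same vdiffmax / sdiff names as above (again needs SSE4.1 for _mm_max_epu16; shifting in zeros is harmless because 0 is the identity for unsigned max):

#include <immintrin.h>
#include <algorithm>

// Reduce the 8 unsigned 16-bit lanes of vdiffmax to a scalar and
// combine it with the diff tracked by the scalar fallback.
static inline int combine_diffs(__m128i vdiffmax, int sdiff)
{
    vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 8)); // fold high half into low
    vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 4)); // fold lanes 2-3 into 0-1
    vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 2)); // fold lane 1 into lane 0
    int vmax = _mm_cvtsi128_si32(vdiffmax) & 0xFFFF;                 // lane 0 = SIMD max diff
    return std::max(vmax, sdiff);
}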
Or on x86 with SSE4.1, for 16-bit integer elements, phminposuw / _mm_minpos_epu16 can be used for min or max, signed or unsigned, with appropriate tweaks to the input: for an unsigned max, complement the inputs and the result, i.e. max = 0xFFFF - min(0xFFFF - diffmax). You can treat diffmax as unsigned because it's known to be non-negative, but Horizontal minimum and maximum using SSE shows how to flip the sign bit to range-shift signed to unsigned and back.
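A sketch of that phminposuw trick for the non-negative diffs (the helper name is mine); _mm_minpos_epu16 puts the minimum value in lane 0 and its index in lane 1, and we only want the value:

#include <immintrin.h>

// Horizontal max of 8 unsigned 16-bit lanes via phminposuw:
// max(x) == 0xFFFF - min(0xFFFF - x), and 0xFFFF - x is just bitwise NOT for a uint16.
static inline unsigned hmax_epu16(__m128i x)
{
    __m128i inv = _mm_xor_si128(x, _mm_set1_epi16(-1));                  // ~x in every lane
    unsigned minval = _mm_cvtsi128_si32(_mm_minpos_epu16(inv)) & 0xFFFF; // lane 0 = min(~x)
    return 0xFFFF - minval;                                              // undo the complement
}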
We probably get a branch mispredict every time we find a new min candidate, or else we're finding new min candidates too often for this to be efficient.
If new min candidates are expected frequently, using shorter vectors could be good. Or on discovering there's a new min in the current vector, you could use narrower vectors to only go scalar over fewer elements. On x86, you might use bsf (bit-scan forward) to find which element had the first new min. That gives your scalar code a data dependency on the vector compare mask, but if the branch to it was mispredicted, then the compare mask will already be ready. Otherwise, if branch prediction can somehow find a pattern in which vectors need the scalar fallback, prediction + speculative execution will break that data dependency.
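With the 16-bit mask from _mm_movemask_epi8, that lookup is a count-trailing-zeros. A small sketch (using __builtin_ctz on GCC/Clang; MSVC has _BitScanForward):

#include <immintrin.h>

// Given that the compare found at least one new min in this block, return the
// index of the first offending lane so scalar code can start there.
static inline int first_new_min_lane(__m128i vmin, __m128i v)
{
    unsigned mask = _mm_movemask_epi8(_mm_cmpgt_epi16(vmin, v)); // 2 mask bits per 16-bit lane
    return __builtin_ctz(mask) >> 1;   // bsf/tzcnt; caller must ensure mask != 0
}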
Unfinished / broken (by me) example adapted from @harold's deleted answer: a fully branchless version that constructs a vector of running min (a suffix-min, since it scans backward) on the fly, for x86 SSE2 plus the SSE4.1 _mm_min_epu16 / _mm_max_epu16 unsigned 16-bit min/max. (@harold wrote it with suffix-max instead of min, which is I think why he deleted it; I partially converted it from max to min.) A branchless intrinsics version for x86 could look something like this, but branchy is probably better unless you expect some kind of slope or trend that makes new min values frequent.
// BROKEN, see FIXME comments.
// converted from @harold's suffix-max version
#include <immintrin.h>
#include <cstdint>
#include <vector>
#include <algorithm>

int broken_unfinished_maxDiffSSE(const std::vector<uint16_t> &input) {
    const uint16_t *ptr = input.data();
    // scan backward, constructing a suffix-min on the fly
    // and finding the max-diff at the same time
    __m128i min = _mm_set1_epi16(-1);             // 0xFFFF = UINT16_MAX in every lane
    __m128i maxdiff = _mm_setzero_si128();
    size_t i = input.size();
    for (; i >= 8; i -= 8) {
        __m128i data = _mm_loadu_si128((const __m128i*)(ptr + i - 8));
        // FIXME: the byte shifts below shift in 0, but the identity for min is 0xFFFF.
        // Either shift in 0xFFFF, or keep the old data, maybe with _mm_alignr_epi8.
        __m128i d = data;
        // link with the suffix-min carried over from the previous (higher) block
        d = _mm_min_epu16(d, _mm_slli_si128(min, 14));
        // do the suffix-min within this block
        // FIXME: check that this srli/shuffle sequence propagates the min into every lane
        d = _mm_min_epu16(d, _mm_srli_si128(d, 2));
        d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xFA));
        d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xEE));
        min = d;
        // update max-diff
        __m128i diff = _mm_subs_epu16(data, min);  // saturates to 0 instead of wrapping
        maxdiff = _mm_max_epu16(maxdiff, diff);
    }
    // horizontal max (shifting in zeros is fine: 0 is the identity for unsigned max)
    maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 8));
    maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 4));
    maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 2));
    int res = _mm_cvtsi128_si32(maxdiff) & 0xFFFF;
    // lane 0 holds the suffix-min of everything already processed
    unsigned scalarmin = _mm_extract_epi16(min, 0);
    for (; i != 0; i--) {
        scalarmin = std::min(scalarmin, (unsigned)ptr[i - 1]);
        res = std::max(res, (int)(ptr[i - 1] - scalarmin));
    }
    return res != 0 ? res : -1;
}
We could replace the scalar cleanup with a final unaligned vector, if we handle the overlap between it and the last full vector's min.