I wrote an algorithm to find the largest difference between two elements of a std::vector, where the larger of the two values must be at a higher index than the smaller one.

    int maxDiff(const std::vector<unsigned short> &input)
    {
        unsigned short min = input.front();
        unsigned short res = 0;

        for (size_t i = 1; i < input.size(); ++i)
        {
            if (input[i] <= min)
            {
                min = input[i];
                continue;
            }

            int dif = input[i] - min;
            res = dif > res ? dif : res;
        }

        return res != 0 ? res : -1;
    }

Is it possible to optimize this algorithm using SIMD? I'm new to SIMD, and so far I've been unsuccessful with this one.

Yum

1 Answer

You didn't specify any particular architecture, so I'll keep this mostly architecture-neutral, with the algorithm described in English. But it requires a SIMD ISA that can efficiently branch on SIMD compare results to check a usually-true condition, like x86 but not really ARM NEON.

This won't work well for NEON because it doesn't have a movemask equivalent, and SIMD -> integer causes stalls on many ARM microarchitectures.
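
For example, on AArch64 the per-vector check would have to look something like this sketch (untested; anyNewMin is just an illustrative helper name), where the vmaxvq_u16 horizontal reduction is exactly the SIMD -> integer transfer in question:

    #include <arm_neon.h>

    // AArch64-only sketch: is any lane of v below the current min?
    // The vector->scalar reduction (vmaxvq_u16) is the stall mentioned above.
    static inline bool anyNewMin(uint16x8_t v, uint16x8_t vmin)
    {
        uint16x8_t lt = vcltq_u16(v, vmin);   // all-ones lanes where v < vmin
        return vmaxvq_u16(lt) != 0;           // horizontal max, then scalar test
    }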


The normal case while looping over the array is that an element, or a whole SIMD vector of elements, is not a new min and not a new max-diff candidate. We can quickly fly through those elements, only slowing down to get the details right when there's a new min. This is like a SIMD strlen or SIMD memcmp, except instead of stopping at the first search hit, we just go scalar for one block and then resume.


For each vector v[0..7] of the input array (assuming 8 int16_t elements per vector (16 bytes), but that's arbitrary):

  • SIMD compare vmin > v[0..7], and check for any elements being true (e.g. x86 _mm_cmpgt_epi16 / if(_mm_movemask_epi8(cmp) != 0); note that _mm_cmpgt_epi16 is a signed compare, so for unsigned elements either range-shift by 0x8000 or use a min/cmpeq trick). If there's a new min somewhere, we have a special case: the old min applies to some elements, but the new min applies to others. And it's possible there are multiple new-min updates within the vector, and new-diff candidates at any of those points.

    So handle this vector with scalar code (updating a scalar diff which doesn't need to be in sync with the vector diffmax because we don't need position).

    Broadcast the final min to vmin when you're done. Or do a SIMD horizontal min so out-of-order execution of later SIMD iterations can get started without waiting for a vmin from scalar. That should work well if the scalar code is branchless, so there are no mispredicts in the scalar code that cause later vector work to be thrown out.

    As an alternative, a SIMD prefix-sum type of thing (actually a prefix-min) could produce a vmin where every element is the min up to that point (see parallel prefix (cumulative) sum with SSE). You could always do this to avoid any branching, but if new-min candidates are rare then it's expensive. Still, it could be viable on ARM NEON, where branching is hard; there's a sketch of the in-vector prefix-min right after this list.

  • If there's no new min, SIMD packed max diffmax[0..7] = max(diffmax[0..7], v[0..7]-vmin). (Use saturating subtraction so you don't get wrap-around to a large unsigned difference, if you're using unsigned max to handle the full range.)
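
Here's a sketch of that in-vector prefix-min for x86 (illustrative, untested; assumes SSSE3 for _mm_alignr_epi8 and SSE4.1 for _mm_min_epu16 — NEON would use vext the same way). The key detail is shifting in 0xFFFF, the identity value for unsigned min, instead of zeros:

    #include <immintrin.h>

    // d[k] becomes min(d[0..k]); carried_min is the broadcast running min
    // from all previous vectors.
    static inline __m128i prefixMinU16(__m128i d, __m128i carried_min)
    {
        const __m128i ones = _mm_set1_epi16(-1);              // 0xFFFF lanes
        d = _mm_min_epu16(d, _mm_alignr_epi8(d, ones, 14));   // shift in 1 element
        d = _mm_min_epu16(d, _mm_alignr_epi8(d, ones, 12));   // then 2 elements
        d = _mm_min_epu16(d, _mm_alignr_epi8(d, ones,  8));   // then 4 elements
        return _mm_min_epu16(d, carried_min);
    }

The main loop would then do diff = _mm_subs_epu16(v, prefixMinU16(v, vmin)) and broadcast the last element (e.g. with _mm_shuffle_epi8) to get the carried min for the next vector.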

At the end of the loop, do a SIMD horizontal max of the diffmax vector. Notice that since we don't need the position of the max-difference, we don't need to update all elements inside the loop when one finds a new candidate. We don't even need to keep the scalar special-case diffmax and SIMD vdiffmax in sync with each other, just check at the end to take the max of the scalar and SIMD max diffs.
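
Putting the pieces together, a minimal sketch of the branchy loop (untested; assumes SSE4.1 for _mm_min_epu16 / _mm_max_epu16, at least one element, and uses the min/cmpeq trick for the unsigned compare; maxDiffBranchy is just an illustrative name) could look like:

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>
    #include <algorithm>

    int maxDiffBranchy(const uint16_t *p, size_t n)   // assumes n >= 1
    {
        uint16_t curmin = p[0];
        unsigned scalar_res = 0;                      // diffs from scalar fallbacks
        __m128i vmin = _mm_set1_epi16((short)curmin);
        __m128i vdiffmax = _mm_setzero_si128();

        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
            // vmin <= v in every lane? (unsigned compare via min+cmpeq)
            __m128i ge = _mm_cmpeq_epi16(_mm_min_epu16(vmin, v), vmin);
            if (_mm_movemask_epi8(ge) == 0xFFFF) {
                // common case: no new min in this vector
                __m128i diff = _mm_subs_epu16(v, vmin);    // saturates to 0
                vdiffmax = _mm_max_epu16(vdiffmax, diff);
            } else {
                // rare case: redo this block with scalar code (ideally branchless)
                for (size_t j = i; j < i + 8; ++j) {
                    curmin = std::min(curmin, p[j]);
                    scalar_res = std::max(scalar_res, unsigned(p[j] - curmin));
                }
                vmin = _mm_set1_epi16((short)curmin);      // re-broadcast new min
            }
        }
        // scalar tail; curmin is still the running min of everything before it
        for (; i < n; ++i) {
            curmin = std::min(curmin, p[i]);
            scalar_res = std::max(scalar_res, unsigned(p[i] - curmin));
        }

        // horizontal max (zeros shifted in are harmless for unsigned max)
        vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 8));
        vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 4));
        vdiffmax = _mm_max_epu16(vdiffmax, _mm_srli_si128(vdiffmax, 2));
        unsigned res = std::max(scalar_res,
                                unsigned(_mm_cvtsi128_si32(vdiffmax) & 0xFFFF));
        return res != 0 ? (int)res : -1;
    }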


A SIMD horizontal min/max is basically the same as a horizontal sum, except you use packed-max instead of packed-add. For x86, see Fastest way to do horizontal float vector sum on x86.

Or on x86 with SSE4.1 for 16-bit integer elements, phminposuw / _mm_minpos_epu16 can be used for min or max, signed or unsigned, with appropriate tweaks to the input: max = ~min(~diffmax), i.e. complement on the way in and out (a plain negation doesn't work for unsigned, because -0 == 0 would be the unsigned minimum). You can treat diffmax as unsigned because it's known to be non-negative, but Horizontal minimum and maximum using SSE shows how to flip the sign bit to range-shift signed to unsigned and back.
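
For example, a small sketch (assumes SSE4.1; hmax_epu16 is an illustrative helper name):

    #include <immintrin.h>

    // Horizontal max of 8 unsigned 16-bit elements via phminposuw:
    // min(0xFFFF - x) == 0xFFFF - max(x), so complement in and out.
    static inline unsigned hmax_epu16(__m128i v)
    {
        __m128i inv = _mm_xor_si128(v, _mm_set1_epi16(-1));  // ~v per lane
        __m128i minpos = _mm_minpos_epu16(inv);              // min in element 0
        return 0xFFFFu - unsigned(_mm_cvtsi128_si32(minpos) & 0xFFFF);
    }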


We probably get a branch mispredict every time we find a new min candidate, or else we're finding new min candidates too often for this to be efficient.

If new min candidates are expected frequently, using shorter vectors could be good. Or, on discovering there's a new min in the current vector, use narrower vectors to only go scalar over fewer elements. On x86, you might use bsf (bit-scan forward) to find which element had the first new min. That gives your scalar code a data dependency on the vector compare-mask, but if the branch to it was mispredicted then the compare-mask will be ready. Otherwise, if branch prediction can somehow find a pattern in which vectors need the scalar fallback, prediction + speculative execution will break that data dependency.
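
For example (a sketch; std::countr_zero from C++20's <bit> header is a portable bsf/tzcnt):

    #include <bit>        // std::countr_zero, C++20
    #include <cstddef>

    // Index of the first element where the compare was true, given the
    // 16-bit result of _mm_movemask_epi8 on a 16-bit-element compare.
    // Caller must ensure mask != 0 (we only get here on the rare path).
    static inline size_t firstSetElement(unsigned mask)
    {
        return std::countr_zero(mask) / 2;   // 2 mask bits per uint16_t lane
    }

The scalar fallback could then start at i + firstSetElement(mask), with e.g. mask = ~_mm_movemask_epi8(ge) & 0xFFFF from the earlier sketch.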


Unfinished / broken (by me) example adapted from @harold's deleted answer: a fully branchless version that constructs a vector of min-up-to-that-element on the fly, for x86 with SSE4.1 (for _mm_min_epu16 / _mm_max_epu16).

(@harold wrote it with suffix-max instead of min, which I think is why he deleted it. I partially converted it from max to min.)

A branchless intrinsics version for x86 could look something like this. But branchy is probably better unless you expect some kind of slope or trend that makes new min values frequent.

    // BROKEN, see FIXME comments.
    // converted from @harold's suffix-max version
    // FIXME: this backward scan builds a suffix-min, so it actually finds
    // max(v[lo] - min(v[lo..end])), i.e. the big value *before* the small one.
    // A min-based version needs a forward scan building a prefix-min.
    #include <immintrin.h>
    #include <vector>
    #include <cstdint>
    #include <algorithm>

    int broken_unfinished_maxDiffSSE(const std::vector<uint16_t> &input) {
        const uint16_t *ptr = input.data();

        // construct suffix-min
        // find max-diff at the same time
        __m128i min = _mm_set1_epi16(-1);      // all-ones: identity for unsigned min
        __m128i maxdiff = _mm_setzero_si128();

        size_t i = input.size();
        for (; i >= 8; i -= 8) {
            __m128i data = _mm_loadu_si128((const __m128i*)(ptr + i - 8));

            // FIXME: the byte shifts bring in 0, not 0xFFFF, which poisons a min
            // (and the in-block shuffle steps don't cover every lane).
            // Keep the old data instead, maybe with _mm_alignr_epi8.
            __m128i d = data;
            // link with suffix
            d = _mm_min_epu16(d, _mm_slli_si128(min, 14));
            // do suffix-min within block.
            d = _mm_min_epu16(d, _mm_srli_si128(d, 2));
            d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xFA));
            d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xEE));
            min = d;

            // update max-diff
            __m128i diff = _mm_subs_epu16(data, min);  // with saturation to 0
            maxdiff = _mm_max_epu16(maxdiff, diff);
        }

        // horizontal max (shifting in zeros is fine for unsigned max)
        maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 8));
        maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 4));
        maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 2));
        int res = _mm_cvtsi128_si32(maxdiff) & 0xFFFF;

        // element 0 holds the suffix-min of everything already processed
        unsigned scalarmin = _mm_extract_epi16(min, 0);
        for (; i != 0; i--) {
            scalarmin = std::min(scalarmin, unsigned(ptr[i - 1]));
            res = std::max(res, int(ptr[i - 1] - scalarmin));
        }

        return res != 0 ? res : -1;
    }

We could replace the scalar cleanup with a final unaligned vector, if we handle the overlap with the last full vector's min.

Peter Cordes
  • The way I did that "link with suffix" step (which got broken by switching to `min`) was a bad approach anyway; it really should be a broadcast, and then put in *after* the in-block stuff, to make the loop-carried dependency chain shorter – harold Sep 25 '18 at 06:10
  • @harold: oh good point! This doesn't even bottleneck on shuffles, but rather on the shuffle->max->... dep chain. I wanted to include some kind of code as an example of what intrinsics look like, for the OP to google more about. So I'm going to leave that not-great code block there with warning comments. If you get around to improving/fixing it, feel free to edit my answer, or better undelete your own. – Peter Cordes Sep 25 '18 at 07:03