What is the best way to use SIMD / assembler to subtract 2 uint16s with absolute value (max difference) and add (+=) the result to a float?

Similar to this C'ish example

c0 += fabs((float)a0 - (float)b0);  // C is Float accumulator, a+b pixels

where a and b are unsigned 16 bit words and c is a float. Only 1 word -> float conversion rather than 3.

The application processes raw, 16-bit unsigned int image data, operating on as many full RGB pixels as possible at once.

Possibly using AVX2/SSE4.2 on a Skylake Xeon E3-1275 v5?


5 minute comment limit?? Can't save or re-edit???

>> Are you sure you need float?

Uint16 can't accumulate more than 1 subtraction. I want to do a neighborhood contrast calc, so I need to sum at least 8 differences. There are (2D+1)^2 - 1 neighbors in a neighborhood of depth D. I also want to be able to square the difference, which is where even a uint32 can be too small. I think the floats look smoother, too.

Here is a bit more background on what is already working and how I want to improve it.

To clarify, my current C code calculates the per-channel differences between a fixed home pixel and 8 or more neighbors. It has a 5-deep nested loop structure:

Loops 1 and 2: Y-rows, then X-cols, for each pixel in the image (36 million)
Loop 3: the channels, R, G and B
Loops 4 and 5: the rows and columns of the neighborhood

for each HOME pixel:
    clear the R, G and B accumulators
    for each neighbor:
        add abs(home_red - nabr_red) to red_float_accumulator
        same for green and blue
    copy accumulated values to main memory
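Roughly, in C, the working scalar version looks like this (names and the packed-RGB memory layout are simplified for illustration):

```c
#include <stdint.h>

/* Scalar neighborhood-contrast sketch. img is packed RGB, 3 uint16
 * per pixel; acc is a float image with the same XY resolution, also
 * 3 floats per pixel. D is the neighborhood depth, so each home
 * pixel is compared against (2*D+1)^2 - 1 neighbors. */
static void contrast_scalar(const uint16_t *img, float *acc,
                            int width, int height, int D)
{
    for (int y = D; y < height - D; y++) {             /* loop 1: rows */
        for (int x = D; x < width - D; x++) {          /* loop 2: cols */
            for (int c = 0; c < 3; c++) {              /* loop 3: R, G, B */
                float sum = 0.0f;                      /* clear accumulator */
                uint16_t home = img[(y * width + x) * 3 + c];
                for (int ny = -D; ny <= D; ny++) {     /* loop 4 */
                    for (int nx = -D; nx <= D; nx++) { /* loop 5 */
                        if (ny == 0 && nx == 0)
                            continue;                  /* skip the home pixel */
                        uint16_t nabr = img[((y + ny) * width + (x + nx)) * 3 + c];
                        /* abs(home - nabr), accumulated into a float */
                        sum += (float)(home > nabr ? home - nabr : nabr - home);
                    }
                }
                acc[(y * width + x) * 3 + c] = sum;    /* copy to main memory */
            }
        }
    }
}
```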

My next step was to move the channels to level 5 and do all 3 subtractions, R, G and B, simultaneously with SIMD. With 48 bits/pixel and 128 bits available per XMM (SSE) register, 2 pixels can be done at once instead of just 1.

With the 256-bit registers AVX2 provides on the Skylake Xeon, 5 could be done. I am looking for a good strategy to balance complexity with performance, and to learn more about these vector operations.

I need R, G and B accumulators for each "home" pixel. Then move the RGB into a "float image" with the same XY resolution as the uint16/channel RAW, RGB file. Do the same contrast calculation for each pixel.

  • Are you sure you need float? Keeping the whole thing in 16b fixed-point would let process twice as many elements at once (but higher precision for intermediate values is often helpful, so prob. not). 32b fixed-point is a possibility. That might not be super-helpful though, because some of the SSE instructions with multiple-step complex operations (like multiply and then right shift with rounding) are designed for working with 8bit-per-component data. – Peter Cordes Dec 24 '15 at 09:58
  • 1
    Can you please rewrite your "C'ish" example to a loop with appropriate arrays? What you wrote is ambiguous on whether `c0` is accumulator for all the values (given the `+=` operator) or wheter it is an array member and you need the diff per pixel... – Zdeněk Jelínek Dec 24 '15 at 10:15
  • Good question, is this a SAD calculation? I was assuming not. (and have an answer nearly done). – Peter Cordes Dec 24 '15 at 10:16
  • >> Are you sure you need float? – PatrickPhotog Dec 25 '15 at 01:58
  • Comments become uneditable after 5 mins. You can just delete a bad comment and post again. But actually, edited into the question is the right place for a clarification like that. – Peter Cordes Dec 25 '15 at 02:18
  • Why the heck didn't you say earlier that you wanted the differences summed in the neighbourhood? That makes a huge difference to how you approach the problem. This means you need some "horizontal" summing between elements in the same vector, not just "vertical" between elements of the same position in different vectors. For squared differences, you can still work in fixed point, by squaring and keeping only the high 16 or 24b of the product, accumulating a 32b int sum. That might be enough precision, or maybe not. Worth trying, though. – Peter Cordes Dec 25 '15 at 02:19
  • Ok, now that's a much more useful edit. You can probably get some real help with your *actual* goal now. – Peter Cordes Dec 25 '15 at 02:59
  • Is there any chance your program could work with planar RGB data? If you're doing a lot of these neighbourhood calculations, it might be best to scatter components out to R, G, and B planes once when the image is read, rather than having to deal with 48bit packed pixels. I'm not sure how bad the shuffling would get for that. Maybe unaligned loads from the start of a pixel will still let you make use of all 4, 8, or 16 elements (of float or 32bit int), with SSE4/AVX2/AVX512. Note that AVX2 is integer ops on 256b vectors. Do you want to sum differences across color components, or no? – Peter Cordes Dec 25 '15 at 03:04
  • BTW, shuffling data around to get the elements from one place in a vector together and lined up with the elements you want from another vector is the hard part of efficient vector code for stuff like this. – Peter Cordes Dec 25 '15 at 03:14

2 Answers


From Justin W's answer on a question about unsigned integer absolute value: do saturating subtraction twice. One result will be zero, the other will be the absolute value. Combine them with a logical OR. Doing it this way, before unpacking 16b ints to 32b ints or floats, is cheapest.
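Here's the trick in isolation, as a minimal SSE2 sketch (the helper name is mine; the AVX2 code below applies the same idea on 256b vectors). Each saturating subtraction clamps to zero when it would go negative, so OR-ing the two results gives the absolute difference with no sign handling at all:

```c
#include <emmintrin.h>   /* SSE2, baseline on x86-64 */
#include <stdint.h>

/* Absolute difference of eight uint16 lanes via the
 * sub-both-ways-with-saturation trick. */
static inline void absdiff_u16x8(const uint16_t *a, const uint16_t *b,
                                 uint16_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i ab = _mm_subs_epu16(va, vb);   /* max(a-b, 0) */
    __m128i ba = _mm_subs_epu16(vb, va);   /* max(b-a, 0) */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(ab, ba));
}
```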

We definitely want to subtract before unpacking from word to dword, so there's only one value to unpack. Doing a single non-saturating subtraction and then sorting it out later (e.g. by masking the sign bit to zero after converting to FP) wouldn't work. There's a range problem: a signed int16_t can't hold the full range of the difference between two uint16_t values. UINT16_MAX would wrap around and look like a difference of -1 -> abs of 1. Also, it would have the downside of requiring a sign-extending unpack.
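To see the range problem concretely, here's a hypothetical scalar version that does a single non-saturating 16b subtraction and fixes up the sign afterwards, which is exactly what we want to avoid:

```c
#include <stdint.h>

/* Demonstrates the range problem: the true difference 0 - 65535 =
 * -65535 doesn't fit in int16_t, so a 16-bit subtraction wraps
 * modulo 2^16 and the absolute value comes out as 1, not 65535. */
static inline uint16_t wrong_absdiff(uint16_t a, uint16_t b)
{
    int16_t wrapped = (int16_t)(a - b);   /* wraps modulo 2^16 */
    return (uint16_t)(wrapped < 0 ? -wrapped : wrapped);
}
```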

As usual with AVX2, unpacking to a different vector width is the major headache because of the in-lane behaviour of some instructions, and the cross-lane behaviour of others.

vpunpcklwd unpacks within each 128b lane, matching up with vpackusdw. There's no single-instruction inverse for vpmovzxwd ymm, xmm, so I'll use punpck instructions on the assumption that you can have your floats arranged this way. (And with PMOVZX, you can't go directly from the upper half. You'd have to vextracti128 / vpmovzx.)

#include <immintrin.h>

// untested
__m256 add_pixdiff(__m256 c[2], __m256i a, __m256i b)
{
    __m256i ab = _mm256_subs_epu16(a, b);  // 0        or abs(a-b)
    __m256i ba = _mm256_subs_epu16(b, a);  // abs(a-b) or 0
    __m256i abs_ab_diffs = _mm256_or_si256(ab, ba);

    __m256i lo_uints = _mm256_unpacklo_epi16(abs_ab_diffs, _mm256_setzero_si256());
    __m256i hi_uints = _mm256_unpackhi_epi16(abs_ab_diffs, _mm256_setzero_si256());

    __m256 lo_floats = _mm256_cvtepi32_ps(lo_uints);
    __m256 hi_floats = _mm256_cvtepi32_ps(hi_uints);

    // use load and store intrinsics if the data might not be aligned.
    c[0] = _mm256_add_ps(c[0], lo_floats);
    c[1] = _mm256_add_ps(c[1], hi_floats);
    return c[0];
}

It compiles exactly like you'd expect, on godbolt. For use in a loop, it's probably best to manually inline it. You don't want a stupid compiler actually using an array in memory to do call-by-reference for two float vectors. I just wrapped it in a function to keep it simple and hide the load/stores.

Note that one vector of uint16 input produces two vectors of float results. We could operate on integer vectors 128b at a time, but doing 256b at a time means we get 2 results for 3 insns (not counting the loads) + 2 unpacks, rather than 6 insns + 2 pmovzx. There's a decent amount of parallelism here: the two subtractions can happen in parallel, and there are two dependency chains of unpacks and converts. (Skylake has only one shuffle port, though, so you can't get two unpacks at once. It's an in-lane instruction with a latency of 1c, vs. vpmovzx ymm, xmm having a latency of 3, like other lane-crossing instructions. The parallelism would matter there, if you needed to keep the floats in the same order as the ints, rather than just re-packing back to the same ordering in the end.)

Vector FP add latency increased to 4 cycles on Skylake (from 3 on previous Intel designs), but throughput increased to two per clock. (It runs them on the FMA units. That's right, Skylake's FMA is down to 4c latency.)


I'm assuming you didn't actually want a SAD (one accumulator for all differences), since you wrote c0, not c in your scalar form.

If you do want a SAD (sum of absolute differences), that's much easier, and you should accumulate in the integer domain. Use the same sub both ways with unsigned saturation trick, OR them together, unpack to 32bit int, then add to a vector accumulator. Do the horizontal sum at the end.
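A sketch of that integer building block (SSE2 shown for brevity; the function names are mine, and the AVX2 version is analogous):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Absolute differences of uint16 via the saturating-subtract trick,
 * zero-extended to 32-bit and added into a running vector
 * accumulator. Only vertical adds in the loop; the horizontal sum
 * happens once, at the very end. */
static inline __m128i accumulate_absdiffs(__m128i acc, __m128i a, __m128i b)
{
    __m128i ab   = _mm_subs_epu16(a, b);
    __m128i ba   = _mm_subs_epu16(b, a);
    __m128i diff = _mm_or_si128(ab, ba);           /* |a - b| per uint16 */
    __m128i zero = _mm_setzero_si128();
    acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(diff, zero));
    acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(diff, zero));
    return acc;
}

/* Horizontal sum of four 32-bit lanes, for use after the loop. */
static inline uint32_t hsum_epi32(__m128i v)
{
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```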

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Peter, Your solution appears to do the trick, but I will need to do a good deal more analysis before I can shoehorn this into my C program. The concept is so trivial compared to the relative complexity of the Pixel Processing code here and elsewhere I have been trying to wrap my head around: Figure out which of 2 UShorts is bigger, subtract the little one, add the result to a float. In the C prog, I promote the first uint16 to a float to avoid dealing with the sign, fabs(the_difference) and dump it on an accumulator float. Like flipping a pancake! Thx 1E9 && HaP^2y Holidays, Pat – PatrickPhotog Dec 25 '15 at 04:49
  • @PatrickPhotog: Like I commented, the trick will be getting the data shuffled correctly to set up for the subtract. This part is just one tiny component of the solution. But the basic sub both ways with saturation and combine will probably be the best building block for your algorithm. For sum of squared differences, `PMADDWD` (`_mm256_madd_epi16`) would be ideal if your differences were signed. If you right-shifted your differences by one, throwing away a low bit, your unsigned values would fit in a signed 16b int. – Peter Cordes Dec 25 '15 at 04:59
  • Also keep in mind that the more you can do locally in one pass, with data hot in cache (or better, live in registers), the fewer cache misses you'll have. This is called cache blocking. – Peter Cordes Dec 25 '15 at 05:02

2nd answer for vectorizing your actual problem of nested loops doing neighbour contrast calculations, not just the most basic building block that your first question was about.

The trick is going to be picking the right loop in your 5-nested loops to vectorize.

// probably convert the image to planar outside the loop
// rather than dealing with packed components on the fly,
// since you read the same src pixel many times
// (comparing it to a different dst pixel each time).

for (dst_row) {
    for (dst_col) {
        for (r, g, b) {  
            // if you convert to planar, make this the outer-most loop for cache reasons
            for (src_row) {
                for (src_col) {
                    accumulate stuff;
                }
            }
        }
        result[drow][dcol].r = sum over some range of red   src rows X cols;
        result[drow][dcol].g = sum over some range of green src rows X cols;
        result[drow][dcol].b = sum over some range of blue  src rows X cols;
    }
}

You might think you should just vectorize the accumulate stuff for a single pixel or a single component of a single pixel. That would require doing a horizontal sum (to reduce all the elements of a vector accumulator to a single scalar sum).

Better would probably be to accumulate results for 16 destination columns at once (AVX/AVX2: two 256b vectors of 8 packed single-precision floats, since you get two vectors worth of results from unpacking one vector of 16b data). FP is probably not much worse than 32bit integer, and maybe better if you're using FMA instructions.

Every iteration of the inner two loops, you're accumulating results for 16 different src pixels. (or src components, if you don't convert to planar outside the loop.) This creates a lot of locality in the inner loops. You're sliding a 256b window along 16b at a time (or 48b at a time for packed instead of planar), rather than 256b at a time, so the data is reused a lot.

Best of all, you never need to do horizontal operations at all. In each iteration of the for (dst_col) loop, your result vectors end up holding 16 elements of complete results, ready to be stored right into result[drow][dcol..dcol+16]. (Although it's at this point where you might use vpermd to shuffle across lanes and straighten out the ordering that vpunpcklwd gives, compared to vpmovzxwd. If you care about having your float array stored in strictly proper order.)

Unpacking to planar is going to make the corner cases a lot easier, and storing your float contrast results in a planar struct-of-arrays format makes it easier to vectorize something like summing the r, g, and b contrast for every pixel. addps is much faster than haddps, so you want 8 r values in one vector, and 8 g values in another vector, rather than two vectors of {rgb, rgb, x} with two wasted elements or something.
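For example, summing the per-channel contrast for four pixels at once from planar float data is just a pair of vertical adds (hypothetical sketch, 128b SSE for brevity):

```c
#include <xmmintrin.h>   /* SSE */

/* With planar (struct-of-arrays) float results, r, g and b each point
 * at four contiguous same-channel values, so summing across channels
 * is plain addps -- no horizontal haddps needed. */
static inline void sum_rgb_planar(const float *r, const float *g,
                                  const float *b, float *out)
{
    __m128 vr = _mm_loadu_ps(r);
    __m128 vg = _mm_loadu_ps(g);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(vr, vg), vb));
}
```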

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847