13

I deal with image processing. I need to divide a 16-bit integer SSE vector by 255.

I can't use a shift like _mm_srli_epi16(), because 255 is not a power of 2.

I know, of course, that I could convert the integers to float, divide, and then convert back to integer.

But maybe somebody knows another solution...

ErmIg
  • 3,980
  • 1
  • 27
  • 40
Claudio
  • 133
  • 1
  • 7
  • 1
    Does [this](http://stackoverflow.com/q/16822757/3959454) help? – Anton Savin Feb 09 '16 at 06:43
  • 1
    Typically you would divide by 256 (with rounding rather than truncation) - is there some reason why it has to be 255 and not 256 ? – Paul R Feb 09 '16 at 07:25
  • 1
    Maybe [this](http://stackoverflow.com/questions/31575833/fastest-method-of-vectorized-integer-division-by-non-constant-divisor) question is interesting for you too. When you have to deal with non-constant integer division in future, conversion to float is a fast option as well. – Youka Feb 09 '16 at 07:48
  • 1
    @Paul-R If I divide by 256, it causes the average brightness to decrease when this operation is performed many times. – Claudio Feb 09 '16 at 12:50
  • @Claudio: it's hard to say without seeing the rest of your code, but if you're working with the full 16 bit range and then scaling back down to 8 bits then the scale factor should be 256, otherwise if you take e.g. full scale uint16_t = 65535 and divide that by 255 you get 257. I guess you know what you're doing - I'm just puzzled... – Paul R Feb 09 '16 at 13:57
  • Updated my answer with an unsigned version which has better throughput than the accepted answer. I didn't realize before that you were using unsigned, but I think you are based on `srli` rather than `srai` – Peter Cordes Jul 31 '18 at 03:44

5 Answers

16

There is an integer approximation of division by 255:

inline int DivideBy255(int value)
{
    return (value + 1 + (value >> 8)) >> 8;
}
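
A quick brute-force check in plain C (a sketch, assuming <stdio.h>; this loop is not part of the formula itself) shows that for unsigned 16-bit inputs the approximation matches exact division everywhere except at 65535:

// prints only 65535: DivideBy255 gives 256 there, while exact division gives 257
for (int v = 0; v <= 65535; ++v)
    if (DivideBy255(v) != v / 255)
        printf("%d\n", v);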

With SSE2 it looks like this:

inline __m128i DivideI16By255(__m128i value)
{
    return _mm_srli_epi16(_mm_add_epi16(
        _mm_add_epi16(value, _mm_set1_epi16(1)), _mm_srli_epi16(value, 8)), 8);
}

For AVX2:

inline __m256i DivideI16By255(__m256i value)
{
    return _mm256_srli_epi16(_mm256_add_epi16(
        _mm256_add_epi16(value, _mm256_set1_epi16(1)), _mm256_srli_epi16(value, 8)), 8);
}

For Altivec (Power):

typedef __vector int16_t v128_s16;
typedef __vector uint16_t v128_u16; // vec_sr expects an unsigned vector for the shift amounts
const v128_s16 K16_0001 = {1, 1, 1, 1, 1, 1, 1, 1};
const v128_u16 K16_0008 = {8, 8, 8, 8, 8, 8, 8, 8};

inline v128_s16 DivideBy255(v128_s16 value)
{
    return vec_sr(vec_add(vec_add(value, K16_0001), vec_sr(value, K16_0008)), K16_0008);
}

For NEON (ARM):

inline int16x8_t DivideI16By255(int16x8_t value)
{
    return vshrq_n_s16(vaddq_s16(
        vaddq_s16(value, vdupq_n_s16(1)), vshrq_n_s16(value, 8)), 8);
}
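
As a usage sketch (names like Blend16, src16, dst16, alpha16 are hypothetical, not part of the code above), the SSE2 helper slots into an 8-bit alpha blend once the pixels are widened to 16 bits; all intermediate values stay within 0..65025, where the approximation is exact:

// (src*alpha + dst*(255 - alpha)) / 255 for eight 16-bit lanes holding 8-bit data
inline __m128i Blend16(__m128i src16, __m128i dst16, __m128i alpha16)
{
    __m128i inv = _mm_sub_epi16(_mm_set1_epi16(255), alpha16);
    __m128i sum = _mm_add_epi16(_mm_mullo_epi16(src16, alpha16),
                                _mm_mullo_epi16(dst16, inv));
    return DivideI16By255(sum);
}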
ErmIg
  • 3,980
  • 1
  • 27
  • 40
  • 1
    This is wrong for `value == 65535` and for all negative numbers (so works neither for signed nor unsigned 16-bit integers) – Anton Savin Feb 09 '16 at 06:50
  • 1
    I know that it works perfectly for alpha blending. But I don't rule out errors in other cases. – ErmIg Feb 09 '16 at 06:55
  • @AntonSavin: Accurate unsigned division by 255 is *more* efficient than this using `pmulhuw` for better throughput, although worse latency. My answer from a couple years ago was only looking at signed, not unsigned. (And for some reason, nobody seems to be distinguishing between the two, sometimes using `int` and sometimes using SIMD logical right shift.) – Peter Cordes Jul 31 '18 at 03:41
10

If you want an exactly correct result for all cases, follow the advice from Marc Glisse's comment on the question Anton linked: SSE integer division?

Use GNU C native vector syntax to express division of a vector by your given scalar, and see what it does on the Godbolt compiler explorer:

Unsigned division is cheap:

typedef unsigned short vec_u16 __attribute__((vector_size(16)));
vec_u16 divu255(vec_u16 x){ return x/255; }  // unsigned division

#gcc5.5 -O3 -march=haswell
divu255:
    vpmulhuw        xmm0, xmm0, XMMWORD PTR .LC3[rip]  # _mm_set1_epi16(0x8081)
    vpsrlw          xmm0, xmm0, 7
    ret

Intrinsics version:

 // UNSIGNED division with intrinsics
__m128i div255_epu16(__m128i x) {
    __m128i mulhi = _mm_mulhi_epu16(x, _mm_set1_epi16(0x8081));
    return _mm_srli_epi16(mulhi, 7);
}

At only 2 uops, this has better throughput (but worse latency) than @ermlg's answer, if you're bottlenecked on front-end throughput, or port 0 throughput on Intel CPUs. (As always, it depends on the surrounding code when you use this as part of a larger function.) http://agner.org/optimize/

Vector shift only runs on port 0 on Intel chips, so @ermlg's 2 shifts + 1 add bottlenecks on port 0. (Again depending on surrounding code). And it's 3 uops vs. 2 for this.

On Skylake, pmulhuw / pmulhw runs on ports 0 or 1, so it can run in parallel with a shift. (But on Broadwell and earlier, they run only on port 0, conflicting with shifts. So the only advantage on Intel pre-Skylake is fewer total uops for the front-end and for out-of-order execution to keep track of.) pmulhuw has 5 cycle latency on Intel, vs. 1 for shifts, but OoO exec can typically hide a few cycles more latency when you can save uops for more throughput.

Ryzen also only runs pmulhuw on its P0, but shifts on P2, so it's excellent for this.
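
For what it's worth, here is where the 0x8081 comes from (a side note / sketch, not spelled out in the compiler output): it is the fixed-point reciprocal ceil(2^23 / 255), and pmulhuw plus the extra shift by 7 together implement the full shift by 23. The first input where the rounding could go wrong is above 66000, safely outside the 16-bit range.

unsigned magic = ((1u << 23) + 254) / 255;   // = 0x8081 = 32897
// (x * 0x8081) >> 23 == x / 255 for every unsigned 16-bit x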

But signed integer division's rounding semantics don't match shifts

typedef short vec_s16 __attribute__((vector_size(16)));

vec_s16 div255(vec_s16 x){ return x/255; }  // signed division

    ; function arg x starts in xmm0
    vpmulhw xmm1, xmm0, XMMWORD PTR .LC3[rip]  ; a vector of set1(0x8081)
    vpaddw  xmm1, xmm1, xmm0
    vpsraw  xmm0, xmm0, 15       ; 0 or -1 according to the sign bit of x
    vpsraw  xmm1, xmm1, 7        ; shift the mulhi-and-add result
    vpsubw  xmm0, xmm1, xmm0     ; result += (x<0)

.LC3:
        .value  -32639
        .value  -32639
        ; repeated

At the risk of bloating the answer, here it is again with intrinsics:

// SIGNED division
__m128i div255_epi16(__m128i x) {
    __m128i tmp = _mm_mulhi_epi16(x, _mm_set1_epi16(0x8081));
    tmp = _mm_add_epi16(tmp, x);  // There's no integer FMA that's usable here
    x   = _mm_srai_epi16(x, 15);  // broadcast the sign bit
    tmp = _mm_srai_epi16(tmp, 7);
    return _mm_sub_epi16(tmp, x);
}
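
For reference, a scalar model of the same steps (a sketch, assuming <stdint.h> and arithmetic right shift of negative values, which the SIMD code gets explicitly from psraw):

static inline int16_t div255_scalar_s16(int16_t x) {
    int16_t hi  = (int16_t)(((int32_t)x * (int16_t)0x8081) >> 16); // what pmulhw computes
    int16_t tmp = (int16_t)(hi + x);
    return (int16_t)((tmp >> 7) - (x >> 15));  // x >> 15 is 0 or -1: adds 1 back for negative x
}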

In the godbolt output, note that gcc is smart enough to use the same 16B constant in memory for the set1 and for the one it generated itself for div255. AFAIK, this works like string-constant merging.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
3

GCC optimizes x/255, where x is an unsigned short, to DWORD(x * 0x8081) >> 0x17, which can be further simplified to HWORD(x * 0x8081) >> 7, and finally HWORD((x << 15) + (x << 7) + x) >> 7.

SIMD macros can look like this:

#define MMX_DIV255_U16(x) _mm_srli_pi16(_mm_mulhi_pu16(x, _mm_set1_pi16((short)0x8081)), 7)
#define SSE2_DIV255_U16(x) _mm_srli_epi16(_mm_mulhi_epu16(x, _mm_set1_epi16((short)0x8081)), 7)
#define AVX2_DIV255_U16(x) _mm256_srli_epi16(_mm256_mulhi_epu16(x, _mm256_set1_epi16((short)0x8081)), 7)
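
For example (a sketch, with the lane values chosen to hit the edge cases), the SSE2 macro divides eight packed unsigned 16-bit lanes at once:

__m128i v = _mm_setr_epi16(0, 255, 256, 510, 1000, 32767, (short)65280, (short)65535);
__m128i q = SSE2_DIV255_U16(v);   // lanes become 0, 1, 1, 2, 3, 128, 256, 257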
Youka
  • 2,646
  • 21
  • 33
3

Accurate version:

#define div_255_fast(x)    (((x) + (((x) + 257) >> 8)) >> 8)

When x is in the range [0, 65536], the error is 0. It is twice as fast as x/255:

http://quick-bench.com/t3Y2-b4isYIwnKwMaPQi3n9dmtQ

SIMD version:

// (x + ((x + 257) >> 8)) >> 8
static inline __m128i _mm_fast_div_255_epu16(__m128i x) {
    return _mm_srli_epi16(_mm_adds_epu16(x, 
        _mm_srli_epi16(_mm_adds_epu16(x, _mm_set1_epi16(0x0101)), 8)), 8);
}

For a positive x above 65535, here is another version:

static inline int32_t fast_div_255_any (int32_t n) {
    uint64_t M = (((uint64_t)1) << 40) / 255 + 1;  // "1/255" in 24.40 fixed point number
    return (M * n) >> 40;   // fixed point multiply: n * (1/255)
}

More expensive (it requires a 64-bit multiply), but still faster than a div instruction.
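
A quick sanity check of both helpers (a sketch, assuming <assert.h>):

assert(div_255_fast(65536) == 65536 / 255);          // both give 257
assert(fast_div_255_any(1000000) == 1000000 / 255);  // both give 3921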

skywind3000
  • 408
  • 5
  • 9
  • What's `_const_simd_mask_8x0101`? If that's a global variable, compilers typically do *worse* with that than with `_mm_set1_epi16(0x0101)`. – Peter Cordes May 29 '20 at 09:58
  • Also, I think you mean `[0, 65535]`, or `[0, 65536)`, because `65536` = 2^16 which doesn't fit in an epu16 – Peter Cordes May 29 '20 at 10:00
  • @PeterCordes, updated and [0, 65536] is for the c version. – skywind3000 May 29 '20 at 21:44
  • In plain C for scalar, you don't need special tricks. Compilers already know how to use a multiplicative inverse: [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935). Just use `unsigned` and you get efficient asm, or asm like your signed version that works for any `int32_t`: https://godbolt.org/z/cTzYzb. Your `fast_div_255_any` appears to handle the sign bit so it actually works for any x, not just positive, right? I guess it's a useful step in turning this into SIMD with `_mm_mul_epu32` (`pmuludq`) – Peter Cordes May 29 '20 at 22:04
  • @PeterCordes, div_255_fast is much cheaper than fast_div_255_any (which gcc may use). It is still faster than the compiler if your number is between 0 and 65536. – skywind3000 May 29 '20 at 22:30
  • I only meant to comment on your `fast_div_255_any` which handles any `int32_t`. But since you bring it up `div_255_fast` isn't *much* cheaper than normal `/255` for `uint32_t`, especially after inlining into a loop so it doesn't need so zero-extension is already done and the constant can be reused. https://godbolt.org/z/HiNzD8 shows the stand-alone versions. Modern x86 runs `imul r64, r64` as a single uop with 3 cycle latency, and it's just one `imul` and one `shr`, so a total critical path latency of 4 cycles. vs. also 4 cycles and 4 uops instead of 2 for your `div_255_fast`. – Peter Cordes May 29 '20 at 22:36
  • It's significantly faster if you're compiling for Silvermont or Atom, though, where 64-bit `imul` is slow. Or if you do something silly like use `int32_t x` for `x/255` and force the compiler to handle signed division. – Peter Cordes May 29 '20 at 22:38
  • @PeterCordes `div_255_fast` is twice as faster as `x/255`, see the benchmark http://quick-bench.com/t3Y2-b4isYIwnKwMaPQi3n9dmtQ – skywind3000 Jun 01 '20 at 03:50
  • I was talking about for scalar. Your benchmark auto-vectorized both, with inefficient shuffling around `pmuludq` for the `x/255` version instead of just keeping 2 vector accumulators. Sure, if that's what you're doing and you want to rely on auto-vectorization, then sure. I guess most of the use-cases for division by 255 are for pixel data and tend to be data-parallel, but your answer labeled them as scalar vs. SIMD, not as auto-vectorization vs. manual vectorization. – Peter Cordes Jun 01 '20 at 03:59
  • And if you want auto-vectorization, the version in my answer using `unsigned short` works nicely and should be even more efficient if your data is actually packed 16-bit, not 32-bit elements with values in the 0..65535 range. On some CPUs without as good hardware SIMD-integer multiply, your manually-vectorized version with saturating add might be good. – Peter Cordes Jun 01 '20 at 04:01
2

Out of curiosity (and if performance is an issue), here is the accuracy of using (val + offset) >> 8 as a substitute for (val / 255) for all 16-bit values up to 255*255 (for example when blending two 8-bit values using an 8-bit blend factor):

(avrg signed error / avrg abs error / max abs error)
offset    0:  0.49805 / 0.49805 / 1  (just shifting, no offset)
offset 0x7F:  0.00197 / 0.24806 / 1
offset 0x80: -0.00194 / 0.24806 / 1

All other offsets produce both larger signed and larger average errors. So if you can live with an average error of less than 0.25, then use offset+shifting for a small speed increase:

// approximate division by 255 for packs of 8 times 16bit values in vals_packed
__m128i offset = _mm_set1_epi16(0x80); // constant
__m128i vals_packed_offset = _mm_add_epi16( vals_packed, offset );
__m128i result_packed = _mm_srli_epi16( vals_packed_offset, 8 );
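
The table above can be reproduced with a brute-force loop like this (a sketch, assuming <stdio.h> and <stdlib.h>, and taking truncating integer division val / 255 as the reference):

void measure_offset(int offset) {
    const int n = 255 * 255 + 1;
    double sum_signed = 0, sum_abs = 0;
    int max_abs = 0;
    for (int val = 0; val < n; ++val) {
        int err = val / 255 - ((val + offset) >> 8);   // exact result minus approximation
        sum_signed += err;
        sum_abs += abs(err);
        if (abs(err) > max_abs) max_abs = abs(err);
    }
    printf("offset 0x%02X: %8.5f / %.5f / %d\n", offset, sum_signed / n, sum_abs / n, max_abs);
}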
Oliver Zendel
  • 2,695
  • 34
  • 29
  • With a saturating add, you could use this without disastrous results for values even closer to the wrap-around points. `_mm_adds_epi16` or [`epu16`](http://felixcloutier.com/x86/PADDUSB:PADDUSW.html) for signed/unsigned saturation. Are you sure you want to use `srli` (logical right shift) instead of arithmetic right shift, `srai`? Or are you doing this for unsigned? – Peter Cordes Jul 31 '18 at 00:35