
There are many decay-like physical processes (for example, body friction or charge leakage) that are usually modelled with iterative updates like x' = x * 0.99, which is very easy to write in floating-point arithmetic.

However, I need to do this in 16-bit "8.8" signed fixed point, in SSE. For an efficient implementation on a typical ALU, the formula above can be rewritten as x = x - x/128, or x = x - (x>>7), where >> is an "arithmetic", sign-extending right shift.
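In scalar code, one step of this update is trivial (a minimal sketch; `decay_step` is a name used here only for illustration):

```cpp
#include <cstdint>

// One decay step in signed 8.8 fixed point: x * (127/128), i.e. ~x * 0.992.
// >> on a negative int16_t is an arithmetic (sign-extending) shift on
// mainstream compilers (and guaranteed since C++20).
int16_t decay_step(int16_t x) {
    return (int16_t)(x - (x >> 7));
}
```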

And I'm stuck here, because _mm_sra_epi16() produces totally counterintuitive behaviour, which is easily verifiable with the following example:

#include <cstdint>
#include <iostream>
#include <emmintrin.h>

using namespace std;

int main(int argc, char** argv) {
    cout << "required: ";
    for (int i = -1; i < 7; ++i) {
        cout << hex << (0x7fff >> i) << ", ";  // i == -1 is formally UB; prints 0 here
    }
    cout << endl;
    cout << "produced: ";
    __m128i a = _mm_set1_epi16(0x7fff);
    __m128i b = _mm_set_epi16(-1, 0, 1, 2, 3, 4, 5, 6);
    auto c = _mm_sra_epi16(a, b);
    for (auto i = 0; i < 8; ++i) {
        cout << hex << c.m128i_i16[i] << ", ";  // .m128i_i16 is MSVC-specific
    }
    cout << endl;
    return 0;
}

Output would be as follows:

required: 0, 7fff, 3fff, 1fff, fff, 7ff, 3ff, 1ff,
produced: 0, 0, 0, 0, 0, 0, 0, 0,

It only applies the first shift count to all elements, as if it were actually an _mm_sra1_epi16 function that was accidentally named sra and given a __m128i second argument for no reason. So this cannot be used as a per-element shift in SSE.

On the other hand, I've heard that a division algorithm is enormously complex to implement, which is presumably why _mm_div_epi16 is absent in SSE, so it also cannot be used. What to do, and how to implement/vectorize that popular "decay" technique?

xakepp35
  • `_mm_sra_epi16` uses the low 64 bits of the 2nd source vector as a 64-bit shift count that applies to all elements. It's not idiotic, but per-element shift counts require AVX2 (for 32/64-bit elements) or AVX512BW for [`_mm_srav_epi16`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512,SVML,Other&expand=1276,5146,5146,5048,5048&text=mm_srav) or 64-bit arithmetic right shifts, which would make sense for the way you're trying to use it. (But the shift count is unsigned, so `-1` is also going to shift out all the bits). – Peter Cordes Oct 03 '18 at 15:44
  • Is `_mm_sra_epi16()` "idiotic" because it's not a variable shift? It shifts all elements by the same amount, based on the lowest lane. If you want a variable shift with 16-bit width, you need AVX512. – Mysticial Oct 03 '18 at 15:44
  • @Mysticial Why doesn't `_mm_add_epi16()` add the same amount "based on the lowest lane", then? I think it's inadequate behaviour and a violation of the consistent interface, blowing up porting/vectorization. Indeed, that instruction should be named `_mm_sra1_epi16()`. – xakepp35 Oct 03 '18 at 15:55
  • @PeterCordes So do you think this task has no solution? – xakepp35 Oct 03 '18 at 15:57
  • Why do you think you need different shift counts for each element? – Peter Cordes Oct 03 '18 at 16:08

1 Answer


x -= x>>7 is trivial to implement with SSE2, using a constant shift count for efficiency. This compiles to 2 instructions if AVX is available, otherwise a movdqa is needed to copy v before a destructive right-shift.

__m128i downscale(__m128i v){
    __m128i dec = _mm_srai_epi16(v, 7);
    return _mm_sub_epi16(v, dec);
}

GCC even auto-vectorizes it (Godbolt).

void foo(short *__restrict a) {
    for (int i=0 ; i<10240 ; i++) {
        a[i] -= a[i]>>7;  // inner loop uses the same psraw / psubw
    }
}

Unlike float, fixed-point has constant absolute precision over the full range, not constant relative precision. So for small positive numbers, v>>7 will be zero and your decrement will stall. (Negative inputs underflow to -1, because arithmetic right shift rounds towards -infinity.)
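That stall is easy to see with a scalar model of the iteration (a minimal sketch; the helper name is invented here):

```cpp
#include <cstdint>

// Iterate x -= x>>7 until it stops changing. Positive inputs stall at
// their first value below 128 (x>>7 truncates to 0 there); negative
// inputs keep decrementing by at least 1 (x>>7 <= -1) and reach 0.
int16_t decay_until_stall(int16_t x) {
    for (;;) {
        int16_t next = (int16_t)(x - (x >> 7));
        if (next == x) return x;
        x = next;
    }
}
```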

For small inputs where the shift result can truncate to 0, you might want to OR with _mm_set1_epi16(1) to make sure the decrement is non-zero. This has negligible effect on large-ish inputs. However, it will eventually make a downscale chain go from 0 to -1. (And then back up to 0, because -1 | 1 == -1 in 2's complement).

__m128i downscale_nonzero(__m128i v){
    __m128i dec = _mm_srai_epi16(v, 7);
    dec = _mm_or_si128(dec, _mm_set1_epi16(1));
    return _mm_sub_epi16(v, dec);
}

If starting negative, the sequence would be -large, logarithmic until -128, linear until -4, -3, -2, -1, 0, -1, 0, -1, ...
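That tail behaviour can be checked with a scalar model of one downscale_nonzero step (a minimal sketch; the function name is invented here):

```cpp
#include <cstdint>

// One scalar step of downscale_nonzero: x - ((x>>7) | 1).
// The OR forces the decrement to be non-zero (and keeps it -1 for
// small negatives, since -1 | 1 == -1), producing the 0 <-> -1 tail.
int16_t downscale_nonzero_step(int16_t x) {
    int16_t dec = (int16_t)((x >> 7) | 1);
    return (int16_t)(x - dec);
}
```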


Your code got all-zeros because _mm_sra_epi16 uses the low 64 bits of the 2nd source vector as a 64-bit shift count that applies to all elements. Read the manual. So you shifted all the bits out of each 16-bit element.

It's not idiotic, but per-element shift counts require AVX2 (for 32/64-bit elements) or AVX512BW for _mm_srav_epi16 or 64-bit arithmetic right shifts, which would make sense for the way you're trying to use it. (But the shift count is unsigned, so -1 is also going to shift out all the bits).

Indeed, that instruction should be named _mm_sra1_epi16()

Yup, that would make sense. But remember that when these were named, AVX2 _mm_srav_* didn't exist yet. Also, that specific name would not be ideal because 1 and i are not the most visually distinct. (i is for immediate, for the psraw xmm1, imm8 form instead of the psraw xmm1, xmm2/m128 form of the asm instruction: http://felixcloutier.com/x86/PSRAW:PSRAD:PSRAQ.html).

The other way it makes sense is that the MMX/SSE2 asm instruction has two forms: immediate (with the same count for all elements, of course), and vector. Instead of forcing you to broadcast the count to all elements, the vector version takes the scalar count in the bottom of a vector register. I think the intended use-case is after a movd xmm0, eax or something.
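So a runtime-variable but uniform shift is still cheap; a minimal sketch of that use-case (wrapper names invented here):

```cpp
#include <emmintrin.h>
#include <cstdint>

// The vector-count form of psraw: one scalar count in the low 64 bits,
// applied to every element. _mm_cvtsi32_si128 zero-extends the count
// into a vector register (the movd xmm0, eax use-case mentioned above).
__m128i sra16_all(__m128i v, int count) {
    return _mm_sra_epi16(v, _mm_cvtsi32_si128(count));
}

// Read lane 0 as a signed 16-bit value, for checking.
int16_t sra16_lane0(__m128i v) {
    return (int16_t)_mm_extract_epi16(v, 0);
}
```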


If you need per-element-variable shift counts without AVX512, see various Q&As about emulating it, e.g. Shifting 4 integers right by different values SIMD.

Some of the workarounds use multiplies by powers of 2 for variable left-shift, and then a right shift to put the data where needed. (But you need to somehow get the 1<<n SIMD vector prepared, so this works if the same set of counts is reused for many vectors, or especially if it's a compile-time constant).

With 16-bit elements, you can use just one _mm_mulhi_epi16 to do runtime-variable right shift counts with no precision loss or range limits. mulhi(x, y) is exactly like (x*(int)y) >> 16, so you can use y=1<<14 to right shift by 16-14 = 2 in that element.
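A minimal sketch of that trick (function names invented here; note the factor must stay a positive int16_t, so it covers shift counts from 2 to 15):

```cpp
#include <emmintrin.h>
#include <cstdint>

// Emulate per-element arithmetic right shifts with one pmulhw:
// _mm_mulhi_epi16(x, f) computes (x * f) >> 16 per lane, so a factor
// f = 1 << (16 - n) in a lane right-shifts that lane by n.
// (n == 1 would need f = 1 << 15, which overflows a signed 16-bit factor.)
__m128i srav16_via_mulhi(__m128i x, __m128i pow2_factors) {
    return _mm_mulhi_epi16(x, pow2_factors);
}

// Read lane 0 as a signed 16-bit value, for checking.
int16_t mulhi_lane0(__m128i v) {
    return (int16_t)_mm_extract_epi16(v, 0);
}
```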

Peter Cordes
  • I got you. I just saw the task as `x -= x >> y`, which cannot be vectorized efficiently prior to Haswell. Glad that they added the instruction, but I was surprised that a generic `sra` was not added alongside the basic `add`, `mul`, `and`, `or` instructions. What was added is `sra1`, which is confusingly misnamed as `sra` and has far fewer use cases. – xakepp35 Oct 03 '18 at 16:46
  • And I thought that I was doing it wrong and had missed another decay algorithm that is used in such circumstances. As for the bits, it does not need to converge to zero, as x *= .99 should never reach zero. – xakepp35 Oct 03 '18 at 16:49
  • 1
    @xakepp35 You're not the only one who has noticed that the SIMD instructions are inconsistent and missing a lot of cases. Most of this has been fixed in AVX512. Late, but at least somebody in Intel was paying attention. – Mysticial Oct 03 '18 at 16:51
  • 1
    AFAICT now, the majority of the "obvious but missing" functionality in AVX512 are the ones that either nobody uses or are hard to implement in hardware. So they finally got it right. – Mysticial Oct 03 '18 at 16:52
  • @Mysticial Okay. AVX is fine, we can use it. Great deal, they just messed up the names, and I was very surprised that the immediate-mode `srai` behaves exactly like `sra`. But here is a much more interesting puzzle: let's take some Ivy Bridge CPU, a 3770K, the top model just before Haswell. Oh, I even have one right here! How would you implement the task in the question title under such circumstances? – xakepp35 Oct 03 '18 at 17:07
  • @xakepp35: What are you talking about? My answer only uses SSE2, so it works on any x86-64, and 32-bit CPUs back as far as Pentium 4. You don't need `srav`, just `srai`. It's *only* variable-shifts that need AVX2. – Peter Cordes Oct 03 '18 at 17:09
  • 1
    I solved! `_mm_sub_epi16(x, _mm_srai_epi16(x, a))` is **almost** equal to `_mm_slli_epi16(_mm_mulhi_epi16(x, b), 1)`, where `b = (1<<15) - (1<<(15-a))`.. And almost - is because of `slli`. While loosing bit of precision, second case allows for finer parameter tuning. – xakepp35 Oct 03 '18 at 19:42