x -= x>>7 is trivial to implement with SSE2, using a constant shift count for efficiency. This compiles to 2 instructions if AVX is available; otherwise a movdqa is needed to copy v before the destructive right-shift.
#include <emmintrin.h>  // SSE2

__m128i downscale(__m128i v){
    __m128i dec = _mm_srai_epi16(v, 7);   // v >> 7 (arithmetic) in each 16-bit element
    return _mm_sub_epi16(v, dec);         // v - (v >> 7)
}
GCC even auto-vectorizes it (Godbolt).
void foo(short *__restrict a) {
    for (int i=0 ; i<10240 ; i++) {
        a[i] -= a[i]>>7;  // inner loop uses the same psraw / psubw
    }
}
Unlike float, fixed-point has constant absolute precision over the full range, not constant relative precision. So for small positive numbers, v>>7 will be zero and your decrement will stall. (Negative inputs underflow to -1, because arithmetic right shift rounds towards -infinity.)
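To make the stall concrete, here is a minimal scalar sketch of one 16-bit element. The helper names downscale_scalar and downscale_fixed_point are made up for illustration, and the code assumes the compiler implements >> on negative signed values as an arithmetic shift (GCC and Clang do; ISO C leaves it implementation-defined).

```c
#include <stdint.h>

/* Scalar model of one 16-bit element of x -= x>>7.
 * Assumption: >> on a negative signed value is an arithmetic shift. */
static int16_t downscale_scalar(int16_t x) {
    return (int16_t)(x - (x >> 7));
}

/* Hypothetical helper: repeat the step until the value stops changing
 * and return the value it got stuck at. */
static int16_t downscale_fixed_point(int16_t x) {
    for (;;) {
        int16_t next = downscale_scalar(x);
        if (next == x)
            return x;
        x = next;
    }
}
```

Under those assumptions, a positive start such as 1000 gets stuck at 127 (the largest value for which x>>7 is 0), while a negative start such as -1000 keeps creeping up by at least 1 per step until it reaches 0.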
If small inputs where the shift can underflow to 0 are possible, you might want to OR with _mm_set1_epi16(1) to make sure the decrement is non-zero. This has a negligible effect on large-ish inputs. However, that will eventually make a downscale chain go from 0 to -1. (And then back up to 0, because -1 | 1 == -1 in 2's complement.)
__m128i downscale_nonzero(__m128i v){
    __m128i dec = _mm_srai_epi16(v, 7);
    dec = _mm_or_si128(dec, _mm_set1_epi16(1));  // force the low bit so dec is never 0
    return _mm_sub_epi16(v, dec);
}
If starting negative, the sequence would be -large, logarithmic until -128, linear until -4, -3, -2, -1, 0, -1, 0, -1, ...
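A small sketch to see that tail of the sequence for yourself. It applies the same three intrinsics to a single broadcast value and reads back element 0; downscale_nonzero_lane and downscale_chain are hypothetical helper names, not part of any library.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Same operation as downscale_nonzero, applied to one value by
 * broadcasting it and reading back element 0. */
static int16_t downscale_nonzero_lane(int16_t x) {
    __m128i v   = _mm_set1_epi16(x);
    __m128i dec = _mm_or_si128(_mm_srai_epi16(v, 7), _mm_set1_epi16(1));
    __m128i r   = _mm_sub_epi16(v, dec);
    return (int16_t)_mm_cvtsi128_si32(r);   /* low 16 bits = element 0 */
}

/* Hypothetical helper: record the first n values of the chain in out[]. */
static void downscale_chain(int16_t start, int16_t *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = start;
        start = downscale_nonzero_lane(start);
    }
}
```

Starting a chain of 8 values at -4 should give -4, -3, -2, -1, 0, -1, 0, -1. Note that the OR also bumps even decrements on larger inputs (e.g. 300>>7 is 2, which becomes 3), which is the "negligible effect" mentioned above.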
Your code got all-zeros because _mm_sra_epi16 uses the low 64 bits of the 2nd source vector as a 64-bit shift count that applies to all elements. Read the manual. So you shifted all the bits out of each 16-bit element.
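As a rough illustration of the difference (function names here are invented for the example), compare passing the count in the low element via _mm_cvtsi32_si128 with passing a broadcast count:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Shift every 16-bit element right by the same runtime count.
 * _mm_cvtsi32_si128 puts the count in the low 32 bits and zeroes the rest,
 * so the low 64 bits of the count vector hold exactly `count`. */
static __m128i sra_all_lanes(__m128i v, int count) {
    return _mm_sra_epi16(v, _mm_cvtsi32_si128(count));
}

/* What NOT to do: a broadcast count. The low 64 bits are then
 * 0x0007000700070007, far above 15, so every element is filled with
 * copies of its sign bit (0 for non-negative inputs, -1 for negative). */
static __m128i sra_broadcast_count_bug(__m128i v) {
    return _mm_sra_epi16(v, _mm_set1_epi16(7));
}

/* Hypothetical helper for inspecting element 0. */
static int16_t lane0(__m128i v) {
    return (int16_t)_mm_cvtsi128_si32(v);
}
```

With all elements set to 1000, the first version gives 7 per element and the second gives 0, which matches the all-zeros symptom for non-negative data.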
It's not idiotic, but per-element shift counts require AVX2 (for 32/64-bit elements, with arithmetic right shift only for 32-bit) or AVX512: AVX512BW (plus AVX512VL for this 128-bit version) for _mm_srav_epi16, and AVX512F for 64-bit arithmetic right shifts. A per-element count would make sense for the way you're trying to use it. (But the shift count is unsigned, so -1 is also going to shift out all the bits.)
Indeed, that instruction should be named _mm_sra1_epi16()
Yup, that would make sense. But remember that when these were named, AVX2 _mm_srav_* didn't exist yet. Also, that specific name would not be ideal because 1 and i are not the most visually distinct. (i for immediate, for the psraw xmm1, imm8 form instead of the psraw xmm1, xmm2/m128 form of the asm instruction: http://felixcloutier.com/x86/PSRAW:PSRAD:PSRAQ.html).
The other way it makes sense is that the MMX/SSE2 asm instruction has two forms: immediate (with the same count for all elements, of course), and vector. Instead of forcing you to broadcast the count to all elements, the vector version takes the scalar count in the bottom of a vector register. I think the intended use-case is after a movd xmm0, eax or something.
If you need per-element-variable shift counts without AVX512, see various Q&As about emulating it, e.g. Shifting 4 integers right by different values SIMD.
Some of the workarounds use multiplies by powers of 2 for variable left-shift, and then a right shift to put the data where needed. (But you need to somehow get the 1<<n SIMD vector prepared, so this works best if the same set of counts is reused for many vectors, or especially if it's a compile-time constant.)
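A minimal sketch of the multiply trick for 16-bit elements, assuming counts in the range 0..15. The names pow2_epi16 and sll_var_epi16 are hypothetical, and building the 1<<n vector with a scalar loop is only reasonable when it can be reused for many data vectors.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Build a vector whose element i is 1 << counts[i].
 * Assumption: every count is in 0..15. */
static __m128i pow2_epi16(const uint8_t counts[8]) {
    uint16_t p[8];
    for (int i = 0; i < 8; i++)
        p[i] = (uint16_t)(1u << counts[i]);
    return _mm_loadu_si128((const __m128i *)p);
}

/* Per-element left shift: the low 16 bits of v * 2^n are v << n. */
static __m128i sll_var_epi16(__m128i v, __m128i pow2) {
    return _mm_mullo_epi16(v, pow2);
}
```

Usage would be to call pow2_epi16 once outside the loop and sll_var_epi16 on each data vector; a following constant right shift can then move the bits where they are needed.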
With 16-bit elements, you can use just one _mm_mulhi_epi16 to do runtime-variable right shift counts with no precision loss or range limits. mulhi(x, y) is exactly like (x*(int)y) >> 16, so you can use y=1<<14 to right shift by 16-14 = 2 in that element.
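Here is a hedged sketch of that idea. The helper names are invented, and in this signed version the counts are assumed to be in 2..16, because the multiplier for a shift by 1 would be 1<<15, which does not fit in a signed 16-bit element.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Build per-element multipliers 1 << (16 - n) for right-shift counts n.
 * Assumption: every count is in 2..16, so the multiplier is 1..16384. */
static __m128i mulhi_shift_multipliers(const uint8_t counts[8]) {
    int16_t m[8];
    for (int i = 0; i < 8; i++)
        m[i] = (int16_t)(1 << (16 - counts[i]));
    return _mm_loadu_si128((const __m128i *)m);
}

/* Per-element arithmetic right shift:
 * high half of x * 2^(16-n) is floor(x / 2^n), i.e. x >> n. */
static __m128i sra_var_epi16(__m128i v, __m128i mult) {
    return _mm_mulhi_epi16(v, mult);
}
```

Because the high half of the signed product rounds towards -infinity, negative inputs come out the same as with psraw. For unsigned data, _mm_mulhi_epu16 with the same multipliers would be the analogous logical shift.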