2

I am optimizing bottleneck code:

int sum = ........
sum = (sum >> _bitShift);

if (sum > 32000)
    sum = 32000; //if we get an overflow, saturate output
else if (sum < -32000)
    sum = -32000; //if we get an underflow, saturate output

short result = static_cast<short>(sum);

I would like to write the saturation condition as one "if condition" or even better with no "if condition" to make this code faster. I don't need saturation exactly at value 32000, any similar value like 32768 is acceptable.

According this page, there is a saturation instruction in ARM. Is there anything similar in x86/x64?

Tomas Kubes
  • 23,880
  • 18
  • 111
  • 148
  • 1
    Keep in mind that signed int overflow is Undefined Behaviour. – Jesper Juhl Aug 22 '18 at 04:34
  • 4
    `sum = (sum >> x)` can only bring the value of `sum` closer to 0, on common hardware, so it's not really clear what you are asking . The value can never be out of the range of +/- 32000 after that shift (unless it already was and the shift size is `0`). It would improve the question to show a real example of your calculation – M.M Aug 22 '18 at 04:41
  • There's the `packssdw` instruction (MMX `_mm_packs_pi32` or SSE2 `_mm_packs_epi32` intrinsics, among others), but these work on multiple values and you'd have to expend some effort getting your value in and out of the MMX register so you might not gain anything from it. – 1201ProgramAlarm Aug 22 '18 at 04:45
  • if you wrote some hand-coded assembly you could implement it without any actual branching logic using conditional moves. – Taekahn Aug 22 '18 at 05:02
  • possible duplicate of [Limit integer to bounds](https://stackoverflow.com/q/41535581/995714), [Convert values to values inside a range in c++ , optimized using boost or std](https://stackoverflow.com/q/23884149/995714), [Fastest way to clamp an integer to the range 0-255](https://codereview.stackexchange.com/q/6502/34190), [optimization of clamping rax to `[ 0 .. limit )`](https://stackoverflow.com/q/34071144/995714), [Most efficient/elegant way to clip a number?](https://stackoverflow.com/q/9323903/995714) – phuclv Aug 22 '18 at 05:28
  • What is the distribution of the values? Is it the case, that out of range values are rare? – geza Aug 22 '18 at 06:41
  • Yes, saturation is rare. – Tomas Kubes Aug 26 '18 at 05:56

2 Answers2

5

I'm not at all convinced that attempting to eliminate the if statement(s) is likely to do any real good. A quick check indicates that given this code:

int clamp(int x) {
    if (x < -32768)
        x = -32768;
    else if (x > 32767)
        x = 32767;
    return x;
}

...both gcc and Clang produce branch-free results like this:

clamp(int):
  cmp edi, 32767
  mov eax, 32767
  cmovg edi, eax
  mov eax, -32768
  cmp edi, -32768
  cmovge eax, edi
  ret

You can do something like x = std::min(std::max(x, -32768), 32767);, but this produces the same sequence, and the source seems less readable, at least to me.

You can do considerably better than this if you use Intel's vector instructions, but probably only if you're willing to put a fair amount of work into it--in particular, you'll probably need to operate on an entire (small) vector of values simultaneously to accomplish much this way. If you do go that way, you usually want to take a somewhat different approach to the task than you seem to be taking right now. Right now, you're apparently depending on int being a 32-bit type, so you're doing the arithmetic on a 32-bit type, then afterwards truncating it back down to a (saturated) 16-bit value.

With something like AVX, you'd typically want to use an instruction like _mm256_adds_epi16 to take a vector of 16 values (of 16-bits apiece), and do a saturating addition on all of them at once (or, likewise, _mm256_subs_epi16 to do saturating subtraction).

Since you're writing C++, what I've given above are the names of the compiler intrinsics used in most current compilers (gcc, icc, clang, msvc) for x86 processors. If you're writing assembly language directly, the instructions would be vpaddsw and vpsubsw respectively.

If you can count on a really current processor (one that supports AVX 512 instructions) you can use them instead to operate on a vector of 32 16-bit values simultaneously.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • As a bonus, this does actually compile to `ssat` on ARM32 https://godbolt.org/z/T_Do0J. (And to the same asm as the other answer's ternary operator.) – Peter Cordes Aug 22 '18 at 11:03
4

Are you sure you can beat the compiler at this?

Here's x64 retail with max size optimizations enabled. Visual Studio v15.7.5.

ecx contains the intial value at the start of this block. eax is filled with the saturated value when it is done.

    return (x > 32767) ? 32767 : ((x < -32768) ? -32768 : x);
mov         edx,0FFFF8000h  
movzx       eax,cx  
cmp         ecx,edx  
cmovl       eax,edx  
mov         edx,7FFFh  
cmp         ecx,edx  
movzx       eax,ax  
cmovg       eax,edx 
selbie
  • 100,020
  • 15
  • 103
  • 173