Signed saturation bithack tricks
Note the saturation return values (0x7FFFFFFF and 0x80000000) are bitwise NOT of each other, and that we could generate them with tmp ^ 0x7FFFFFFF where tmp = x>>31, all-zero or all-one according to the sign. It's also ((unsigned)x>>31) + (unsigned)INT_MAX, i.e. 0x7FFFFFFFu + (0 or 1). (Unsigned to avoid signed-overflow UB in C.)
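As a minimal C sketch (the function names are mine, just for illustration), the two constructions look like this, assuming >> on a negative int32_t is an arithmetic shift, as it is on mainstream compilers:
#include <stdint.h>

// sketch: two ways to build the saturation value from the sign of x
int32_t sat_value_xor(int32_t x) {
    int32_t tmp = x >> 31;          // 0 or -1 (all-zero / all-one) from the sign
    return tmp ^ 0x7FFFFFFF;        // 0x7FFFFFFF for x >= 0, 0x80000000 for x < 0
}

int32_t sat_value_add(int32_t x) {
    // unsigned math: 0x7FFFFFFFu + (0 or 1).  Converting 0x80000000u back to int32_t
    // is implementation-defined, but does what you want on mainstream compilers.
    return (int32_t)(((uint32_t)x >> 31) + 0x7FFFFFFFu);
}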
That's what Rust does for x.saturating_mul(2) or x.saturating_add(x). (Yes, that's right, Rust has saturating add and mul as primitive operations on all primitive integer types. If you were using Rust, would you count this as one "operation"? CPUs don't run source code, they run asm. Some languages have a more limited set of operations than most CPUs, especially C, which also lacks rotates, popcount, and clz/ctz; C++ only added those in C++20 with #include <bit>.)
// Rust
pub fn saturating_mul2(x: i32) -> i32 {
x.saturating_mul(2)
}
pub fn saturating_add_self(x: i32) -> i32 {
x.saturating_add(x)
}
Godbolt with asm for x86-64, AArch64, and --target armv7-unknown-linux-gnueabi:
example::saturating_mul2: @@ ARMv7
adds r1, r0, r0 @ add, updating flags (S suffix)
mvn r2, #-2147483648 @ r2 = ~0x80000000 = 0x7FFFFFFF
eorvs r1, r2, r0, asr #31 @ if (adds overflowed) r1 = 0x7FFFFFFF ^ (x>>31);
@ eorvs is predicated on V Set = signed overflow, otherwise r1 left unmodified holding x+x
mov r0, r1 @ get the result into the same reg as the input
bx lr
example::saturating_add_self:
qadd r0, r0, r0 @ yup, ARMv7 has an instruction for this!
bx lr @ the mul(2) version is a missed optimization
Or with AArch64:
example::saturating_mul2:
asr w8, w0, #31
adds w9, w0, w0 // x+x and set flags
eor w8, w8, #0x7fffffff
csel w0, w8, w9, vs // select
ret
example::saturating_add_self:
adds w8, w0, w0 // x+x and set flags
asr w9, w8, #31 // (x<<1)>>31 is the opposite of the sign of the correct result, if there was overflow
eor w9, w9, #0x80000000
csel w0, w9, w8, vs
ret
Note that LLVM's strategy for the add_self version is like a general x+y saturation. It uses the opposite of the sign of the x+x result from adds, so it has worse latency and instruction-level parallelism. (The right shift has to wait for the add, instead of running separately on the same input.) Signed overflow is when the actual result has a different sign from the mathematically correct result, so this is a good trick if you have two separate inputs that might have different signs. But actually that's not necessary: + can only overflow if both inputs have the same sign, so you can just pick either one.
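For example, a general saturating x+y along those lines might look like this in C (a sketch of my own, not the question's code), using the usual sign-comparison overflow test and taking the saturation bound from one input's sign:
#include <stdint.h>

// sketch: general saturating x+y with the pure bithack (assumes arithmetic >>).
// (x ^ sum) & (y ^ sum) is negative exactly when the wrapped sum's sign differs
// from both inputs' signs, i.e. exactly when the signed add overflowed.
int32_t sat_add32(int32_t x, int32_t y) {
    int32_t sum = (int32_t)((uint32_t)x + (uint32_t)y);  // wrapping add, no UB
    int32_t saturated = (x >> 31) ^ 0x7FFFFFFF;          // bound from either input's sign
    int32_t overflow = (x ^ sum) & (y ^ sum);            // negative iff overflow
    return overflow < 0 ? saturated : sum;
}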
x86-64 comes out pretty similar to the AArch64 versions, but with an extra mov instruction. (Or with the .saturating_add version, redoing the +, once with lea to feed a sar right shift, once with add to generate the potential return value and set OF for a cmov.)
There's an optional / proposed C extension with _Sat types; see an SO answer, and an example of using it for addition: https://godbolt.org/z/5EdP1EnxT
How to get a C compiler to emit asm like this?
Generating the saturation result as (x>>31) ^ 0x7FFFFFFF is easy, at least if you can assume >> is an arithmetic right shift. >> on a negative signed integer is implementation-defined behaviour, but all mainstream compilers choose to define it in a useful way, at least on 2's complement systems.
So it's just a matter of detecting signed overflow in a left shift somehow.
Unfortunately in ISO C, signed integer overflow is undefined behaviour, including in x << 1. GNU C does define the behaviour of shifts even when they overflow (unlike for x+x, for which you'd have to use __builtin_sadd_overflow), so IDK how portable you care about being.
If you're willing to use GNU C overflow detection builtins (which can often compile to use the overflow flag set by addition, so that really is one primitive machine operation), see Signed saturated add of 64-bit ints? - for AArch64, clang emits 4 instructions, comparable to what rustc's saturating math does, although using add x9, x9, x1, lsr #63 to do INT64_MAX + 0 or 1.
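A sketch of that approach for 64-bit (not copied from the linked answer; the function name is mine):
#include <stdint.h>

// GNU C / Clang builtin: the compiler can often use the overflow flag from the add itself.
int64_t sat_add64(int64_t x, int64_t y) {
    int64_t sum;
    if (__builtin_add_overflow(x, y, &sum))
        // overflow means x and y had the same sign, so take the bound from x:
        // INT64_MAX + 0 for positive overflow, INT64_MAX + 1 (the INT64_MIN bit pattern) for negative
        sum = (int64_t)(((uint64_t)x >> 63) + (uint64_t)INT64_MAX);
    return sum;
}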
If you only care about 2's complement machines, you could use an unsigned type in the C source for the left shift. The whole bithack basically already assumes that, so we only need to use signed types for comparing to 0, or for arithmetic right shift.
#include <stdint.h>
// int32_t is guaranteed to be 2's complement with exactly 32 value bits, if it exists
// >> on it is *not* guaranteed to be arithmetic, but we can fairly safely assume that
int saturating_shl (int32_t x)
{
int32_t saturated = (x>>31) ^ 0x7FFFFFFF; // 2 operations, plus maybe a MOV
int32_t x2 = ((uint32_t)x) << 1; // 1 op. avoid signed-integer overflow
int32_t overflow = (x ^ x2); // 1 op. negative if x and x*2 have different signs, i.e. overflow
return (overflow < 0) ? saturated : x2; // 1 op (cmov)
}
This is sub-optimal; the compiler fails to use the signed-overflow result of the ALU add or left-shift instruction, instead actually doing an XOR with the original value. But it is still only the 5 operations in the source code, e.g. GCC for x86-64 (Godbolt):
saturating_shl:
mov edx, edi
lea eax, [rdi+rdi] # x<<1
sar edx, 31 # x>>31
xor edx, 2147483647 # saturated
xor edi, eax # x ^ x2 to set FLAGS
cmovs eax, edx # SF is set when (x^x2) < 0
ret
Update: I missed that the question arbitrarily disallowed the ternary operator. I'm not going to edit further as @njuffa posted an answer that avoids it, compiling to the same instructions as the ternary version with clang. (Although worse asm with GCC, which doesn't sort out the blend idiom back to a cmov / csel.)
Counting operations - CPU asm, not C operators
Real CPUs run machine code, not C operators. What matters when micro-optimizing is how efficiently your code can compile for specific machines. It's common for modern machines to have a conditional-select instruction like x86 cmov or AArch64 csel, so expressions like ((~temp) & rval) | (temp & flow) that manually blend according to a mask can hopefully compile to one machine operation, using a FLAGS condition instead of generating an integer mask from it.
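For instance (hypothetical helpers, just to illustrate the two spellings of the same select, where mask is all-zero or all-one, e.g. from x >> 31):
#include <stdint.h>

// manual mask blend: mask must be all-zero or all-one
int32_t select_mask(int32_t mask, int32_t rval, int32_t flow) {
    return (~mask & rval) | (mask & flow);
}

// the ternary version of the same thing; compilers usually turn this into cmov / csel
int32_t select_ternary(int32_t x, int32_t rval, int32_t flow) {
    return (x < 0) ? flow : rval;
}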
Or if this is auto-vectorizing with SIMD, then a SIMD compare already produces a mask of all-0 / all-1 elements, and many ISAs have blend instructions which can apply them, like x86's SSE4.1 pblendvb. Also many ISAs have an andnot instruction like x86's SSE2 pandn which does (~mask) & rval in a single instruction, so that blend takes 3 cheap instructions, vs. one less-cheap pblendvb which is 2 uops on some CPUs; https://uops.info/.
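With intrinsics, the two blend strategies look something like this (a sketch; the wrapper names are mine, and mask is assumed to be all-0 / all-1 per element from a SIMD compare):
#include <immintrin.h>

// SSE2: 3 cheap instructions (pandn, pand, por)
__m128i blend_sse2(__m128i mask, __m128i rval, __m128i flow) {
    return _mm_or_si128(_mm_andnot_si128(mask, rval), _mm_and_si128(mask, flow));
}

// SSE4.1: one pblendvb, picking flow bytes where the mask's high bit is set
__m128i blend_sse41(__m128i mask, __m128i rval, __m128i flow) {
    return _mm_blendv_epi8(rval, flow, mask);
}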
But on the other hand, extra reg-reg mov or movdqa instructions are part of the cost you have to worry about, on machines like x86 without AVX that can only do z = x; z &= y;, not z = x & y, in a single instruction.
TL:DR: counting the operators in C might be a rough proxy for cost, but it's not exact. There are other considerations than throughput, too, like latency (critical path length from inputs to output), and instruction-level parallelism.
Certain machines have instructions that can do things C doesn't have a single operator for. For example, 32-bit ARM has qadd, saturating signed addition, so a sufficiently smart compiler could optimize your function to one instruction if you write it correctly and the compiler recognizes the "idiom" you're using. Compilers can do this in practice for a few things, like rotates, with (x >> (n&31)) | (x << ((-n)&31)) compiling to a single x86 rotate instruction (ror).
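That idiom, as a self-contained sketch:
#include <stdint.h>

// the standard UB-free rotate idiom; GCC and clang recognize it and emit one rotate instruction
uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> (n & 31)) | (x << ((-n) & 31));
}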
And many real CPUs set FLAGS based on signed overflow, or on the MSB, so a compare and ternary can sometimes be easier for the compiler to sort out than a right shift and using a mask.
Related: