is there a way to compute:
/** clipped(a - b) **/
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
return a > b ? a - b : 0;
}
using some binary operations instead of a test?
is there a way to compute:
/** clipped(a - b) **/
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
return a > b ? a - b : 0;
}
using some binary operations instead of a test?
The original function compiles to a conditional move and is not subject to the performance concern you have. You are attempting to perform premature optimization here, and it will do absolutely no good for you. Write readable code, then profile it, then identify and optimize performance bottlenecks.
You say "tests can lower performances". If you want to do micro-optimizations like this, you should aim to understand the reasons underlying these rules of thumb, not apply them as unconditional truth.
What lowers performance about "tests" are (mispredicted) control flow branches. Modern CPUs heavily use instruction pipelining, and control flow branches mean it's unclear which instruction comes next. The common approach is that the CPU guesses (using branch prediction hardware/algorithms) which way the branch will go. If it gets it wrong, it has to flush the entire pipeline and refill it, which wastes cycles.
Well, the code in question has no such issue. It compiles to a conditional move:
clipped_substract(unsigned char, unsigned char):
mov eax, edi
mov edx, 0
sub eax, esi
cmp dil, sil
cmovbe eax, edx
ret
You can see that there is no control flow branch here. See Why is a conditional move not vulnerable for Branch Prediction Failure? for why that is not subject to the above performance concern.
I repeat: Write readable code, then profile it, then identify and optimize performance bottlenecks. What you attempted here was to fix a performance problem code that didn't even exist in the first place (not to mention you didn't prove that the performance of this code is even relevant to your program's performance).
Using clang 11.0
, compiler optimizations enabled
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
long c = b - a;
return ((unsigned long)c >> (sizeof(c)*8-1)) * c;
}
Although time taken by any of the alternatives is neglagibly different, but when performing 10000000
iterations, other ways take around 0.300secs
on my machine and this solution takes around 0.200secs
.
But this is only valid if you use a type whose size is greater than char
, for example here I took long
. I don't remember exact definitions of char
or long
but I know long
is wider on my machine compared to char
. So adjust it accordingly.
Basically, I just store the result of subtraction in a larger width variable. If the result is negative, most significant bit will be 1
else 0
.
/** clipped(a - b) **/
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
return a - (a+b) / 2 + getAbs(a-b) / 2;
}
unsigned int getAbs(int n)
{
int const mask = n >> (sizeof(int) * 8 - 1);
return ((n + mask) ^ mask);
}