Suppose we have two register-length2 signed1 integers, say a
and b
. We want to compute the value (a + b) / 2
, either rounded up, down, towards zero, or away from zero, whichever way is easier (i.e. we do not care about the rounding direction).
The result is another register-length signed integer (it is clear that the average must be within the range of a register-length signed integer).
What is the fastest way to perform this computation?
You may choose which registers the two integers will initially be in, and which register the average ends up being in.
Footnote 1: For unsigned integers, we can do it in two instructions. This is perhaps the fastest way, although rotate-through-carry is more than 1 uop on Intel CPUs. But only a couple when the count is only 1. An answer on a Q&A about unsigned mean discusses the efficiency.
add rdi, rsi
rcr rdi, 1
The two numbers start in rdi
and rsi
, and the average ends up in rdi
. But for signed numbers, -1 + 3
would set CF, and rotate a 1
into the sign bit. Not giving the correct answer of +1
.
Footnote 2: I specified register-length signed integers so that we can't simply sign extend the integers with a movsxd
or cdqe
instruction.
The closest I've got towards a solution uses four instructions, one of them an rcr
that's 3 uops on Intel, 1 on AMD Zen (https://uops.info/):
add rdi, rsi
setge al
sub al, 1 # CF = !(ge) = !(SF==OF)
rcr rdi, 1 # shift CF into the top of (a+b)>>1
I think a shorter solution probably lies in combining the middle two instructions in some way, i.e. performing CF ← SF ≠ OF
.
I've seen this question, but that's not x86-specific and none of the answers seem to compile to something as good as my solution.