Fastest way to take the average of two signed integers in x86 assembly?

Question

Suppose we have two register-length² signed¹ integers, say a and b. We want to compute the value (a + b) / 2, either rounded up, down, towards zero, or away from zero, whichever way is easier (i.e. we do not care about the rounding direction).

The result is another register-length signed integer (it is clear that the average must be within the range of a register-length signed integer).

What is the fastest way to perform this computation?

You may choose which registers the two integers will initially be in, and which register the average ends up being in.

Footnote 1: For unsigned integers, we can do it in two instructions. This is perhaps the fastest way, although rotate-through-carry is more than 1 uop on Intel CPUs. But only a couple when the count is only 1. An answer on a Q&A about unsigned mean discusses the efficiency.

add rdi, rsi
rcr rdi, 1

The two numbers start in rdi and rsi, and the average ends up in rdi. But for signed numbers, -1 + 3 would set CF, and rotate a 1 into the sign bit. Not giving the correct answer of +1.

Footnote 2: I specified register-length signed integers so that we can't simply sign extend the integers with a movsxd or cdqe instruction.

The closest I've got towards a solution uses four instructions, one of them an rcr that's 3 uops on Intel, 1 on AMD Zen (https://uops.info/):

add rdi, rsi
setge al
sub al, 1          # CF = !(ge) = !(SF==OF)
rcr rdi, 1         # shift CF into the top of (a+b)>>1

I think a shorter solution probably lies in combining the middle two instructions in some way, i.e. performing CF ← SF ≠ OF.

I've seen this question, but that's not x86-specific and none of the answers seem to compile to something as good as my solution.

How does the `add rdi, rsi; rcr rdi, 1` approach not work for signed numbers? I don't quite see it. — fuz, Jul 25 '22 at 17:23
Try starting with `rdi` = -1 and `rsi` = 3. `add rdi, rsi` will set CF, and it will be rotated in to the sign bit of `rdi` by the `rcr rdi, 1` instruction, resulting in some negative number. But the correct answer is 1. — Bernard, Jul 25 '22 at 17:25
@Brendan Nope, try starting with both integers bigger than 2^30. Adding the two integers will set the sign bit, and so your `sar` instruction will keep the sign bit set, resulting in a negative integer. But the correct answer is positive. — Bernard, Jul 25 '22 at 17:42
@Bernard: Yeah, `add rdi, rsi; sar rdi,1` works until there's an overflow. — Brendan, Jul 25 '22 at 17:45
@Bernard In the now deleted comment I meant to first shift right and then add. This would work in this case but lead wrong results e.g. for `rsi = rdi = 1`. — fuz, Jul 25 '22 at 17:45
Hrm (using RAX instead of RDI): `cqo; add rax,rsi; adc rdx,0; shrd rax,rdx,1`. — Brendan, Jul 25 '22 at 17:49
@Brendan Also possible with any pair of registers if you replace `cqo` with `mov hi, reg; sar hi, 63`. The `mov` is a rename and thus essentially free. — fuz, Jul 25 '22 at 17:55
@Brendan: Fails for `rax=1`, `rsi=-1` I believe. You get a carry from the add. Effectively you are zero-extending the second operand, but you would really need to sign-extend both. — Nate Eldredge, Jul 25 '22 at 17:58
`shrd` seems to be slightly slowish according to this document: https://www.agner.org/optimize/instruction_tables.pdf — Bernard, Jul 25 '22 at 17:59
@NateEldredge: D'oh, you're right. Would need to promote RSI to 128-bit properly. Maybe `mov rbx,rsi; cqo; sar rbx,63; add rax,rsi; adc rdx,rbx; shrd rax,rdx,1`. — Brendan, Jul 25 '22 at 18:02
If my math is right, `sar rdi, 1 ; sar rsi, 1 ; add rdi, rsi` works unless both operands are odd, in which case we need to add 1 to the result. Maybe there is some way to use this? — Nate Eldredge, Jul 25 '22 at 18:03
@Brendan your solution fails for the same kind of input as fuz's first solution. — Bernard, Jul 25 '22 at 18:03
@NateEldredge I think changing `add` to `adc` in your solution makes it slightly better, but with some kind of inconsistent rounding. — Bernard, Jul 25 '22 at 18:07
@Bernard: SHRD by an immediate is fast on Intel mainstream CPUs, but a disaster on Alder Lake E-cores (15 uops, 13.6 cycle throughput according to https://uops.info/ testing). Also 6 uops on Zen2, with 3 cycle throughput and latency, so only good on Intel Sandybridge-family. By contrast, RCR-by-1 is 3 uops on Intel, 1 on AMD. So it's worst-case is much less bad. (RCR with counts other than 1 is terrible.) — Peter Cordes, Jul 25 '22 at 18:19
I don't think the `add rdi, rsi; setge rax; sub rax, 1; rcr rdi, 1` works for the `rdi = rsi = 0x8000000000000000` case. — Brendan, Jul 25 '22 at 18:19
Btw there is no such thing as `setge rax`; you have to use an 8-bit register. — Nate Eldredge, Jul 25 '22 at 18:34
@NateEldredge: [chux's C answer](//stackoverflow.com/questions/5697500/take-the-average-of-two-signed-numbers-in-c/29834136#29834136) on another question chooses to only use `a/2 + b/2 + (a%2 + b%2)/2;` when the inputs have the same sign, probably to make the rounding consistent. The both-odd case could be handled as `a&b&1`, which would naively be at least 3 instructions including a `mov` just to materialize it. `xor eax,eax` / `sar rdi,1` / `setc al` / `and rax, rsi` (before the other >>1) also works, needing an instruction to zero-extend a setcc because of unfortunate ISA design choices. — Peter Cordes, Jul 25 '22 at 18:34
@Brendan: No, I think `rdi = rsi = 0x8000000000000000` does work. The `add` leaves OF=1, SF=0, so `setge al` clears `al`, then `sub al, 1` sets the carry flag. — Nate Eldredge, Jul 25 '22 at 18:42
[R.. GitHub STOP HELPING ICE](https://stackoverflow.com/a/5697549/224132) suggests `(a&b)+(a^b)/2` for 2's complement integers, but remember that's signed division by 2 (truncating toward 0, not rounding towards -Inf like a right shift). So mov/and/xor, and then divide by 2 and add. PowerPC can do that efficiently with an arithmetic right shift fixed-up by an `addze`, but not x86-64 (https://godbolt.org/z/j1W8cWz46). — Peter Cordes, Jul 25 '22 at 19:08
@NateEldredge Yeah, you're right setge can only work with an 8-bit register. Now how do I fix my original solution without assuming that rax is initially zero... — Bernard, Jul 26 '22 at 15:47
Changed `rax` to `al`. Both `setge` and `sub` can operate on the `al` register. And we also save a few bytes of instruction memory too. — Bernard, Jul 26 '22 at 15:57
Why would you try to optimize a single calculation? How many calculations of that kind do you want to make? — Thomas Weller, Jul 26 '22 at 20:18
@alexis. I'm not sure what you mean... The average of `-1` and `3` is `1`. — Bernard, Jul 27 '22 at 01:23

Nate Eldredge · Accepted Answer · 2022-07-25T22:07:13.087

33

Depending on how we interpret your lax rounding requirements, the following may be acceptable:

sar rdi, 1
sar rsi, 1
adc rdi, rsi

Try on godbolt

This effectively divides both inputs by 2, adds the results, and adds 1 more if rsi was odd. (Remember that sar sets the carry flag according to the last bit shifted out.)

Since sar rounds to minus infinity, the result of this algorithm is:

exactly correct if rdi, rsi are both even or both odd
rounded down (toward minus infinity) if rdi is odd and rsi is even
rounded up (toward plus infinity) if rdi is even and rsi is odd

As a bonus, for random inputs, the average rounding error is zero.

It should be 3 uops on a typical CPU, with a latency of 2 cycles since the two sar are independent.

edited Jul 25 '22 at 22:07

answered Jul 25 '22 at 19:06

Nate Eldredge

48,811
6
54
82

3

"up" here is always towards +Inf, not symmetric around 0. But yeah, for `avg(-3,-3)`, we get `-2 + -2 + CF(1)`, so we get the correct `-3`. For `avg(3,3)`, we get `1 + 1 + CF(1)` which is also 3. So the always-upward of adding CF counteracts the towards -Inf rounding of arithmetic right shift. – Peter Cordes Jul 25 '22 at 20:06
@PeterCordes Would something like this be useful? lea eax,[edi+esi] shr eax,1 – vengy Jul 26 '22 at 12:50
@vengy: For unsigned inputs that are known to not overflow, yes. i.e. that are already zero-extended from 31 bits or narrower. (Except you'd never use 32-bit address-size with LEA, you'd use `lea eax, [rdi+rsi]` to avoid an address-size prefix). But this Q&A is about full-range signed inputs that may be negative. So even `sar eax, 1` wouldn't be sufficient. – Peter Cordes Jul 26 '22 at 12:55
Seems like this is the only faster solution so far, but with the drawback that the operation is not commutative. – Bernard Aug 02 '22 at 17:04

score 8 · Answer 2 · answered Jul 26 '22 at 03:15

8

As an outside answer, consider the pavg family of instructions.

I say "outside", since this is likely not acceptable to you. It assumes the value is unsigned 8-bit or 16-bit and in an SSE register, which of course also requires SSE. I mention it mainly since it is x86's anointed equivalent to averaging instructions in other ISAs.

In its defense, SSE is ubiquitous by now, even guaranteed on x86-64. Also, this instruction is 1 cycle, and actually can do 4 at once if you like. Best of all, unlike your original solutions, it also correctly handles overflow issues.

Note that it's possible to use an unsigned routine to implement a signed routine, though in general correctly accounting for overflow issues is a nightmare. Your current solution appears to already be broken in that regard, though.

answered Jul 26 '22 at 03:15

geometrian

14,775
10
56
132

Can you maybe range-shift signed to unsigned by adding 128 (i.e. flipping the high bit)? So `pxor` both inputs with `set1_epi8(0x80)`, `pavgb`, then `pxor` back to the signed range? I'd expect that to work even near overflow boundaries, since `pavgb`/`pavgw` does. And you can use other unsigned-rounding bithacks that don't rely on carry-out, if you want a vectorized version of this trick for 32-bit operand-size. (But yeah, not generally worth transferring data from GP integer regs to XMM and back for a single scalar average, especially of signed numbers.) – Peter Cordes Jul 26 '22 at 04:02
@PeterCordes I can't comment on algorithms for using this for signed; that sort of thing is obnoxiously difficult to get right and it's 2am right now. And yeah, the assumption is you're already in an XMM register. Actually, what inspired this answer was a recent image processing paper where this was used for a win; you get back a lot in parallelism doing this over a whole image, and images are often 8-bit unsigned so it's essentially a perfect use-case. – geometrian Jul 26 '22 at 09:06
"Your current solution appears to already be broken in that regard, though." Do you mean the fact that `setge` can only assign to an 8-bit register? Or is it something else I'm not aware of? – Bernard Jul 26 '22 at 15:43
I suppose this solution works, but if my integer was only 16 bits long then I could just perform addition without overflow in regular registers. Unless I'm using some ancient 16-bit x86 hardware. – Bernard Jul 26 '22 at 15:45
1

@Bernard but if you have lots of 8/16-bit integers then this will be much faster because it can do multiple additions at the same time – phuclv Jul 26 '22 at 15:56
Yeah I'm pretty sure this is much faster if we can vectorize the code. – Bernard Jul 26 '22 at 15:59
@Bernard To clarify, the objection is that your solutions do not correctly handle overflow, mainly because when the `add` instruction overflows, an incorrect result can be produced. (This seems to have been noticed and discussed on the question comments too.) – geometrian Jul 27 '22 at 14:52
As to performance, note that if the data comes from memory (likely), you can load it into an SSE register instead of an integer register just as easily, so it should be faster for scalar as well (though if you need subsequent general-purpose integer operations, you may need to move it out of SSE, making it merely equivalent-ish in perf.). One should of course profile / static-analyze these intuitions to know more precisely. – geometrian Jul 27 '22 at 14:53
@imallett I don't see any comments about my solutions incorrectly handling overflow. – Bernard Jul 27 '22 at 16:47

Fastest way to take the average of two signed integers in x86 assembly?

2 Answers2