Signed numbers modulus in instruction level

Question

Reading this interesting discussion while ago regarding the different implementations of signed numbers modulus calculation in gcc and clang raised a question for me (which is not discussed in the discussion).

Why on earth, the implementation of this:

if(num % 2 == 1)

starts with this (similar at both clang and gcc):

movl    %edi, %eax
shrl    $31, %eax
addl    %edi, %eax

Why are we starting with ((num >> 31) + num) ? why taking the MSB and add it to the number? where is this coming from?

I doubt you have optimization enabled. In general that is done to handle rounding for negative numbers but `% 2` should be optimized to `& 1`. — Jester, Oct 22 '18 at 16:22
@jester `& 1` does not work for modulus of negative numbers. — izac89, Oct 22 '18 at 16:23
Ah yeah forgot that C modulo gives negative remainder. Anyway, the MSB is the sign bit. So that's done to handle the negatives. You'd need to look at the complete code to see how it's affecting the result. — Jester, Oct 22 '18 at 16:25

Antti Haapala -- Слава Україні · Accepted Answer · 2018-10-22T16:50:57.037

The result of negative % 2 is the negative remainder, with division rounded towards 0 in C. & 1 works only for positive numbers.

So the compiler generates code that adds the sign bit numbers - essentially increasing negative numbers by one, so that -1 has 0 in the last bit, then -2 has 1 in the last bit, -3 has 0 again... then we mask out all but the the last bit and subtract the sign bit from the result, for example:

movl    %edi, %edx
shrl    $31, %edx
leal    (%rdi,%rdx), %eax
andl    $1, %eax
subl    %edx, %eax

It is perhaps slightly more performant than the idiv instruction. On Core i7 for example, IDIV with r32 will have latency of 17-28 and throughput 7-12 where all these others in the generated code have them ~1 each

"perhaps slightly" is a big understatement. It has under ~1/3rd the latency, or maybe 7x the throughput if used repeatedly so the divider would be a bottleneck. (Or ~twice the throughput if just plain front-end uop throughput is the bottleneck, rather than the unpipelined divider.) — Peter Cordes, Oct 23 '18 at 00:15

Signed numbers modulus in instruction level

1 Answers1