GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.
If you do that, you have to be a bit careful, though: right-shifting a negative signed value is implementation-defined in C, so if you care about portability, you need to cast the value to the equivalent unsigned type:
#include <limits.h>   // for CHAR_BIT
signed long ROR13(signed long val)
{
    return ((unsigned long)val >> 13) |
           ((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}
(See also Best practices for circular shift (rotate) operations in C++).
This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, because rotating a 32-bit value right by 13 is the same as rotating it left by 32 - 13 = 19. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
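The equivalence between a right rotation by 13 and a left rotation by 19 can be checked with a small sketch. The helper names rotr32 and rotl32 are hypothetical (they are not from the code above), and the sketch assumes a fixed 32-bit width via uint32_t rather than the width of long:

```c
#include <stdint.h>

/* Hypothetical helper: rotate a 32-bit value right by n bits.
   The count is reduced modulo 32, so n == 0 never shifts by 32. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Hypothetical helper: rotate a 32-bit value left by n bits. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
```

For any 32-bit x, rotr32(x, 13) and rotl32(x, 19) should give the same value, which is why a compiler is free to pick either instruction.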
To implement Div16 in pure C, you want:
signed long Div16(signed long a, signed long b)
{
    return ((long long)a << 16) / b;
}
On a 64-bit architecture that can do native 64-bit division (assuming long is still a 32-bit type, as on Windows), this will be transformed into:
movsxd rax, a # sign-extend from 32 to 64, if long wasn't already 64-bit
shl rax, 16
cqo # sign-extend rax into rdx:rax
movsxd rcx, b
idiv rcx # or idiv b if the inputs were already 64-bit
ret
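To make it concrete what the pure-C Div16 computes, here is a minimal sketch that treats the operands as 16.16 fixed-point numbers. The names FIX16_ONE and div16_fixed are hypothetical, and int32_t/int64_t are used in place of long/long long as an assumption, to pin down the widths. It multiplies by 65536 instead of shifting, because left-shifting a negative signed value is undefined in C; compilers typically turn the multiplication back into a shift.

```c
#include <stdint.h>

#define FIX16_ONE 65536  /* 1.0 in 16.16 fixed point (hypothetical name) */

/* Hypothetical helper: 16.16 fixed-point division, a / b.
   The dividend is widened to 64 bits and scaled by 2^16 first,
   so the quotient keeps 16 fractional bits.
   The caller must ensure b != 0 and that the result fits in 32 bits. */
int32_t div16_fixed(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * FIX16_ONE) / b);
}
```

For example, dividing 3.0 (3 * 65536) by 2.0 (2 * 65536) yields 98304, which is 1.5 in 16.16 fixed point.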
Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call into their internal library function that provides extended 64-bit division, because they can't prove that using a single 64b/32b => 32b idiv instruction won't fault. (It will raise a #DE exception if the quotient doesn't fit in eax, rather than just truncating.)
In other words, transforming:
int32_t Divide(int64_t a, int32_t b)
{
return (a / b);
}
into:
mov eax, a_low
mov edx, a_high
idiv b # will fault if a/b is outside [-2^31, 2^31-1]
ret
is not a legal optimization, so the compiler is unable to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove that those combinations of a and b are impossible. (For example, if b was known to be greater than 1<<16, this could be a legal optimization for code like a = (int32_t)input; a <<= 16; but even though this would produce the same behaviour as the C abstract machine for all inputs, gcc and clang currently don't do that optimization.)
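The condition the compiler would have to prove can be written out as a small sketch. The function name quotient_fits_int32 is hypothetical; it models, in portable C, the cases in which a single 64b/32b => 32b idiv would not raise #DE, assuming the instruction faults exactly when the divisor is zero or the truncated quotient does not fit in a signed 32-bit register:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when a / b is defined and the
   quotient is representable as int32_t. */
bool quotient_fits_int32(int64_t a, int32_t b)
{
    if (b == 0)
        return false;               /* division by zero always faults */
    if (a == INT64_MIN && b == -1)
        return false;               /* even the 64-bit quotient would overflow */
    int64_t q = a / b;              /* full 64/64 division, as the standard requires */
    return q >= INT32_MIN && q <= INT32_MAX;
}
```

Unless the compiler can show that this predicate holds for every possible input, it has to keep the full 64-bit division.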
There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it (although there is a Windows API function, MulDiv, it's not fast, and it just uses inline assembly for its own implementation, with a bug in a certain case that is now cemented thanks to the need for backwards compatibility). You essentially have no choice but to resort to assembly, either inline or linked in from an external module.
So, you get into ugliness. It looks like this:
signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__     // A GNU-style compiler (e.g., GCC, Clang, etc.)
    signed long quotient;
    signed long remainder;  // (unused, but necessary to signal clobbering)
    __asm__("idivl %[divisor]"
            : "=a" (quotient),
              "=d" (remainder)
            : "0" ((unsigned long)a << 16),
              "1" (a >> 16),
              [divisor] "rm" (b)
            :
           );
    return quotient;
#elif _MSC_VER      // A Microsoft-style compiler (i.e., MSVC)
    __asm
    {
        mov eax, DWORD PTR [a]
        mov edx, eax
        shl eax, 16
        sar edx, 16
        idiv DWORD PTR [b]
        // leave result in EAX, where it will be returned
    }
#else
    #error "Unsupported compiler"
#endif
}
This results in the desired output on both Microsoft and GNU-style compilers.
Well, mostly. For some reason, when you use the rm constraint, which gives the compiler the freedom to choose whether to treat the divisor as a memory operand or load it into a register, Clang generates worse object code than if you just use r (which forces it to load the divisor into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
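As a rough sketch of that suggestion, the GNU-style branch with the r constraint might look like the following. The name div16_r is hypothetical. Two things here are assumptions rather than part of the code above: int32_t is used instead of signed long so that the operand size matches the 32-bit idivl even where long is 64 bits, and a pure-C fallback is included so that the sketch still compiles on non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical variant of Div16 using the "r" constraint for the divisor.
   Like the original, it faults (#DE) if b is zero or the quotient
   does not fit in 32 bits. */
int32_t div16_r(int32_t a, int32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int32_t quotient;
    int32_t remainder;  /* unused, but tells the compiler EDX is overwritten */
    __asm__("idivl %[divisor]"
            : "=a" (quotient),
              "=d" (remainder)
            : "0" ((uint32_t)a << 16),  /* low 32 bits of (a << 16) in EAX */
              "1" (a >> 16),            /* high 32 bits in EDX (arithmetic shift on GCC/Clang) */
              [divisor] "r" (b)         /* force the divisor into a register */
            : "cc");
    return quotient;
#else
    /* Assumed fallback for other targets: plain 64-bit division. */
    return (int32_t)(((int64_t)a * 65536) / b);
#endif
}
```

With 16.16 inputs, div16_r(3 * 65536, 2 * 65536) gives 98304 (that is, 1.5), the same as the pure-C version.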
Live Demo on Godbolt Compiler Explorer
(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions (the difference only matters for right shifts), and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)