5

I'm trying to handle both MSVC and GCC compilers while updating this code base to work on GCC. But I'm unsure exactly how GCC's inline ASM works. I'm not great at translating ASM to C, or else I would just use C instead of ASM.

SLONG Div16(signed long a, signed long b)
{
    signed long v;
#ifdef __GNUC__ // GCC - doesn't work.
__asm() {
#else // MSVC
__asm {
#endif
        mov edx, a
        mov ebx, b          
        mov eax, edx           
        shl eax, 16          
        sar edx, 16            
        idiv ebx              
        mov v, eax              
    }
    return v;
}

signed long ROR13(signed long val)
{
    _asm{ 
        ror val, 13
    }
}

I assume ROR13 works something like (val << 13) | (val >> (32 - 13)) but the code doesn't produce the same output.

What is the proper way to translate this inline ASM to GCC, and/or what's the C translation of this code?

  • 1
    Different syntax for the "asm" directive is not the only problem you have. GCC uses a [different syntax](https://en.wikibooks.org/wiki/X86_Assembly/GAS_Syntax) even for the assembler instructions. – Some programmer dude Sep 15 '17 at 22:33
  • `ror` is rotate **right** so `(val >> 13) | (val << (32 - 13))` – Jester Sep 15 '17 at 22:33
  • [Compiler intrinsics](https://en.wikipedia.org/wiki/Intrinsic_function) may help you. For example this Visual Studio [x86 Intrinsics List](https://msdn.microsoft.com/en-us/library/hh977023.aspx) – Weather Vane Sep 15 '17 at 22:37
  • 4
    `Div16` seems to be just `return ((long long)a << 16) / b` (assuming `long` is 32 bits, and `long long` is 64) – Jester Sep 15 '17 at 22:41
  • gcc doesn't generate that asm on x86 for shift+sign-extend. (https://godbolt.org/g/ZUQ5tp). Reported as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82245. It generates similar shl / sar code on other architectures, like ARM or PowerPC though (see the bug report). – Peter Cordes Sep 19 '17 at 13:49

1 Answer

7

GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.

When you do that, you have to be a bit careful, though: signed right shifts are implementation-defined in C, so if you care about portability, you need to cast the value to an equivalent unsigned type:

#include <limits.h>   // for CHAR_BIT

signed long ROR13(signed long val)
{
    return ((unsigned long)val >> 13) |
           ((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}

(See also Best practices for circular shift (rotate) operations in C++).

This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, given the way that rotations work. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
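
If you want a quick sanity check that the portable rotate matches ROR val, 13, a minimal test harness might look like the following. This is my own sketch (it assumes a 32-bit long, as on Windows), not part of the original code base:

#include <limits.h>
#include <stdio.h>

// Same portable rotate as above.
static signed long ROR13(signed long val)
{
    return ((unsigned long)val >> 13) |
           ((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}

int main(void)
{
    // 0x2000 has only bit 13 set; rotating right by 13 leaves only bit 0.
    printf("%lx\n", (unsigned long)ROR13(0x2000));  // expect 1
    // Bit 0 wraps around to bit 19 (with a 32-bit long).
    printf("%lx\n", (unsigned long)ROR13(1));       // expect 80000
    return 0;
}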

To implement Div16 in pure C, you want:

signed long Div16(signed long a, signed long b)
{
    return ((long long)a << 16) / b;
}
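
In other words, Div16 computes the 16.16 fixed-point quotient of a/b, i.e. a/b scaled by 2^16. A small usage sketch, with inputs and expected values of my own choosing:

#include <stdio.h>

// Pure-C Div16, as above: returns (a << 16) / b using a 64-bit intermediate.
static signed long Div16(signed long a, signed long b)
{
    return ((long long)a << 16) / b;
}

int main(void)
{
    // 1/3 in 16.16 fixed point, truncated: 65536 / 3 = 21845 = 0x5555.
    printf("%lx\n", (unsigned long)Div16(1, 3));    // expect 5555
    // 10/4 = 2.5, which is 0x28000 in 16.16 fixed point.
    printf("%lx\n", (unsigned long)Div16(10, 4));   // expect 28000
    return 0;
}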

On a 64-bit architecture that can do native 64-bit division (assuming long is still a 32-bit type, as on Windows), this will be transformed into:

movsxd  rax, a   # sign-extend from 32 to 64, if long wasn't already 64-bit
shl     rax, 16
cqo              # sign-extend rax into rdx:rax
movsxd  rcx, b
idiv    rcx      # or  idiv b  if the inputs were already 64-bit
ret

Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call to their internal library function that provides extended 64-bit division, because they can't prove that using a single 64b/32b => 32b idiv instruction won't fault. (It will raise a #DE exception if the quotient doesn't fit in eax, rather than just truncating.)

In other words, transforming:

int32_t Divide(int64_t a, int32_t b)
{
    return (a / b);
}

into:

mov   eax, a_low
mov   edx, a_high
idiv  b                 # will fault if a/b is outside [-2^31, 2^31-1]
ret

is not a legal optimization—the compiler is unable to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove that those combinations of a and b are impossible. (For example, if b was known to be greater than 1<<16, this could be a legal optimization for a = (int32_t)input; a <<= 16; But even though this would produce the same behaviour as the C abstract machine for all inputs, gcc and clang currently don't do that optimization.)
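
To make that concrete, here's a hypothetical input for the Divide function above where the true quotient doesn't fit in 32 bits. The C code is perfectly well defined (converting the 64-bit result to int32_t just yields an implementation-defined truncated value), but a single 64b/32b idiv with the same inputs would raise #DE. This is my own illustration, not code from the original question:

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

// Same Divide as above: the 64/32 division is promoted to 64/64 by the standard.
static int32_t Divide(int64_t a, int32_t b)
{
    return (a / b);
}

int main(void)
{
    int64_t a = (int64_t)INT32_MAX << 16;   // 0x7FFFFFFF0000
    int32_t b = 2;

    // The full quotient is 0x3FFFFFFF8000, which doesn't fit in 32 bits.
    // C computes it anyway and converts it to int32_t (implementation-
    // defined; typically the low 32 bits, i.e. -32768 here).  A single
    // idiv with EDX:EAX = a and divisor b would fault instead.
    printf("%" PRId32 "\n", Divide(a, b));
    return 0;
}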


There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it (there is a Windows API function, MulDiv, but it's not fast: it just uses inline assembly for its own implementation, and has a bug in a certain case that is now cemented thanks to the need for backwards compatibility). You essentially have no choice but to resort to assembly, either inline or linked in from an external module.
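
(For reference only, and not a recommendation: a call through MulDiv would look roughly like the sketch below. The wrapper name is my own. Keep in mind that MulDiv rounds to the nearest integer rather than truncating, so it is not an exact drop-in for Div16, and it is slow.)

#include <windows.h>   // MulDiv (Windows only)

// Hypothetical wrapper: (a * 65536) / b via the Win32 MulDiv API, which
// uses a 64-bit intermediate but rounds to nearest instead of truncating.
signed long Div16_MulDiv(signed long a, signed long b)
{
    return MulDiv((int)a, 1 << 16, (int)b);
}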

So, you get into ugliness. It looks like this:

signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__     // A GNU-style compiler (e.g., GCC, Clang, etc.)
    signed long quotient;
    signed long remainder;  // (unused, but necessary to signal clobbering)
    __asm__("idivl  %[divisor]"
           :          "=a"  (quotient),
                      "=d"  (remainder)
           :           "0"  ((unsigned long)a << 16),
                       "1"  (a >> 16),
             [divisor] "rm" (b)
           : 
           );
    return quotient;
#elif _MSC_VER      // A Microsoft-style compiler (i.e., MSVC)
    __asm
    {
        mov  eax, DWORD PTR [a]
        mov  edx, eax
        shl  eax, 16
        sar  edx, 16
        idiv DWORD PTR [b]
        // leave result in EAX, where it will be returned
    }
#else
    #error "Unsupported compiler"
#endif
}

This results in the desired output on both Microsoft and GNU-style compilers.

Well, mostly. For some reason, when you use the rm constraint, which gives the compiler the freedom to either use the divisor directly as a memory operand or load it into a register, Clang generates worse object code than if you just use r (which forces it to load it into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
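
Concretely, the only change for the Clang-friendly variant is the constraint on the divisor in the GNU-style branch (a sketch; everything else stays as above):

    __asm__("idivl  %[divisor]"
           :          "=a"  (quotient),
                      "=d"  (remainder)
           :           "0"  ((unsigned long)a << 16),
                       "1"  (a >> 16),
             [divisor] "r"  (b)     // "r" instead of "rm"
           :
           );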

Live Demo on Godbolt Compiler Explorer

(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions—the difference only matters for right shifts—and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Cody Gray - on strike
  • 239,200
  • 50
  • 490
  • 574
  • Why ICC rotates with SHLD: the cynical reason is that it's *slower* on AMD (6 uops, 3c latency, on Bulldozer/Ryzen), but the same cost as ROR on Intel after Nehalem. The less-cynical is that it doesn't so much recognize the rotate idiom and instead just implements shift-and-OR with SHLD when the shift counts complement each other correctly. – Peter Cordes Sep 19 '17 at 04:49
  • It does recognize rotates and use BMI2 RORX when possible, though. https://godbolt.org/g/GL5YQw. (See also https://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c for safe UB-free variable-count rotates, which also makes the point that you really want to name your function `ror32` instead of rotating by whatever the width of `long` is, if you're trying to make a portable function!) – Peter Cordes Sep 19 '17 at 04:57
  • Your 64-bit native division example is weird / wrong. A C compiler will use 64-bit operand-size for the `idiv` as well as the shifts. So it will `movsx rcx, b` / `movsx rax, a` / `shl rax, 16` / `cqo` (not cdq) / `idiv rcx` (https://godbolt.org/g/9Gf9Z3), because the inputs aren't already sign-extended to 64-bit (assuming your inputs are `int32_t`, like `long` on Windows or in general in 32-bit). And it's weird to show `shl a, 16` instead of shifting after copying into a known register. – Peter Cordes Sep 19 '17 at 05:25
  • I just realized my edit changed your C example to talk about a different optimization. You were talking about just using a 32/32 => 32b division (`mov eax, a` / `cdq`). That's obviously invalid, and isn't the optimization that the OP is aiming for with inline asm, so I thought it was more useful to talk about that. When I made the edit, I was thinking that's what you meant to talk about in the first place, but got mixed up somehow. Anyway, I think it's a better answer now, but wanted to give you a heads up on what I did to your answer. – Peter Cordes Sep 19 '17 at 05:51
  • 1
    I wrote it with a splitting headache, so I'm not surprised some mistakes snuck in there. Thanks for fixing, @Peter! Regarding ICC, it looks like the cynical reason is the one more likely to be correct. Older versions of ICC do, in fact, use `ROL`. It's not until ICC 17 (which is the only one I originally looked at) that you see `SHLD`, which is consistent with when Intel might stop caring about CPUs older than Nehalem. :-( For the 64-bit native division, MSVC actually uses the `CDQ` mnemonic, so I wasn't totally crazy. I basically copied its output. Except...I used the wrong reg sizes. Heh... – Cody Gray - on strike Sep 19 '17 at 10:30
  • Update on SHLD vs. ROR: clang will use SHLD on Sandybridge-family to avoid partial-flags updates from rotates. https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86.td#L310. This is only possible win with immediate counts: `shld r,r,cl` is 4 uops on SnB/HSW/SKL, but immediate-count `ror` and `shld` are both 1 uop. It's maybe good on SnB specifically, but HSW has lower throughput and higher latency for `shld` than `ror` according to Agner Fog. Either way, it doesn't seem like a great idea unless you're avoiding the `ror r,1` special encoding, which decodes to 2 uops. – Peter Cordes Jan 29 '18 at 04:13
  • The `ror reg,1` short-form encoding has extra flag semantics: `OF` is defined, instead of being undefined for variable-count rotates. Intel's [Description section](https://github.com/HJLebbink/asm-dude/wiki/RCL_RCR_ROL_ROR) just says "1-bit rotates", but I think that means the special encodings only, *not* when the other encodings or variable-count rotates rotate by 1. This would explain why `ror r,imm` is 1 uop while `ror r,1` is 2 uops. – Peter Cordes Jan 29 '18 at 04:20