Optimized code in VC++ and ASM

Question

Good evening. Sorry, I used Google Translate. I use NASM with VC++ on x86, and I'm learning how to use MASM on x64.

Is there any way to specify where each argument and the return value of an assembly function should go, so that the compiler can place the data there as efficiently as possible? Can we also specify which registers will be used, so that the compiler knows which data is still preserved and can make the best use of it?

For example, since there is no intrinsic function that maps exactly to IDIV r/m64 (64-bit signed integer division in assembly language), we may need to implement it ourselves. IDIV requires that the low part of the dividend/numerator be in RAX, the high part in RDX, and the divisor/denominator in any register or in memory. At the end, the quotient is in RAX and the remainder in RDX. We may therefore want to write functions like this (I added some unnecessary items for illustration):

void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
    __asm(
        // Specify used register: [rax], specify pre location: NumLow --> [rax]
        reg(rax)=NumLow ,
        // Specify used register: [rdx], specify  pre location: NumHigh --> [rdx]
        reg(rdx)=NumHigh ,
        // Specify required memory: memory64bits [den], specify pre location: Den --> [den]
        mem[64](den)=Den ,
        // Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
        reg(st0)=25*0.5 ,
        // Specify used register: [bh]
        reg(bh) ,
        // Specify required memory: memory64bits [nothing]
        mem[64](nothing) ,
        // Specify used register: [st1]
        reg(st1)
    ){
        // Specify code
        IDIV [den]
    }(
        // Specify pos location: [rax] --> *Quo
        *Quo=reg(rax) ,
        // Specify pos location: [rdx] --> *Rem
        *Rem=reg(rdx)
    ) ;
}

Is it possible to do something at least close to that? Thanks for all help.

If there is no way to do this, it's a shame, because it would certainly be a great way to implement high-level functions with assembly-level features. I think it's a simple interface between C++ and ASM that should already exist, allowing assembly code to be embedded inline at a high level, practically as simply as C++ code.

Visual C++ does not support inline assembly at all for x64 native or ARM. Intrinsics are preferred as they leave register scheduling to the optimizer without lots of strange boundaries forced by inline asm. — Chuck Walbourn, Jul 12 '17 at 05:45
In addition to what chuck said, `IDIV` may not work as you expect. Consider: `RDX = 0x0000000000000fff, RAX = 0x0000000000000002`. What do you expect to happen when you do `idiv rax`? Well, it should divide RDX:RAX by 2 and put the quotient in rax. But 0xfff0000000000000002 divided by 2 is 0x7ff8000000000000001, and that's too big for a 64-bit register. As a result, you get an "integer overflow exception." Doing this type of 128-bit division only works if the quotient fits in 64 bits (for unsigned `DIV`, that means RDX < divisor). That tends to limit how useful it is. — David Wohlferd, Jul 12 '17 at 06:20
David, I know. But exception handling exists for that. Some problems never trigger the exception. This can also be used to get a 128-bit quotient and a 64-bit remainder using two IDIVs. It's very good! — RHER WOLF, Jul 12 '17 at 08:09
Ok, as long as you understand how it works. That said, since you can't use inline asm in 64-bit code using VS, you will need to write this as a straight asm routine. Calling the routine from C is a simple matter of declaring the routine as an extern. The calling convention puts the first 4 parameters in rcx, rdx, r8, r9. This isn't ideal for your needs, but it's workable. Returning your results will be a bigger challenge. You may find [agner's work](http://www.agner.org/optimize/calling_conventions.pdf) useful here. — David Wohlferd, Jul 12 '17 at 08:54
This is the kind of thing that is trivial with gcc extended assembly, but is impossible with VC++. — Matteo Italia, Jul 12 '17 at 10:01

Cody Gray - on strike · Accepted Answer · 2017-07-12T17:09:02.417

As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.

Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.

Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.

Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.

The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
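For comparison, here is a rough sketch of what the input/output operand constraints look like in GNU extended asm (GCC/Clang only, not MSVC). The function name `idiv128by64` and its parameter layout are made up for illustration; the non-x86-64 branch is only there so the sketch compiles elsewhere.

```cpp
#include <cstdint>

// Hypothetical helper, not from the original post.
// Precondition: divisor != 0 and the quotient fits in a signed 64-bit value;
// otherwise the IDIV path raises a divide-error exception.
inline int64_t idiv128by64(int64_t hi, uint64_t lo, int64_t divisor, int64_t* rem)
{
#if defined(__x86_64__)
    int64_t quot, r;
    asm("idivq %[den]"
        : "=a"(quot), "=d"(r)                    // outputs: RAX, RDX
        : "a"(lo), "d"(hi), [den] "rm"(divisor)  // inputs: RAX, RDX, reg or mem
        : "cc");                                 // IDIV leaves the flags undefined
    *rem = r;
    return quot;
#else
    // Fallback for other targets so the sketch still compiles (uses __int128).
    const __int128 n = static_cast<__int128>(
        (static_cast<unsigned __int128>(static_cast<uint64_t>(hi)) << 64) | lo);
    *rem = static_cast<int64_t>(n % divisor);
    return static_cast<int64_t>(n / divisor);
#endif
}
```

The constraints tell the compiler which registers the instruction needs, so it can arrange for the values to already be there instead of shuffling them through memory.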

On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:

__int128_t dividend = 1234;
int64_t    divisor  = 64;
int64_t    quotient = (dividend / divisor);

Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the 128/64-bit form of the DIV/IDIV instructions for this, because doing so would be non-conforming. These instructions raise an exception when the quotient does not fit in 64 bits, whereas the standard says that the result should be truncated. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
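As a small illustration of the multiplication case, assuming a compiler with the `unsigned __int128` extension (GCC or Clang on a 64-bit target), the high half of a 64x64-bit product can be written in plain C++. The name `mulhi64` is made up for this sketch.

```cpp
#include <cstdint>

// Hypothetical helper: upper 64 bits of the full 128-bit product a * b.
// Relies on the GCC/Clang unsigned __int128 extension (not available in MSVC).
inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
    const unsigned __int128 product = static_cast<unsigned __int128>(a) * b;
    return static_cast<uint64_t>(product >> 64);
}
```

On x86-64 this typically compiles to a single MUL (or MULX), since the full product can never overflow 128 bits.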

This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
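To make the "multiplication by the reciprocal" point concrete, here is a minimal sketch of the trick for unsigned 32-bit division by the constant 10. The function name `div10` is made up; the constant and shift are the standard ones for this divisor.

```cpp
#include <cstdint>

// Unsigned 32-bit division by the constant 10 without a DIV instruction.
// 0xCCCCCCCD is ceil(2^35 / 10). The product is formed in 64 bits so that
// no high bits are lost before the shift by 35.
inline uint32_t div10(uint32_t n)
{
    return static_cast<uint32_t>((static_cast<uint64_t>(n) * 0xCCCCCCCDu) >> 35);
}
```

This only works when the divisor is known at compile time, which is why it does not help with a general 128/64-bit division by a runtime value.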

Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:

; Windows 64-bit calling convention passes parameters as follows:
; RCX == first  64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8  == third  64-bit integer parameter (divisor)
; R9  == fourth 64-bit integer parameter (pointer to remainder)
.code
Div128x64 PROC
    mov  rax, rcx    ; low bits of dividend go in RAX (high bits are already in RDX)
    idiv r8          ; 128-bit divide (RDX:RAX / R8)
    mov  [r9], rdx   ; store remainder
    ret              ; return, with quotient in RAX
Div128x64 ENDP
END

Then you just prototype that in your C++ code as:

// extern "C" prevents C++ name mangling, so the linker can find the MASM symbol.
extern "C" int64_t Div128x64(int64_t  loDividend,
                             int64_t  hiDividend,
                             int64_t  divisor,
                             int64_t* pRemainder);

and you're done. Call it as desired.

The equivalent can be written for unsigned division, using the DIV instruction.

No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.

Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
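To give a sense of what that checking involves, here is a sketch of a checked signed 128/64-bit division written in C++. It relies on the GCC/Clang `__int128` extension, so it does not apply to MSVC as-is, and the name `checkedDiv128x64` and its signature are assumptions for illustration.

```cpp
#include <cstdint>

// Hypothetical checked wrapper: returns false instead of faulting when the
// divisor is zero or the quotient does not fit in a signed 64-bit value.
inline bool checkedDiv128x64(int64_t hi, uint64_t lo, int64_t divisor,
                             int64_t* quot, int64_t* rem)
{
    if (divisor == 0) return false;

    // Assemble the 128-bit dividend without shifting a negative value.
    const unsigned __int128 bits =
        (static_cast<unsigned __int128>(static_cast<uint64_t>(hi)) << 64) | lo;
    const __int128 n = static_cast<__int128>(bits);

    // INT128_MIN / -1 is the one case where the 128-bit division itself
    // overflows; its quotient could not fit in 64 bits anyway.
    if (divisor == -1 && bits == (static_cast<unsigned __int128>(1) << 127))
        return false;

    const __int128 q = n / divisor;
    if (q < INT64_MIN || q > INT64_MAX) return false;

    *quot = static_cast<int64_t>(q);
    *rem  = static_cast<int64_t>(n % divisor);  // |rem| < |divisor|, always fits
    return true;
}
```

Note that this version already performs the full 128-bit division in order to do the check, which is the point made above: once the checks are in place, little is gained over the library routine.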

MSVC has intrinsics for 128/64 => 64-bit `idiv` and `div`, namely `_div128` and `_udiv128`. See [128-bit division intrinsic in Visual C++](https://stackoverflow.com/a/56033459). (But those may not have existed until near when this answer was written!) Anyway, in new code you definitely want those intrinsics, not the overhead of a `call`. — Peter Cordes, Jan 07 '22 at 13:28

score 0 · Answer 2 · edited Jul 12 '17 at 15:50

No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.

A note: you can solve your issue with two movs (to put the inputs in the correct registers). You should not worry about that; current CPUs handle mov very well. Adding a mov to the code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter much.
