How can I tell the MSVC compiler to use the 64bit/32bit division operation to compute the result of the following function for the x86-64 target:

#include <stdint.h> 

uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
  if (a > b)
        return ((uint64_t)b<<32) / a;   //Yes, this must be cast because the result of b<<32 is undefined
  else
        return (uint32_t)-1;
}

I would like the code, when the if statement is true, to compile to use the 64bit/32bit division operation, e.g. something like this:

; Assume arguments on entry are: Dividend in EDX, Divisor in ECX
mov edx, edx  ;A dummy instruction to indicate that the dividend is already where it is supposed to be
xor eax,eax
div ecx   ; EAX = EDX:EAX / ECX

...however the x64 MSVC compiler insists on using the 128bit/64bit div instruction, such as:

mov     eax, edx
xor     edx, edx
shl     rax, 32                             ; Scale up the dividend
mov     ecx, ecx
div rcx   ;RAX = RDX:RAX / RCX

See: https://www.godbolt.org/z/VBK4R71

According to the answer to this question, the 128bit/64bit div instruction is not faster than the 64bit/32bit div instruction.

This is a problem because it unnecessarily slows down my DSP algorithm which makes millions of these scaled divisions.

I tested this optimization by patching the executable to use the 64bit/32bit div instruction: The performance increased 28% according to the two timestamps yielded by the rdtsc instructions.

(Editor's note: presumably on some recent Intel CPU. AMD CPUs don't need this micro-optimization, as explained in the linked Q&A.)

George Robinson
  • have you considered using https://libdivide.com – Joseph Wood Jul 04 '19 at 00:38
  • @JosephWood: I looked at it but it does not offer the control of the hardware division width. – George Robinson Jul 04 '19 at 07:55

2 Answers


No current compilers (gcc/clang/ICC/MSVC) will do this optimization from portable ISO C source, even if you let them prove that b < a so the quotient will fit in 32 bits. (For example with GNU C if(b>=a) __builtin_unreachable(); on Godbolt). This is a missed optimization; until that's fixed, you have to work around it with intrinsics or inline asm.
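For reference, here is the GNU C version of that hint (a sketch; the function name is illustrative, `__builtin_unreachable` is GCC/Clang-specific, and current compilers still emit a 128/64-bit `div` for it):

```c
#include <stdint.h>

// Promise the compiler that a > b, so the quotient fits in 32 bits.
// GCC/Clang/ICC still emit a 64-bit div here: that's the missed optimization.
uint32_t ScaledDiv_hint(uint32_t a, uint32_t b)
{
    if (a <= b)
        __builtin_unreachable();   // caller guarantees a > b
    return (uint32_t)(((uint64_t)b << 32) / a);
}
```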

(Or use a GPU or SIMD instead; if you have the same divisor for many elements see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly.)


_udiv64 is available starting in Visual Studio 2019 RTM.

In C mode (-TC) it's apparently always defined. In C++ mode you need to #include <immintrin.h> (or intrin.h), as per the Microsoft docs.

https://godbolt.org/z/vVZ25L (Or on Godbolt.ms, because recent MSVC on the main Godbolt site is not working; see footnote 1.)

#include <stdint.h>
#include <immintrin.h>       // defines the prototype

// pre-condition: a > b else 64/32-bit division overflows
uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    uint32_t remainder;
    uint64_t d = ((uint64_t) b) << 32;
    return _udiv64(d, a, &remainder);
}

int main() {
    uint32_t c = ScaledDiv(5, 4);
    return c;
}

_udiv64 will produce 64/32 div. The two shifts left and right are a missed optimization.

;; MSVC 19.20 -O2 -TC
a$ = 8
b$ = 16
ScaledDiv PROC                                      ; COMDAT
        mov     edx, edx
        shl     rdx, 32                             ; 00000020H
        mov     rax, rdx
        shr     rdx, 32                             ; 00000020H
        div     ecx
        ret     0
ScaledDiv ENDP

main    PROC                                            ; COMDAT
        xor     eax, eax
        mov     edx, 4
        mov     ecx, 5
        div     ecx
        ret     0
main    ENDP

So we can see that MSVC doesn't do constant-propagation through _udiv64, even though in this case it doesn't overflow and it could have compiled main to just mov eax, 0ccccccccH / ret.


UPDATE #2: https://godbolt.org/z/n3Dyp- added a solution using the Intel C++ Compiler, but it is less efficient and defeats constant-propagation because it's inline asm.

#include <stdio.h>
#include <stdint.h>

__declspec(regcall, naked) uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    __asm mov edx, eax
    __asm xor eax, eax
    __asm div ecx
    __asm ret
    // implicit return of EAX is supported by MSVC, and hopefully ICC
    // even when inlining + optimizing
}

int main()
{
    uint32_t a = 3 , b = 4, c = ScaledDiv(a, b);
    printf( "(%u << 32) / %u = %u\n", a, b, c);
    uint32_t d = ((uint64_t)a << 32) / b;
    printf( "(%u << 32) / %u = %u\n", a, b, d);
    return c != d;
}

Footnote 1: Matt Godbolt's main site's non-WINE MSVC compilers are temporarily(?) gone. Microsoft runs https://www.godbolt.ms/ to host the recent MSVC compilers on real Windows, and normally the main godbolt.org site relays to that for MSVC.

It seems godbolt.ms will generate short links, but not expand them again! Full links are better anyway for their resistance to link-rot.

Alex Lopatin
  • ...but the dividend `b` must be scaled up ( shifted left by 32 bits or multiplied by 2^32 ) BEFORE the division. Your example has only a 16 bit shift, which changes things considerably when the dividend is originally 32 bits wide. See: https://godbolt.org/z/1MK73- – George Robinson Jun 19 '19 at 08:56
  • What do you expect if you shift a 32-bit number left by 32 bits in C? 0 or undefined? – Alex Lopatin Jun 19 '19 at 09:06
  • I expect a zero or garbage. That's why I wrote in the original comment, that the dividend must be casted. Although I did not want to suggest a specific casting (as this might be a solution) I have edited the OP and added the minimal casting now because its lack was causing too much confusion. – George Robinson Jun 19 '19 at 09:10
  • Instruction `div` can have a divisor size of 8, 16, 32, or 64 bits, with the dividend accordingly twice that size: 16, 32, 64, or 128 bits. The 128/64 form exists only in the x64 instruction set; this is why, when you switch to 64-bit arguments on x86, you get a function call instead of a `div` instruction. The size of the divisor determines the type of `div`: `int8_t a`, ..., or `int64_t a` – Alex Lopatin Jun 19 '19 at 09:24
  • 1
    The comment above is for `div` instruction in assembly. In C: dividend and divisor cast to the higher bit number. This is why you get 128/64 div when one of them is 64-bit. – Alex Lopatin Jun 19 '19 at 09:57
  • Yes, I know. Originally I wrote that this is for the x86-64 target so we can keep it narrowed down to that architecture. In my function the input argument (the divisor) is 32-bit wide so I would like the type of `div` to be 64bit/32bit. Without casting the dividend it will not compile correctly. Right now I am patching the opcodes in the binary executable to `div r32` (and getting a significant performance increase), but I really would like to do this in a legal and portable manner. How can I update my original question to make it more clear? – George Robinson Jun 19 '19 at 09:59
  • 2
    _udiv64 intrinsic is available starting in Visual Studio 2019 RTM or you can write a separate assembly file. – Alex Lopatin Jun 19 '19 at 10:45
  • OK, I guess I'll have to upgrade to VS2019 and live with the compiler intrinsic `_udiv64`. It is either that ...or binary patching... BTW: Did you see that weird pair of `shl` and `shr` instructions that the compiler has generated, when only `xor eax,eax` and `div ecx` would have been sufficient ? – George Robinson Jun 19 '19 at 11:42
  • Yes, I did notice that in x86 they don't use shift instructions at all, but in x64 they do extra heavy shifting ;-) – Alex Lopatin Jun 19 '19 at 12:13
  • Another option could be Intel C++ Compiler (ICC) if you use Intel CPU. It is integrated with VS: easy to switch from VSC to ICC. It has the inline assembly for x64. It will probably produce the fastest optimized code. – Alex Lopatin Jun 19 '19 at 13:25
  • I forgot about the Intel C++ Compiler integration with VS ! Does it have an intrinsic for the unsigned 64bit/32bit division ? – George Robinson Jun 19 '19 at 20:20
  • Check your last UPDATED godbolt.org link. It stopped working! – George Robinson Jun 19 '19 at 20:50
  • Yes, it stopped because of godbolt: they removed all MSVC compilers from the list. I didn't find an intrinsic for a single division, only for two/four divisions: _mm_div_epu64/_mm256_div_epu64. Calculates quotient of a division operation. Vector variant of div() function for unsigned 64-bit integer arguments. – Alex Lopatin Jun 20 '19 at 02:17
  • Added a solution with Intel C++ Compiler. – Alex Lopatin Jun 20 '19 at 04:08
  • 1
    @GeorgeRobinson: https://godbolt.org/ does still have WINE versions of MSVC for C and C++. Weird that Matt removed the other MSVC versions though; they'll hopefully be back. – Peter Cordes Jun 20 '19 at 18:57
  • 1
    @GeorgeRobinson and Alex: `_mm256_div_epu64` isn't a real intrinsic, it's a (slow) helper function. If you want repeated division by the *same* divisor, look at https://libdivide.com/. You might need to hand-roll your own stuff for this left-shifted-dividend thing, because that probably factors out when using a multiplicative inverse for division, but the libdivide library functions probably don't have a way to handle just 32-bit high-half-only SIMD inputs. – Peter Cordes Jun 20 '19 at 19:00
  • 1
    @AlexLopatin: the whole first half of this answer is useless, and probably left over from a badly written first version of the question. The only useful part is `_udiv64`. Also, your first 2 asm output blocks are from un-optimized code. That's why it's spilling the args to the shadow space and reloading. That's not useful. – Peter Cordes Jun 20 '19 at 19:07
  • 1
    Update: compiler-explorer with up-to-date MSVC is still available run by Microsoft on https://www.godbolt.ms/. Normally Matt Godbolt's https://godbolt.org/ proxies to that site for Windows installs of MSVC (Explanation at https://github.com/mattgodbolt/compiler-explorer/issues/1168#issuecomment-485438416). That's the part that's currently down, but you can use x64 MSVC 19.20 on https://www.godbolt.ms/ directly. – Peter Cordes Jun 20 '19 at 19:25
  • @PeterCordes: Thank you for the valuable input and editing my answer. Sorry, I can't upvote the answer but can do this to your comments :-) – Alex Lopatin Jun 20 '19 at 21:33
  • 1
    Glad you liked my edit. I'm not worried about rep. If I was, I'd have posted my own answer. :P I thought it was more useful for SO to have one good answer, and the good answer was right there buried inside your existing one. – Peter Cordes Jun 20 '19 at 21:36
  • @GeorgeRobinson and Alex: a `naked` function can't inline. `call`/`ret` overhead is probably less bad than bouncing the inputs through memory, but both ways suck. (And `naked` isn't 100% obviously better, because `asm` inside a regular function can inline into a loop. It still can only take its inputs as memory operands, though.) If you only care about ICC, not 32-bit MSVC, **ICC supports GNU C inline asm syntax** which is lets you avoid both overheads. It still blocks constant-propagation, but see [this example](//stackoverflow.com/a/10781271). – Peter Cordes Jun 21 '19 at 04:57
  • See also [What is the difference between 'asm', '\_\_asm' and '\_\_asm\_\_'?](//stackoverflow.com/a/35959859) for an example wrapping 64-bit / 32-bit => 32-bit `idiv`. (Use it for `div` by just changing the mnemonic and the types to unsigned.) – Peter Cordes Jun 21 '19 at 05:17

@Alex Lopatin's answer shows how to use _udiv64 to get non-terrible scalar code (despite MSVC's stupid missed optimization shifting left/right).

For compilers that support GNU C inline asm (including ICC), you can use that instead of the inefficient MSVC inline asm syntax that has a lot of overhead for wrapping a single instruction. See What is the difference between 'asm', '__asm' and '__asm__'? for an example wrapping 64-bit / 32-bit => 32-bit idiv. (Use it for div by just changing the mnemonic and the types to unsigned.) GNU C doesn't have an intrinsic for 64 / 32 or 128 / 64 division; it's supposed to optimize pure C. But unfortunately GCC / Clang / ICC have missed optimizations for this case even using if(a<=b) __builtin_unreachable(); to promise that a>b.
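A minimal sketch of such a GNU C wrapper (assumes x86-64 GCC/Clang/ICC; the function names are illustrative, and the caller must guarantee the quotient fits in 32 bits, otherwise `div` raises #DE):

```c
#include <stdint.h>

// 64-bit / 32-bit => 32-bit unsigned division using one `div` instruction.
// Precondition: dividend / divisor must fit in 32 bits, else the CPU faults.
static inline uint32_t udiv64_32(uint64_t dividend, uint32_t divisor)
{
    uint32_t quotient, remainder;
    __asm__("divl %[v]"
            : "=a"(quotient), "=d"(remainder)
            : [v] "r"(divisor),
              "a"((uint32_t)dividend),           // low half in EAX
              "d"((uint32_t)(dividend >> 32)));  // high half in EDX
    return quotient;
}

// The question's function: (b << 32) / a, assuming a > b.
static inline uint32_t ScaledDiv_asm(uint32_t a, uint32_t b)
{
    return udiv64_32((uint64_t)b << 32, a);
}
```

Unlike the MSVC `__asm` block, this can inline into a loop with no call/ret overhead, although it still blocks constant-propagation.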


But that's still scalar division, with pretty poor throughput.

Perhaps you can use a GPU for your DSP task? If you have a large enough batch of work (and the rest of your algorithm is GPU-friendly) then it's probably worth the overhead of the communication round trip to the GPU.

If you're using the CPU, then anything we can suggest will benefit from parallelizing over multiple cores, so do that for more throughput.


x86 SIMD (SSE4/AVX2/AVX512*) doesn't have SIMD integer division in hardware. The Intel SVML functions _mm_div_epu64 and _mm256_div_epu64 are not intrinsics for a real instruction, they're slow functions that may unpack to scalar or compute multiplicative inverses. Or whatever other trick they use; possibly the 32-bit division functions convert to SIMD vectors of double, especially if AVX512 is available. (Intel still calls them "intrinsics" maybe because they're like built-in functions that the compiler understands and can do constant-propagation through. They're probably as efficient as they can be, but that's "not very", and they need to handle the general case, not just your special case with the low half of one dividend being all zero and the quotient fitting in 32 bits.)

If you have the same divisor for many elements, see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly. (You should adapt that technique to bake in the shifting of the dividend without actually doing it, leaving the all-zero low half implicit.)
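The core trick can be sketched in scalar GNU C (this is not libdivide's actual API; `__int128` is a GCC/Clang extension, and real libdivide handles more cases and picks tighter constants):

```c
#include <stdint.h>

// Invariant-divisor division via a precomputed 64-bit fixed-point reciprocal.
// Pay for one hardware division in recip_init(), then each recip_div() is a
// widening multiply plus one fix-up compare.
typedef struct { uint64_t m; uint32_t d; } udiv_recip;

static udiv_recip recip_init(uint32_t d)   // d != 0
{
    udiv_recip r = { UINT64_MAX / d, d };  // m = floor((2^64 - 1) / d)
    return r;
}

static uint32_t recip_div(uint32_t n, udiv_recip r)
{
    // floor(n * m / 2^64) is either exact or one too small for 32-bit n.
    uint32_t q = (uint32_t)(((unsigned __int128)n * r.m) >> 64);
    if (((uint64_t)q + 1) * r.d <= n)      // one correction step makes it exact
        q++;
    return q;
}
```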

If your divisor is always varying, and this isn't a middle step in some larger SIMD-friendly algorithm, scalar division may well be your best bet if you need exact results.


You could get big speedups from using SIMD float if 24-bit mantissa precision is sufficient

uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    return ((1ULL<<32) * (float)b) / a;
}

(float)(1ULL<<32) is a compile-time constant 4294967296.0f.

This does auto-vectorize over an array, with gcc and clang even without -ffast-math (but not MSVC). See it on Godbolt. You could port gcc or clang's asm back to intrinsics for MSVC; they use some FP tricks for packed-conversion of unsigned integers to/from float without AVX512. Non-vectorized scalar FP will probably be slower than plain integer on MSVC, as well as less accurate.
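For example, an array version of the trick (illustrative name; precision is only ~24 bits, and a[i] > b[i] is assumed so the quotient fits in 32 bits) that gcc and clang at -O2 turn into packed conversions plus `divps`:

```c
#include <stdint.h>
#include <stddef.h>

// q[i] ~= (b[i] << 32) / a[i], computed in float.
// Requires a[i] > b[i]; accurate to about 24 significant bits.
void scaled_div_array(uint32_t *q, const uint32_t *a, const uint32_t *b,
                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        q[i] = (uint32_t)(4294967296.0f * (float)b[i] / (float)a[i]);
}
```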

For example, Skylake's div r32 throughput is 1 per 6 cycles. But its AVX vdivps ymm throughput is one instruction (of 8 floats) per 5 cycles. Or for 128-bit SSE2, divps xmm has one per 3 cycles throughput. So you get about 10x the division throughput from AVX on Skylake. (8 * 6/5 = 9.6) Older microarchitectures have much slower SIMD FP division, but also somewhat slower integer division. In general the ratio is smaller because older CPUs don't have as wide SIMD dividers, so 256-bit vdivps has to run the 128-bit halves through separately. But there's still plenty of gain to be had, like better than a factor of 4 on Haswell. And Ryzen has vdivps ymm throughput of 6c, but div r32 throughput of 14-30 cycles. So that's an even bigger speedup than on Skylake.

If the rest of your DSP task can benefit from SIMD, the overall speedup should be very good. float operations have higher latency, so out-of-order execution has to work harder to hide that latency and overlap execution of independent loop iterations. So IDK whether it would be better for you to just convert to float and back for this one operation, or to change your algorithm to work with float everywhere. It depends what else you need to do with your numbers.


If your unsigned numbers actually fit into signed 32-bit integers, you can use direct hardware support for packed SIMD int32 -> float conversion. Otherwise you need AVX512F for packed uint32 -> float with a single instruction, but that can be emulated with some loss of efficiency. That's what gcc/clang do when auto-vectorizing with AVX2, and why MSVC doesn't auto-vectorize.

MSVC does auto-vectorize with int32_t instead of uint32_t (and gcc/clang can make more efficient code), so prefer that if the highest bit of your integer inputs and/or outputs can't be set. (i.e. the 2's complement interpretation of their bit-patterns will be non-negative.)

With AVX especially, vdivps is slow enough to mostly hide the throughput costs of converting from integer and back, unless there's other useful work that could have overlapped instead.


Floating point precision:

A float stores numbers as significand * 2^exp where the significand is in the range [1.0, 2.0). (Or [0, 1.0) for subnormals). A single-precision float has 24-bits of significand precision, including the 1 implicit bit.

https://en.wikipedia.org/wiki/Single-precision_floating-point_format

So the 24 most-significant bits of an integer can be represented, the rest lost to rounding error. An integer like (uint64_t)b << 32 is no problem for float; that just means a larger exponent. The low bits are all zero.

For example, b = 123105810 gives us 528735427897589760 for b64 << 32. Converting that to float directly from 64-bit integer gives us 528735419307655168, a rounding error of 0.0000016%, or about 2^-25.8. That's unsurprising: the max rounding error is 0.5ulp (units in the last place), or 2^-25, and this number was even so it had 1 trailing zero anyway. That's the same relative error we'd get from converting 123105810; the resulting float is also the same except for its exponent field (which is higher by 32).

(I used https://www.h-schmidt.net/FloatConverter/IEEE754.html to check this.)
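The same check can be done in a few lines of code (a sketch assuming the default round-to-nearest FP mode; the helper name is made up):

```c
#include <stdint.h>

// Convert to float and back. For b64 = 123105810 << 32 (a 59-bit value),
// float keeps only the 24 most-significant bits, rounding 123105810 down
// to 123105808 in the high bits; the back-conversion is then exact.
uint64_t float_roundtrip(uint64_t x)
{
    return (uint64_t)(float)x;
}
```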

float's max exponent is large enough to hold integers outside the INT64_MIN to INT64_MAX range. The low bits of the large integers that float can represent are all zero, but that's exactly what you have with b<<32. So you're only losing the low 9 bits of b in the worst case where it's full-range and odd.

If the important part of your result is the most-significant bits, and having the low ~9 integer bits = rounding error is ok after converting back to integer, then float is perfect for you.

If float doesn't work, double may be an option.

divpd is about twice as slow as divps on many CPUs, and only does half as much work (2 double elements instead of 4 float). So you lose a factor of 4 throughput this way.

But every 32-bit integer can be represented exactly as a double. And by converting back with truncation towards zero, I think you get exact integer division for all pairs of inputs, unless double-rounding is a problem (first to nearest double, then truncation). You can test it with

// exactly correct for most inputs at least, maybe all.
uint32_t quotient = ((1ULL<<32) * (double)b) / a;

The unsigned long long constant (1ULL<<32) is converted to double, so you have 2x u32 -> double conversions (of a and b), a double multiply, a double divide, and a double -> u32 conversion. x86-64 can do all of these efficiently with scalar conversions (by zero extending uint32_t into int64_t, or ignoring the high bits of a double->int64_t conversion), but it will probably still be slower than div r32.

Converting u32 -> double and back (without AVX512) is maybe even more expensive than converting u32 -> float, but clang does auto-vectorize it. (Just change float to double in the godbolt link above.) Again, it would help a lot if your inputs were all <= INT32_MAX so they could be treated as signed integers for FP conversion.
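A minimal harness for that comparison (illustrative names; the assertions below stick to inputs where the divide's rounding provably can't cross an integer boundary, since near-2^32 corner cases are exactly where double-rounding could bite):

```c
#include <stdint.h>

// Double-based scaled division: the divide rounds to nearest double,
// then the cast truncates toward zero.
uint32_t scaled_div_dbl(uint32_t a, uint32_t b)
{
    return (uint32_t)(((1ULL << 32) * (double)b) / a);
}

// Exact reference via 64-bit integer division (requires a > b).
uint32_t scaled_div_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)b << 32) / a);
}
```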

If double-rounding is a problem, you could maybe set the FP rounding mode to truncation instead of the default round-to-nearest, if you don't use FP for anything else in the thread where your DSP code is running.

Peter Cordes
  • Intel says it is "Intrinsics for Division Operations". So, why you refer to the Intel site and say it is not intrinsic? – Alex Lopatin Jun 20 '19 at 22:55
  • @AlexLopatin: Because it's really a function. Intel doesn't distinguish between real intrinsics (for CPU instructions) vs. their SVML library functions, but the difference in performance is very real. I guess I disagree with Intel over the meaning of the word "intrinsic". `_mm_sin_ps` is clearly not an intrinsic, for example. `_mm_set_epi32(a, b, c, d)` is not really an intrinsic (there's no instruction that reads 4 integer registers or memory locations), but there's no algorithm or computation that it's emulating so we might as well call it an intrinsic. – Peter Cordes Jun 20 '19 at 22:56
  • Ok, I see - the prototypes. But Intel already has the hardware according to https://en.wikipedia.org/wiki/AVX-512 and here is one https://www.intel.com/content/www/us/en/products/processors/core/x-series/i9-7980xe.html for $1803 – Alex Lopatin Jun 20 '19 at 23:16
  • @AlexLopatin: what AVX512 instruction are you talking about? AVX512 still only has SIMD *FP* division which can't emulate `_mm_div_epu64` exactly. You can convert unsigned 64-bit integers to `double` with AVX512, but that introduces rounding. Oh and re: "intrinsics": those division functions are under the parent heading "Intrinsics for Short Vector Math Library Operations". I guess it makes sense to call them intrinsics if the compiler understands what they do and how to optimize / transform them, even though they don't map to a single instruction. – Peter Cordes Jun 20 '19 at 23:33
  • @Peter Cordes: My embedded Windows system does not have a GPU and I prefer not to mix floats with integers, The entire DSP algorithm is done on integers. – George Robinson Jun 20 '19 at 23:44
  • @GeorgeRobinson: Not even an Intel integrated GPU? I assume it is an Intel CPU from the benchmark speedup you got, though. An iGPU can be used for OpenCL. Anyway, some parts of this answer are only going to be useful to future readers with the same problem in a different context, not you. – Peter Cordes Jun 20 '19 at 23:47
  • @GeorgeRobinson: *I prefer not to mix floats with integers* Enjoy integer division being slow, then. If you want to take advantage of the SIMD hardware dividers, you need to use `float` or `double`. (Converting to `double` makes division slower, and fewer elements per vector, so the possible speedup is lower. But it might be possible to get results that are exact for every possible pair of 32-bit inputs.) – Peter Cordes Jun 20 '19 at 23:49
  • @PeterCordes: Yes, you are right about _mm_div_epu64. I cannot find any correspondent instruction in AVX-512 set. Intel confused me with putting it under "Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions" umbrella in "Intel® C++ Compiler 19.0 Developer Guide and Reference" document – Alex Lopatin Jun 20 '19 at 23:59
  • 1
    @PeterCordes Might want to get used to it. FP-phobia is a thing and people are willing to sacrifice an order of magnitude in performance to avoid floating-point - even when the usage is provably correct. Just don't tell people that the integer divide is implemented in hardware using the floating-point divide. – Mysticial Jun 21 '19 at 00:12
  • @Mysticial: I'd like to read more about this FP-phobia. Do you know of any good articles? – George Robinson Jun 21 '19 at 00:35
  • @GeorgeRobinson: You're commenting under the wrong answer. But anyway, inline asm is not a great idea; you don't want call/ret overhead inside a hot loop. That's worse than letting MSVC emit it's stupid shifts. I didn't delete it from Alex's answer but I wouldn't recommend it. Compile a function containing a `_udiv64` loop with MSVC and call it from ICC if you want to use ICC for the rest of your program. Or hand-write the loop in asm instead of using compile-generated code that's full of missed optimizations anyway (like MSVC's shifts). Or write a whole loop in asm for ICC. – Peter Cordes Jun 21 '19 at 00:36
  • @Peter. I don't think it has a GPU. It is an old Intel Pentium in a R&S ZVL-3 Network Analyzer. – George Robinson Jun 21 '19 at 00:37
  • @GeorgeRobinson: Interesting! I assumed you had a "normal" mini-ITX or something. But what kind of Pentium? Clearly it's not a P5 Pentium from 1993, that don't have 64-bit mode. Is it a Sandybridge Pentium? 64-bit Pentium 4 (Nocona)? Anyway, either way even if it's a Skylake Pentium, it won't have AVX unfortunately. So you can only use 128-bit SIMD vectors at best, and maybe without SSE4.1 (which some compilers use when auto-vectorizing uint32_t <-> float) – Peter Cordes Jun 21 '19 at 00:43
  • @GeorgeRobinson: updated my code with `__declspec(regcall, naked)`. Thank you! I assumed that you will compile with `inline` but always good to know some extra tricks. – Alex Lopatin Jun 21 '19 at 04:55
  • @Peter: The CPU is the Pentium 4 model 505. I am sorry for making a comment in the wrong section. It was a simple oversight. I have deleted it. I agree with you that a naked function with `ret` will prevent beneficial inlining. Do you know how to force ICC x86-64 to pass the function arguments in registers without creating a stack frame, and generally without mucking with the stack inside the function unnecessarily? `-Oy` does not work. – George Robinson Jun 21 '19 at 11:21
  • @GeorgeRobinson: If you decide to use inline asm, use GNU C inline asm as described in this answer, instead of the inefficient MSVC inline asm. A GNU C inline asm wrapper function can inline to a single instruction inside a loop that the compiler can optimize around. It's designed for wrapping single instructions efficiently. – Peter Cordes Jun 21 '19 at 16:34