I'm making some improvements to a C++ large-integer library primarily targeting Visual C++ (2012 and later) on x64, and I'd really like to improve the speed of my division routine by using wider words.
Right now the operation produces 16 bits of quotient per iteration, with a primitive operation looking basically like this:
uint16_t U[], V[];
uint32_t u = (uint32_t(U[i+1]) << 16) | U[i];
uint16_t v = V[j];
uint16_t q = uint16_t(u/v);
That produces the IDIV instruction with 32-bit operands (and EDX zeroed), which is fine, but slow thanks to the large number of iterations. I'd really like to use IDIV's support for 64/32 or even 128/64 division, but I cannot convince Visual C++ to let me use them. Dividing a 64-bit number by a 32-bit number results in a call to the internal 64/64 div routine, which is not particularly fast, and which is total overkill (since my code makes sure that the quotient will never overflow). And I can't even touch the 128/64 division since there's no support for 128-bit numbers.
Normally this is where intrinsics would come in, but VC++ doesn't seem to offer an intrinsic to use the high operand in division (as it does through __umulh
for multiplication). Without support for inline assembly in x64, it's looking like the only solution is to reimplement the routine entirely in ASM, which I want to avoid if at all possible.
How can I use the full power of IDIV in a VC++ long division routine?