By far the most important optimization you can do is use a fixed-point multiplicative inverse for division by a compile-time constant: Why does GCC use multiplication by a strange number in implementing integer division?.
Any decent C compiler will do that for you, but apparently Delphi won't, so there is a valid reason for doing it with asm.
Can you return a value in EAX instead of storing both the quotient and remainder to memory? Seems like a waste to pass 2 pointer args, and force the caller to retrieve the value from memory. (Update, yes I think you can by making it a function instead of a procedure; I'm just blindly modifying Delphi code from other answers, though.)
Anyway, fortunately we can get a C compiler to do the hard work of figuring out the multiplicative inverse and the shift counts for us. We can even get it to use the same "calling convention" that it looks like Delphi is using for inline asm. GCC's regparm=3
32-bit calling convention passes args in EAX, EDX, and ECX (in that order).
You might want to make a separate version for cases where you only need the quotient, because (unlike the slow div
instruction), you have to compute the remainder separately as x - (x/y)*y
if you're using a fast multiplicative inverse. But yes that's still about twice to 4x as fast on modern x86.
Or you could leave the remainder calculation to be done in pure Delphi, unless the compiler is just terrible at optimizing in general.
#ifdef _MSC_VER
#define CONVENTION _fastcall // not the same, but 2 register args are better than none.
#else
#define CONVENTION __attribute__((regparm(3)))
#endif
// use gcc -Os to get it to emit code with actual div.
divmod10(unsigned x, unsigned *quot, unsigned *rem) {
unsigned tmp = x/10;
// *quot = tmp;
*rem = x%10;
return tmp;
}
From the Godbolt compiler explorer:
# gcc8.2 -O3 -Wall -m32
div10: # simplified version without the remainder, returns in EAX
mov edx, -858993459 # 0xCCCCCCCD
mul edx # EDX:EAX = dividend * 0xCCCCCCCD
mov eax, edx
shr eax, 3
ret
# quotient in EAX
# returns quotient in EAX, stores remainder to [ECX]
# quotient pointer in EDX is unused (and destroyed).
divmod10:
mov edx, -858993459
push ebx
mov ebx, eax
mul edx # EDX:EAX = dividend * 0xCCCCCCCD
mov eax, edx
shr eax, 3
# quotient in EAX = high_half(product) >> 3 = product >> (32+3)
lea edx, [eax+eax*4] # EDX = quotient*5
add edx, edx # EDX = quot * 10
sub ebx, edx # remainder = dividend - quot*10
mov DWORD PTR [ecx], ebx # store remainder
pop ebx
ret
# quotient in EAX
This is C compiler output. Adapt as necessary to Delphi inline asm; the inputs are in the right registers for Delphi, I think.
If Delphi inline-asm doesn't let you clobber EDX, you can save/restore it. Or you want to remove the unused quotient
pointer input, then you can adjust the asm, or adjust the C on Godbolt and look at the new compiler output.
This is more instructions than with div
, but div
is very slow (10 uops, and 26 cycle latency even on Skylake.)
If you have a 64-bit integer type in Delphi, you can do this in Delphi source and avoid inline asm. Or as MBo shows, you can use $CCCD
as a multiplicative inverse for inputs that are in the 0..2^16-1 range using only 32-bit integer types.
For the remainder, the store/reload round trip (4 to 5 cycles) has similar latency to the actual calculation on a recent Intel CPU with mov-elimination (3 + 1 to quotient, + another 3 for the lea/add/sub = 7), so having to use inline asm for this is pretty crap. But it's still better than a div
instruction for latency and throughput. See https://agner.org/optimize/ and other performance links in the x86 tag wiki.
Delphi version you can copy/paste
(If I got this right, I don't know Delphi, and just copied+modified examples here on SO and this site, based on what I infer about the calling-convention / syntax)
I'm not sure I got the arg-passing right for inline-asm. This RADStudio documentation says "Except for ESP and EBP, an asm statement can assume nothing about register contents on entry to the statement." But I'm assuming args are in EAX and EDX.
Using asm for 64-bit code might be silly, because in 64-bit you can efficiently use pure Pascal for 64-bit multiplication. How do I implement an efficient 32 bit DivMod in 64 bit code. So in the {$IFDEF CPUX64}
blocks, the best choice might be pure pascal using UInt64(3435973837)*num;
function Div10(Num: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
Result := Num div 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
MOV EDX, $CCCCCCCD
MUL EDX // EDX:EAX = Num * fixed-point inverse
MOV EAX,EDX // mov then overwrite is ideal for Intel mov-elimination
SHR EAX,3
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
// TODO: use pure pascal for this; Uint64 is efficient on x86-64
// Num in ECX, upper bits of RCX possibly contain garbage?
mov eax, ecx // zero extend Num into RAX
mov ecx, $CCCCCCCD // doesn't quite fit in a sign-extended 32-bit immediate for imul
imul rax, rcx // RAX = Num * fixed-point inverse
shr rax, 35 // quotient = eax
end;
{$ENDIF CPUX64}
{$ENDIF}
{Remainder is the function return value}
function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
Quotient := Num div 10;
Result := Num mod 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
// Num in EAX, @Quotient in EDX
push esi
mov ecx, edx // save @quotient
mov edx, $CCCCCCCD
mov esi, eax // save dividend for use in remainder calc
mul edx // EDX:EAX = dividend * 0xCCCCCCCD
shr edx, 3 // EDX = quotient
mov [ecx], edx // store quotient into @quotient
lea edx, [edx + 4*edx] // EDX = quot * 5
add edx, edx // EDX = quot * 10
mov eax, esi // off the critical path
sub eax, edx // Num - (Num/10)*10
pop esi
// Remainder in EAX = return value
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
// TODO: use pure pascal for this? Uint64 is efficient on x86-64
// Num in ECX, @Quotient in RDX
mov r8d, ecx // zero-extend Num into R8
mov eax, $CCCCCCCD
imul rax, r8
shr rax, 35 // quotient in eax
lea ecx, [rax + 4*rax]
add ecx, ecx // ecx = 10*(Num/10)
mov [rdx], eax // store quotient
mov eax, r8d // copy Num again
sub eax, ecx // remainder = Num - 10*(Num/10)
// we could have saved 1 mov instruction by returning the quotient
// and storing the remainder. But this balances latency better.
end;
{$ENDIF CPUX64}
{$ENDIF}
Storing the quotient and returning the remainder means that both might be ready at about the same time in the caller, because the extra latency of computing the remainder from the quotient overlaps with the store-forwarding. IDK if that's good, or if having out-of-order execution get started on some work based on the quotient is more often better. I'm going to guess that if you call DivMod10, you might only want the remainder.
But in a split-into-decimal-digits loop that repeatedly divides by 10, it's the quotient that forms the critical path, so a version of this that returned the quotient and stored the remainder would be a much better choice there.
In that case you'd make the quotient the return value in EAX, and rename the function arg to the remainder.
The asm is based on clang output for this version of this C function (https://godbolt.org/z/qu2kvV), targeting the Windows x64 calling convention. But with some tweaks to make it more efficient, e.g. taking mov
off the critical path, and using different registers to avoid REX prefixes. And replacing one LEA with just an ADD.
unsigned divmod10(unsigned x, unsigned *quot) {
unsigned qtmp = x/10;
unsigned rtmp = x%10;
*quot = qtmp;
//*rem = rtmp;
return rtmp;
}
I used clang's version instead of gcc's because imul r64,r64
is faster on Intel CPUs and Ryzen (3 cycle latency / 1 uop). mul r32
is 3 uops, and only 1 per 2 clocks throughput on Sandybridge-family. I think the multiply hardware naturally produces a 128-bit result, and splitting the low 64 of that into edx:eax takes an extra uop, or something like that.