5

In Delphi math.pas unit there is a procedure DivMod that i want to convert it into inline and optimize it for divisor to be always 10 . But I dont know details of Pentagon ASM . What is the conversion of bellow procedure

 procedure DivMod(Dividend: Integer; Divisor: Word;
  var Result, Remainder: Word);
asm
        PUSH    EBX
        MOV     EBX,EDX
        MOV     EDX,EAX
        SHR     EDX,16
        DIV     BX
        MOV     EBX,Remainder
        MOV     [ECX],AX
        MOV     [EBX],DX
        POP     EBX
end;
  • To optimize for a constant divisor of 10, you don't want to use the `div` instruction at all. [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935). Are you sure your compiler doesn't already emit efficient code using a multiplicative inverse when you use plain Delphi code? If your divisor is a constant 10, you don't need any weirdness with half-width results, just let everything be 32-bit integer. – Peter Cordes Oct 29 '18 at 04:18
  • @Peter Cordes Delphi compiler optimizes division by powers of two, but does not make efficient division for other constants – MBo Oct 29 '18 at 05:10
  • @MBo: wow, that's pretty crappy. All the mainstream C compilers have been doing it for nearly 3 decades. (e.g. https://gmplib.org/~tege/divcnst-pldi94.pdf from 1994 shows results from the already-existing implementation in gcc, and documents the math. It's a very large throughput win on modern x86 with 1-per-clock multiplication, like a factor of 5 to 10 if you're doing a lot of division. Or maybe half that if you also need remainder.) – Peter Cordes Oct 29 '18 at 05:36
  • 2
    See also this answer: [How do I implement an efficient 32 bit DivMod in 64 bit code](https://stackoverflow.com/a/20287946/576719) – LU RD Oct 29 '18 at 07:31
  • 2
    "Pentagon ASM"? Did you mean "Pentium"? Anyway, for 10, the best way is indeed the multiplication with a constant (and, if necessary, a shift right). @Peter Cordes: Delphi does not optimize a lot. It has become better, but is not nearly like GCC. – Rudy Velthuis Oct 29 '18 at 10:11
  • @PeterCodes: FWIW, the new 32 bit and 64 bit Clang-based compilers for Delphi's sister product, C++Builder, do use a multiplicatie inverse for a division by a constant. Delphi doesn't (yet). – Rudy Velthuis Oct 29 '18 at 12:12
  • 1
    I am sorry it is not Pentagon but Pentium CPU assembly language instructions. I simply want to make the above call into an inline statements for faster execution. Using inline will produce larger code but will be faster because no PUSH and POP instructions will be used – Omid Motahed Oct 29 '18 at 12:21
  • I am using Delphi 7 . I use DivMod to convert a large array of integers into displayable format. My large integer calculator written in Delphi 7 can be downloaded from http://omidmotahed.com/highpowersource.zip – Omid Motahed Oct 29 '18 at 13:01
  • FWIW, you can't inline assembler in Delphi. But note that the "efficient 32 bit DivMod" LURD linked to is faster in Win64, but not in Win32. I wrote a 32 bit fast implementation an hour ago, at home (i.e. for Win32), but I don't have it here. I can post it this evening. – Rudy Velthuis Oct 29 '18 at 14:00
  • Rudy I like to see your code. I was able to remove the constant divisor 10 and further optimize it by introducing 2 global variables K1, K2 var L1,L2 : WORD; procedure DivMod2(Dividend: Integer); ASM MOV EBX,10 MOV EDX,EAX SHR EDX,16 DIV BX MOV L1,AX MOV L2,DX end; – Omid Motahed Oct 29 '18 at 14:12
  • 1
    It's as if you aren't reading any of this. Don't use `DIV`. It's slower than multiply and shift. – David Heffernan Oct 29 '18 at 15:31
  • 1
    @Omid: Delphi uses register calling convention, and that doesn't push or pop its parameters, only the return address. But if you inline, the generated code may have to push or pop anyway (or store some registers on the stack), to free up the registers for the inlined code. Inlining can be an advantage, but doesn't have to be. Don't expect too much of it. Some code, e.g. code using managed types, and therefore requiring a hidden try-finally around it, can even make your code slower, if inlined. – Rudy Velthuis Oct 29 '18 at 16:57
  • @Omid: if you really want to speed up your code, a non-inlined assembler routine can be much faster than an inlined pure Pascal routine. But to see where you can optimize pure Pascal, you'll have to take a look at the generated code in the CPU window and try to trick the compiler. That is not always possible. In assembler, you can use registers as you like. That can be much better, if you do it right. Sometimes it can even make sense to use 16 bit types, in Win32 (and 32 bit types in Win64). Pure Pascal also has problems with code requiring a carry. For that, assembler is much easier. – Rudy Velthuis Oct 29 '18 at 17:01
  • Thanks for your comments and suggestion. My code is now run much faster with your suggestions. – Omid Motahed Oct 30 '18 at 01:11
  • Why do you keep reverting typo fixes? [*bellow* is a different English word](https://www.google.ca/search?q=bellow+definition), and it's x86, not Pentagon. (Pentium is an obsolete x86 CPU from 20 years ago, and also a brand name Intel still uses for low-end CPUs. It makes no sense to use it here unless you're *actually* optimizing the asm for P5 Pentium from 1993, a dual-issue in-order microarchitecture.) – Peter Cordes Oct 30 '18 at 04:41
  • Sorry the edit box was offered to me and I though I must edit my typo mistake. This system is far different than Facebook messaging that I mostly use and confusing to me. But far better experts here – Omid Motahed Oct 30 '18 at 07:57

4 Answers4

6

By far the most important optimization you can do is use a fixed-point multiplicative inverse for division by a compile-time constant: Why does GCC use multiplication by a strange number in implementing integer division?.

Any decent C compiler will do that for you, but apparently Delphi won't, so there is a valid reason for doing it with asm.

Can you return a value in EAX instead of storing both the quotient and remainder to memory? Seems like a waste to pass 2 pointer args, and force the caller to retrieve the value from memory. (Update, yes I think you can by making it a function instead of a procedure; I'm just blindly modifying Delphi code from other answers, though.)

Anyway, fortunately we can get a C compiler to do the hard work of figuring out the multiplicative inverse and the shift counts for us. We can even get it to use the same "calling convention" that it looks like Delphi is using for inline asm. GCC's regparm=3 32-bit calling convention passes args in EAX, EDX, and ECX (in that order).

You might want to make a separate version for cases where you only need the quotient, because (unlike the slow div instruction), you have to compute the remainder separately as x - (x/y)*y if you're using a fast multiplicative inverse. But yes that's still about twice to 4x as fast on modern x86.

Or you could leave the remainder calculation to be done in pure Delphi, unless the compiler is just terrible at optimizing in general.

#ifdef _MSC_VER
#define CONVENTION  _fastcall   // not the same, but 2 register args are better than none.
#else
#define CONVENTION __attribute__((regparm(3)))
#endif

// use gcc -Os to get it to emit code with actual div.

divmod10(unsigned x, unsigned *quot, unsigned *rem) {
    unsigned tmp = x/10;
    // *quot = tmp;
    *rem = x%10;
    return tmp;
}

From the Godbolt compiler explorer:

# gcc8.2  -O3 -Wall -m32
div10:    # simplified version without the remainder, returns in EAX
        mov     edx, -858993459     # 0xCCCCCCCD
        mul     edx                 # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax, edx
        shr     eax, 3
        ret
      # quotient in EAX

# returns quotient in EAX, stores remainder to [ECX]
# quotient pointer in EDX is unused (and destroyed).
divmod10:
        mov     edx, -858993459
        push    ebx
        mov     ebx, eax
        mul     edx                      # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax, edx
        shr     eax, 3
        # quotient in EAX = high_half(product) >> 3 = product >> (32+3)
        lea     edx, [eax+eax*4]         # EDX = quotient*5
        add     edx, edx                 # EDX = quot * 10
        sub     ebx, edx                 # remainder = dividend - quot*10
        mov     DWORD PTR [ecx], ebx     # store remainder
        pop     ebx
        ret
        # quotient in EAX

This is C compiler output. Adapt as necessary to Delphi inline asm; the inputs are in the right registers for Delphi, I think.

If Delphi inline-asm doesn't let you clobber EDX, you can save/restore it. Or you want to remove the unused quotient pointer input, then you can adjust the asm, or adjust the C on Godbolt and look at the new compiler output.

This is more instructions than with div, but div is very slow (10 uops, and 26 cycle latency even on Skylake.)

If you have a 64-bit integer type in Delphi, you can do this in Delphi source and avoid inline asm. Or as MBo shows, you can use $CCCD as a multiplicative inverse for inputs that are in the 0..2^16-1 range using only 32-bit integer types.

For the remainder, the store/reload round trip (4 to 5 cycles) has similar latency to the actual calculation on a recent Intel CPU with mov-elimination (3 + 1 to quotient, + another 3 for the lea/add/sub = 7), so having to use inline asm for this is pretty crap. But it's still better than a div instruction for latency and throughput. See https://agner.org/optimize/ and other performance links in the x86 tag wiki.


Delphi version you can copy/paste

(If I got this right, I don't know Delphi, and just copied+modified examples here on SO and this site, based on what I infer about the calling-convention / syntax)

I'm not sure I got the arg-passing right for inline-asm. This RADStudio documentation says "Except for ESP and EBP, an asm statement can assume nothing about register contents on entry to the statement." But I'm assuming args are in EAX and EDX.

Using asm for 64-bit code might be silly, because in 64-bit you can efficiently use pure Pascal for 64-bit multiplication. How do I implement an efficient 32 bit DivMod in 64 bit code. So in the {$IFDEF CPUX64} blocks, the best choice might be pure pascal using UInt64(3435973837)*num;

function Div10(Num: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Result := Num div 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
        MOV     EDX, $CCCCCCCD
        MUL     EDX                   // EDX:EAX = Num * fixed-point inverse
        MOV     EAX,EDX               // mov then overwrite is ideal for Intel mov-elimination
        SHR     EAX,3
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
         // TODO: use pure pascal for this; Uint64 is efficient on x86-64
        // Num in ECX, upper bits of RCX possibly contain garbage?
        mov     eax, ecx              // zero extend Num into RAX
        mov     ecx, $CCCCCCCD        // doesn't quite fit in a sign-extended 32-bit immediate for imul
        imul    rax, rcx              // RAX = Num * fixed-point inverse
        shr     rax, 35               // quotient = eax
end;
{$ENDIF CPUX64}
{$ENDIF}

 {Remainder is the function return value}
function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Quotient := Num div 10;
  Result := Num mod 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
    // Num in EAX,  @Quotient in EDX
    push    esi
    mov     ecx, edx           // save @quotient
    mov     edx, $CCCCCCCD
    mov     esi, eax           // save dividend for use in remainder calc
    mul     edx                // EDX:EAX = dividend * 0xCCCCCCCD
    shr     edx, 3             // EDX = quotient
    mov     [ecx], edx         // store quotient into @quotient

    lea     edx, [edx + 4*edx] // EDX = quot * 5
    add     edx, edx           // EDX = quot * 10
    mov     eax, esi                  // off the critical path
    sub     eax, edx           // Num - (Num/10)*10
    pop     esi
    // Remainder in EAX = return value
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
        // TODO: use pure pascal for this?  Uint64 is efficient on x86-64
    // Num in ECX,   @Quotient in RDX
    mov     r8d, ecx          // zero-extend Num into R8
    mov     eax, $CCCCCCCD
    imul    rax, r8
    shr     rax, 35           // quotient in eax

    lea     ecx, [rax + 4*rax]
    add     ecx, ecx          // ecx = 10*(Num/10)
    mov     [rdx], eax        // store quotient

    mov     eax, r8d          // copy Num again
    sub     eax, ecx          // remainder = Num - 10*(Num/10)
    // we could have saved 1 mov instruction by returning the quotient
    // and storing the remainder.  But this balances latency better.
end;
{$ENDIF CPUX64}
{$ENDIF}

Storing the quotient and returning the remainder means that both might be ready at about the same time in the caller, because the extra latency of computing the remainder from the quotient overlaps with the store-forwarding. IDK if that's good, or if having out-of-order execution get started on some work based on the quotient is more often better. I'm going to guess that if you call DivMod10, you might only want the remainder.

But in a split-into-decimal-digits loop that repeatedly divides by 10, it's the quotient that forms the critical path, so a version of this that returned the quotient and stored the remainder would be a much better choice there.

In that case you'd make the quotient the return value in EAX, and rename the function arg to the remainder.


The asm is based on clang output for this version of this C function (https://godbolt.org/z/qu2kvV), targeting the Windows x64 calling convention. But with some tweaks to make it more efficient, e.g. taking mov off the critical path, and using different registers to avoid REX prefixes. And replacing one LEA with just an ADD.

unsigned divmod10(unsigned x, unsigned *quot) {
    unsigned qtmp = x/10;
    unsigned rtmp = x%10;
     *quot = qtmp;
     //*rem = rtmp;
    return rtmp;
}

I used clang's version instead of gcc's because imul r64,r64 is faster on Intel CPUs and Ryzen (3 cycle latency / 1 uop). mul r32 is 3 uops, and only 1 per 2 clocks throughput on Sandybridge-family. I think the multiply hardware naturally produces a 128-bit result, and splitting the low 64 of that into edx:eax takes an extra uop, or something like that.

David Heffernan
  • 601,492
  • 42
  • 1,072
  • 1,490
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • The c compiler came up with almost the identical code I had. I just used the Delphi example of other DivMods in Delphi, storing both q and r in reference parameters, so I had to shuffle registers a little more. – Rudy Velthuis Oct 29 '18 at 14:03
  • FWIW, your Delphi code looks fine to me. For someone not knowing the language, kudos! Hmmm... not sure if your {$IFDEF} {$ELSE} all have their own {$ENDIF}, but that can easily be checked by the compiler. And the comma in the param list should be a semicolon. I already edited that. – Rudy Velthuis Oct 29 '18 at 19:05
  • @RudyVelthuis: http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Using_Inline_Assembly_Code had a nice example to copy/paste from, combined with your answer to get the function signature right. :) – Peter Cordes Oct 29 '18 at 19:08
  • still, not bad for someone who doesn't know Delphi. – Rudy Velthuis Oct 29 '18 at 19:10
  • 2
    `unless the compiler is just terrible at optimizing in general...` yes. Yes, it is. – J... Oct 29 '18 at 19:47
  • Just some timings among the contributions:32 bit -------------------- DivMod10_64 :3827 DivMod10_Cord :1666 DivMod10_J :1660 DivMod10_XMM :1661 DivMod10_Rudy :1944 64 bit -------------------- DivMod10_64 :2108 DivMod10_Cord :2118 – LU RD Oct 29 '18 at 20:55
  • 1
    Conclusion: For 64 bit, the linked pascal example is probably to prefer. For 32 bit, take your pick. – LU RD Oct 29 '18 at 20:58
  • @LURD: what CPU was that on? \@J... found that the XMM version was measurably fastest for 32-bit on Haswell. – Peter Cordes Oct 29 '18 at 21:03
  • Intel i7-4790, so that should be an Haswell. – LU RD Oct 29 '18 at 21:08
  • 2
    @Peter I fixed a couple of compilation errors. The code seems to work correctly for the 32 bit version, but the 64 bit version of `Div10` is a little broken. I'll look into what's up. – David Heffernan Oct 29 '18 at 21:11
  • OK, so in the x64 version of `Div10` the argument is passed in ecx which you immediately overwrite with eax. I guess you can fix that by swapping eax and ecx in the first two lines. But I'm not sure what you mean about not assuming upper 32 bits are zero. Can you assume that? – David Heffernan Oct 29 '18 at 21:19
  • and the x64 version of `DivMod10` is also broken because again it is coded such that the arg arrives in eax but it arrives in ecx. I don't have time to look into this further, sorry. – David Heffernan Oct 29 '18 at 21:21
  • @DavidHeffernan: thanks, fixed the x64 `Div10`, I forgot to correct that after reading the docs for x64 asm while working on `DivMod10`. `DivMod10` *does* assume the args arrives in ECX, RDX, that's why it does `mov r8d, ecx` as the first instruction. If it's broken, it's for a different reason. >.< The docs I found said something about not assuming anything in registers for *inline* asm; can this Pascal function containing a single asm statement inline the way C with inline asm can with both MSVC and GNU-compatible compilers? I should be using `mov r8d, Num`? – Peter Cordes Oct 29 '18 at 21:37
  • OK, the comment said eax not ecx but yeah, the code seems fine. – David Heffernan Oct 29 '18 at 22:11
  • Right, I found and fixed the bug in the x64 `DivMod10` – David Heffernan Oct 30 '18 at 13:31
  • @DavidHeffernan: Ah right, thanks. Definitely the kind of bug that can sneak in when coding without testing. >. – Peter Cordes Oct 30 '18 at 22:28
2

Following on from this answer, you can get some of the performance back in a 32-bit compile by leveraging a hardware 32x32 -> 64bit multiply using SSE:

program Project1;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

procedure DivMod10(num : Cardinal; var q, r : Cardinal);
const
  m : cardinal = 3435973837;
asm
  movd xmm0, m         {move magic number to xmm0}
  movd xmm1, eax       {move num to xmm1}
  pmuludq xmm0, xmm1   {xmm0[0:32] * xmm1[0:32] -> xmm0[0:64] unsigned}
  psrlq xmm0, 35       {right shift xmm0}
  movss [edx], xmm0    {store quotient to q}
  movd edx, xmm0       {recycle edx, store q}
  imul edx, -$A        {edx = q * (-10)}
  add edx, eax         {edx = r}
  mov [ecx], edx       {store r}
end;

var
  q, r, t0, i : cardinal;
begin
  t0 := GetTickCount;
  for I := 1 to 999999999 do DivMod10(i, q, r);
  WriteLn('SSE ASM : ' + IntToStr(GetTickCount - t0));

  t0 := GetTickCount;
  for I := 1 to 999999999 do q := i div 10;
  WriteLn('div : ' + IntToStr(GetTickCount - t0));

  WriteLn('Test correctness...');
  for I := 1 to High(Cardinal) do begin
    DivMod10(i,q,r);
    if (q <> (i div 10)) or (r <> (i mod 10)) then
      WriteLn('Incorrect Result : ' + IntToStr(i));
  end;

  WriteLn('Test complete.');
  Readln;
end.

This produces :

SSE ASM : 2449
div : 3401
Test correctness...
Test complete.

This is not generally safe since you should be checking at runtime if the CPU supports the required SSE instructions (and have a purepascal alternative in place for that case), it is nevertheless increasingly rare to find CPUs alive and working that are old enough to not support at least SSE2.

For systems that do support it, this can be more performant than div (I see about a 25% performance benefit using DivMod10 on Haswell, for example), and you get the remainder. Not as fast as a native 64-bit IMUL, but still quite useful.


To address Peter's comments, consider the pure x86 version :

procedure DivMod10(num : Cardinal; var q, r : Cardinal);
const
  m : cardinal = 3435973837;
asm
  push eax
  push edx
  mul m
  mov eax, edx
  shr eax, 3
  pop edx
  mov [edx], eax
  pop eax
  imul edx, [edx], -$A
  add edx, eax
  mov [ecx], edx
end;

which produces (for me - Haswell i7) :

x86 ASM : 2948
div : 3401
Test correctness...
Test complete.

Which is about 18% slower than the SSE version.


With some good ideas from Peter, we can optimize the pure x86 version a bit further, saving a register by converting to a function and replacing the immediate imul with lea and add :

function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
const
  m : cardinal = 3435973837;
asm
  mov ecx, eax           {save num to ecx}
  push edx               {save quotient pointer}
  mul m                  {edx:eax = m*Num}
  shr edx, 3             {edx = quotient}
  pop eax                {restore quotient pointer}
  mov [eax], edx         {store quotient}
  mov eax, ecx           {restore num to eax}
  lea ecx, [edx +4*edx]  {ecx = q*5}
  add ecx, ecx           {ecx = q*10}
  sub eax, ecx           {return remainder in eax}
end;

This gets execution time (same conditions as above) down to 2637ms, but still not quite as quick as the SSE version. The imul to lea optimization is minor and optimizes latency over throughput - this can be applied to all algorithms depending on the end use environment.

J...
  • 30,968
  • 6
  • 66
  • 143
  • scalar integer `mul` gives you the high half in EDX, is lower latency, and doesn't need to round-trip the data from integer to XMM regs. (And why do you use `movd` for one direction, but store/reload for the other? And if you are, why not fold the load with `imul edx, [edx], -10`? x86 imul-immediate has a separate source and dest (http://felixcloutier.com/x86/IMUL.html), unlike most immediate instructions which use the 3-bit `/r` source field as extra opcode bits). – Peter Cordes Oct 29 '18 at 17:08
  • Anyway, this looks strictly worse than the code in my answer on this question, on all CPUs. (edit welcome to add or change one of the code blocks into copy-pastable Delphi syntax.) – Peter Cordes Oct 29 '18 at 17:11
  • @PeterCordes I'd done this before and remember thinking `mul` was slower for 32-bit. I've tested and this seems to be the case, but not by much. `movd` is used for moves in and `movss` for moves out because of operand size issues. I'm not sure what you'd suggest otherwise? You are correct that `imul` with an immediate for edx recycling would be faster. I'm updating this shortly... – J... Oct 29 '18 at 17:25
  • `mul r32` on Haswell is 3 uops with 4c latency. (https://agner.org/optimize/). Amusingly slower than `mul r64`, I guess because the multiply unit naturally produces 128-bit output and it has to split the low half with an extra uop, as well as write to both EAX and EDX. PMULUDQ is 1 uop with 5c latency, but the extra uops you spend on data transfer balance it out. Anyway, I would have done `movss [edx], xmm0` / `movd edx, xmm0` to store the quotient then overwrite the pointer. (or `movd` for both, but yes `movss` is one byte shorter and is also a dword store) – Peter Cordes Oct 29 '18 at 17:30
  • @PeterCordes Yes, 32x32->64 imul is slow in 32-bit mode. `movd edx, xmm0` is very slightly (~1%) faster... I'll take it. Thanks. – J... Oct 29 '18 at 17:36
  • Your pure-integer version has poor latency because it uses push/pop to store/reload the inputs, so there's store-forwarding latency on the critical path, probably higher latency than the `mul` / `shr`. You could `mov` to ECX because you still have it free. This is why compilers (and the other answers) save / restore EBX or ESI and mov to/from it. Also, your integer version uses `imul`-immediate instead of LEA+ADD to multiply by 10, so it competes with `mul` for port 1, vs. `pmuluq` using port 0. Probably not a factor; the front-end is a bigger bottleneck than port 1 with this many uops. – Peter Cordes Oct 29 '18 at 19:06
  • I updated my answer with Delphi versions of the asm. They will hopefully benchmark faster than your version. They're optimized for latency, thus using LEA+ADD instead of `imul`-immediate to multiply by 10. – Peter Cordes Oct 29 '18 at 19:08
  • @PeterCordes `ECX` is not free, the pointer to `r` is passed in ECX (register convention). Your version saves one push/pop (pure x86 ASM) but runs within 1% of the same time as mine. (2918ms vs. 2948). It *is* a very small amount faster, but effectively you could consider it nearly identical. – J... Oct 29 '18 at 19:15
  • I noticed your use of ECX too late to edit my last comment. Interesting results, thanks for testing. It looks like if throughput is what matters, yeah it's worth using SSE2 on Intel just because of having more scratch registers, and avoiding 3-uop `mul` in 32-bit code. If part of a long dependency chain, it has worse latency, so it depends on the use-case. – Peter Cordes Oct 29 '18 at 19:19
  • @PeterCordes Yeah, I tried a lot of variations with `lea` instead of `imul`, returning the remainder in `eax` and freeing up `ecx`... nothing was faster than the code above, as you say, in a pure throughput type of test. I feel that overall the differences between most of these methods will be small in practice. As always with optimization, it's really up to OP to decide what "fast enough" is and how far to optimize for specific platforms or workloads. – J... Oct 29 '18 at 19:52
  • Your last `add eax, ecx {return remainder in eax}` needs to be a `sub`, otherwise you could use `lea eax, [ecx + eax*2]`. You're multiplying by 10, not -10 like you were with `imul`. (Using `imul`-immediate may be a throughput win, at the cost of an extra cycle of latency. Or worse on AMD Bulldozer-family.) – Peter Cordes Oct 29 '18 at 19:52
  • And yeah, if the OP wants more performance, fold more stuff into the asm to optimize it all. (Or into a C function for a good compiler to optimize). If the Delphi compiler is as bad as it sounds, writing whole loops in asm is more likely to be a win, or letting a C compiler make them for you. – Peter Cordes Oct 29 '18 at 19:56
  • @PeterCordes re: the `sub` - yes, my bad. Fixed, thanks. – J... Oct 29 '18 at 20:38
1

Gain does exist, but is it significant for real tasks? (note I've changed arguments types)

procedure DivMod10(Dividend: DWord; var Result, Remainder: DWord);
asm
        PUSH    EDI
        PUSH    ESI
        MOV     EDI, EDX
        MOV     ESI, 10
        XOR     EDX, EDX
        DIV     ESI
        MOV     [ECX], EDX
        MOV     [EDI], EAX
        POP     ESI
        POP     EDI
end;

  1 000 000 000 iterations
  divmod10: 4539
  math.divmod: 7145

The fastest way for limited range - Delphi code using multiplication as @Peter Cordes proposed. Assembly code is slower (1777) perhaps due to function call and my weak assembly experience.

  b := a * $CCCD;
  b := b shr 19;   //result
  c := a - b * 10;  //remainder
  1 000 000 000 iterations: 1200 ms  (560 ms without remainder)

Using constant from this SO answer allows to get rid off shifts, but time is worse that Peter's and J's versions:

function DM10(Dividend: DWord;  var Remainder: DWord): DWord;
asm
   push ebx
   mov ebx, eax
   mov ecx, edx
   mov edx, 1999999Ah
   mul eax, edx
   push edx
   lea eax, [edx+edx*4]
   add eax, eax
   sub ebx, eax
   mov [ecx], ebx
   pop eax
   pop ebx
end;

Timings for my machine (10^9 iterations, haswell i5-4670):
this  DM10               2013
Peter Cordes DivMod10    1755
J... SSE version         1685
MBo
  • 77,366
  • 5
  • 53
  • 86
  • Tempted to downvote for failing to use [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935) to optimize for a constant divisor. Any C compiler will do *much* better. https://godbolt.org/z/9a0mW5 – Peter Cordes Oct 29 '18 at 05:45
  • Interesting idea to use a multiplicative inverse in Delphi source. That's great as long as it's ok to limit the input range to 16-bit, unlike the OP's code where the dividend is a 32-bit integer. (But `div bx` would fault if dividend / 10 overflowed 16-bit AX, so that's close to safe if it was intended.) – Peter Cordes Oct 29 '18 at 06:29
  • @Peter: is possible in Delphi for Cardinal (= UInt32) too, but then you must expand the expression to UInt64, which is slow again (in 32 bit Delphi). Works great in 64 bit Delphi, though. – Rudy Velthuis Oct 29 '18 at 14:06
  • Note that the answer you linked is not correct. Compilers don't do that, because it doesn't work for the full range of possible inputs. See my comment on that answer [Divide by 10 using bit shifts?](https://stackoverflow.com/posts/comments/80585843). But if limited range is ok, then yes that's a good choice. (works for positive integers up to `107374183`, rounds incorrectly for negative inputs. I didn't test an unsigned version) – Peter Cordes Oct 30 '18 at 07:52
  • @Peter Cordes Yes, I must add this warning. In any case, function is slower than best ones. – MBo Oct 30 '18 at 07:58
1

Ok, here is my try:

procedure DivMod10(Num: Cardinal; var Quotient, Remainder: Cardinal);
asm
        PUSH    ESI
        PUSH    EDI
        MOV     EDI,EAX          // Num
        MOV     ESI,EDX          // @Quotient
        MOV     EDX,$CCCCCCCD    
        MUL     EDX              // EDX:EAX = EAX * magic_number
        SHR     EDX,3
        MOV     [ESI],EDX        // --> @Quotient
        LEA     EDX,[EDX+4*EDX]
        ADD     EDX,EDX          // Quotient * 10 
        SUB     EDI,EDX          // Num - Quotient*10
        MOV     [ECX],EDI        // --> @Remainder 
        POP     EDI
        POP     ESI
end;

If you don't need the remainder:

function Div10(Num: Cardinal): Cardinal;
asm
        MOV     ECX,$CCCCCCCD    
        MUL     ECX   
        SHR     EDX,3
        MOV     EAX,EDX
end;
Rudy Velthuis
  • 28,387
  • 5
  • 46
  • 94
  • 32 bit Windows only I guess. What's wrong with a Pascal implementation? That could even be inlined. – David Heffernan Oct 29 '18 at 15:31
  • @David: Because a 32 bit Pascal implementation either uses simple div and mod, which is slow, or it needs UInt64, which is slow too (much register shuffling, pushing & popping, calling __llmul, etc.). In Win32, because EAX:EDX contain the 64 bit result, and you need the top of the 64 bits, you can be much faster in asm. In Win64, it would be much easier, even in PUREPASCAL. – Rudy Velthuis Oct 29 '18 at 16:05
  • But if you cared about performance you wouldn't be writing 32 bit code – David Heffernan Oct 29 '18 at 16:06
  • @David: Depends. I usually write my (library) code thus that it can be used in Win32 and Win64, but also for the Mac (32 bit only, at the moment) and even the other platforms, if that makes sense. If this was for my purposes only, and if it had to be performant, I would indeed use 64 bit. – Rudy Velthuis Oct 29 '18 at 16:46
  • @David: it may surprise you, but there are still people using Win32. This computer, here in my clinic, and 5 of the others, still run on a 32 bit Windows Professional. Only a few run Win64 Pro. And the 32 bitters are perhaps 3 years old. The (extremely extensive and extremely expensive) software I run here is still 32 bit too. – Rudy Velthuis Oct 29 '18 at 16:50
  • @Davbid: just think of all those who still use D7. – Rudy Velthuis Oct 29 '18 at 16:53
  • If you care about performance, you target 64 bit, like I said. For sure make it cross platform. You didn't do that here. – David Heffernan Oct 29 '18 at 16:55
  • 1
    I care about performance on all platforms. If an entire program is dedicated to performance, then indeed, I would choose a fast platform. But library code should be as fast as can be too, no matter on which platform it is used. This DivMod10 looks as if it might be used to convert integers to text, perhaps even large integers. I have some experience with that. And my code runs as fast as possible on all platforms, either in Pure Pascal, or in assembler. – Rudy Velthuis Oct 29 '18 at 17:02
  • The code in this answer only works in 32 bit Windows, perhaps 32 bit Mac, not sure what the abi is there. – David Heffernan Oct 29 '18 at 17:16
  • EDX is not an input to `mul` ([When and why do we sign extend and use cdq with mul/div?](https://stackoverflow.com/q/36464879)). You can use it to hold the `$CCCCCCCD` constant, like compilers do. This would let you avoid save/restore of EBX. (Does inline asm implicitly clobber EAX,EDX, and EDX, but you need to save/restore other regs yourself? Like it's a function? Or is this a stand-alone function with a `ret` added by the compiler?) You might want to use the compiler output from my answer. – Peter Cordes Oct 29 '18 at 17:18
  • @PeterCordes: you are right. EDX contains the address of the quotient, so it must be saved anyway. But I don't need EBX. I'll change the code. – Rudy Velthuis Oct 29 '18 at 17:34
  • @PeterCordes: the assembler adds the RET and, if required, a prolog and epilog (but not required here). – Rudy Velthuis Oct 29 '18 at 17:39
  • @DavidHeffernan: I added an update to my answer with `{$IFDEF CPUX64}` and `{$IFDEF CPUX86}`, with a function version of divmod10 which returns the remainder, saving instructions and latency. Can you guys that actually know Delphi do a sanity check on it? – Peter Cordes Oct 29 '18 at 18:53
  • @J... Not really. There is only a prolog/epilog if it is needed, i.e. to accomodate for declared temporary variables. Otherwise, there is none. But you can't inline them, indeed. – Rudy Velthuis Oct 30 '18 at 07:40