efficient way to divide a very large number stored in 2 registers by a constant

Question

Let's say I want to calculate the following:

A/Z

Where A is of length 128 bit and Z is 64 bit long. A is stored in 2 64 bit registers since the registers of the system can store up to 64 bits. What would be an efficient way to calculate the result?

P.S: I've solved similar multiplication problems by using CSD representations. However, this would require calculating 1/Z first.

I'm open to all suggestions. But I would prefer a solution that would stick to adding, subtracting an shifting. — H_squared, Oct 14 '13 at 09:32
See [this question](http://stackoverflow.com/questions/4771823/64-32-bit-division-on-a-processor-with-32-16-bit-division) — Antti Huima, Oct 14 '13 at 09:53
@AnttiHuima Thanks. That helps a lot. Currently checking the hacker's delight multiword division. — H_squared, Oct 14 '13 at 10:03
There might be something, but I don't know of anything better with just adding/shifting.. — harold, Oct 14 '13 at 10:10
I just added an answer for you. If you want to use binary/long or aproximation division let me know but i think sub division on half the bit size is the best option for you (unless you want to go to bigints in the future) — Spektre, Oct 15 '13 at 12:24
[Unsigned 128-bit division on 64-bit machine](https://stackoverflow.com/q/1870158/995714), [128-bit division intrinsic in Visual C++](https://stackoverflow.com/q/8453146/995714) — phuclv, Aug 11 '18 at 04:50

Spektre · Answer 1 · 2021-03-07T13:40:31.703

[Edit1] spotted bug repaired

I assume you want integer division so here is the math for 8bit analogy:

A = { a0 + (a1<<8) }
D = { d0 + (d1<<8) } ... division result
Z = { z0 }
D = (a0/z0) + ((a1*256)/z0) +                      (( (a0%z0) + ((a1*256)%z0) )/z0);
D = (a0/z0) + ((a1/z0)*256) + ((a1%z0)*(256/z0)) + (( (a0%z0) + ((a1%z0)*(256%z0)) )/z0);

Now the terms 256/z0 and 256%z0 can be computed like this (C++):

    i0=0xFF/z0; if ((z0&(z0-1))==0) i0++;   // i0 = 256/z0
    i1=i0*z0; i1^=0xFF; i1++;               // i1 = 256%z0

So the i0 is just incremented in case the z0 is power of 2, and i1 is just remainder computed from the division.

    a/b = d + r/b
    r = a - a*d

Here tested 8bit code:

//---------------------------------------------------------------------------
// unsigned 8 bit ALU in C++
//---------------------------------------------------------------------------
BYTE cy;                                                    // carry flag cy = { 0,1 }
void inc(BYTE &a);                                          // a++
void dec(BYTE &a);                                          // a--
void add(BYTE &c,BYTE a,BYTE b);                            // c = a+b
void adc(BYTE &c,BYTE a,BYTE b);                            // c = a+b+cy
void sub(BYTE &c,BYTE a,BYTE b);                            // c = a-b
void sbc(BYTE &c,BYTE a,BYTE b);                            // c = a-b-cy
void mul(BYTE &h,BYTE &l,BYTE a,BYTE b);                    // (h,l) = a/b
void div(BYTE &h,BYTE &l,BYTE &r,BYTE ah,BYTE al,BYTE b);   // (h,l) = (ah,al)/b ; r = (ah,al)%b
//---------------------------------------------------------------------------
void inc(BYTE &a) { if (a==0xFF) cy=1; else cy=0; a++; }
void dec(BYTE &a) { if (a==0x00) cy=1; else cy=0; a--; }
void add(BYTE &c,BYTE a,BYTE b)
    {
    c=a+b;
    cy=BYTE(((a &1)+(b &1)   )>>1);
    cy=BYTE(((a>>1)+(b>>1)+cy)>>7);
    }
void adc(BYTE &c,BYTE a,BYTE b)
    {
    c=a+b+cy;
    cy=BYTE(((a &1)+(b &1)+cy)>>1);
    cy=BYTE(((a>>1)+(b>>1)+cy)>>7);
    }
void sub(BYTE &c,BYTE a,BYTE b)
    {
    c=a-b;
    if (a<b) cy=1; else cy=0;
    }
void sbc(BYTE &c,BYTE a,BYTE b)
    {
    c=a-b-cy;
    if (cy) { if (a<=b) cy=1; else cy=0; }
    else    { if (a< b) cy=1; else cy=0; }
    }
void mul(BYTE &h,BYTE &l,BYTE a,BYTE b)
    {
    BYTE ah,al;
    h=0; l=0; ah=0; al=a;
    if ((a==0)||(b==0)) return;
    // long binary multiplication
    for (;b;b>>=1)
        {
        if (BYTE(b&1))
            {
            add(l,l,al);    // (h,l)+=(ah,al)
            adc(h,h,ah);
            }
        add(al,al,al);      // (ah,al)<<=1
        adc(ah,ah,ah);
        }
    }
void div(BYTE &d1,BYTE &d0,BYTE &r,BYTE a1,BYTE a0,BYTE z0)
    {
    //      D = (a0/z0) + ((a1*256)/z0) +                      (( (a0%z0) + ((a1*256)%z0) )/z0);
    //      D = (a0/z0) + ((a1/z0)*256) + ((a1%z0)*(256/z0)) + (( (a0%z0) + ((a1%z0)*(256%z0)) )/z0);
    // edge cases
    if (z0==0){ d0= 0; d1= 0; r=0; }
    if (z0==1){ d0=a0; d1=a1; r=0; }
    // normal division
    if (z0>=2)
        {
        BYTE i0,i1,e0,e1,f0,f1,t,dt;
        i0=0xFF/z0; if ((z0&(z0-1))==0) i0++;   // i0 = 256/z0
        i1=i0*z0; i1^=0xFF; i1++;               // i1 = 256%z0
        t=a1%z0;
        mul(e1,e0,t,i0);                        // e = (a1%z0)*(256/z0)
        mul(f1,f0,t,i1);                        // f = (a1%z0)*(256%z0)
        add(f0,f0,a0%z0);                       // f = (a0%z0) + (a1%z0)*(256%z0)
        adc(f1,f1,0);
        add(d0,a0/z0,e0);
        adc(d1,a1/z0,e1);
        // t = division of problematic term by z0
        t=0;
        for (;f1;)
            {
            dt=f1*i0;
            mul(e1,e0,dt,z0);
            sub(f0,f0,e0);
            sbc(f1,f1,e1);
            t+=dt;
            }
        if (f0>=z0) t+=f0/z0;
        // correct output
        add(d0,d0,t);
        adc(d1,d1,0);
        // remainder
        r=d0*z0;
        r=a0-r;
        }
    }
//---------------------------------------------------------------------------

The 8bit ALU is not optimized at all I just busted it to test it right now as original project is nowhere to found... I assume you are doing it in asm so you use can use CPU/ALU instructions carry instead. The only important function is the div.

Notes:

This is only 8 bit. To convert it to 64 bit just change all 0xFF to 0xFFFFFFFFFFFFFFFF and BYTE to your data type and <<8 to <<64.

Division result is in d0, d1 and remainder is in r Code does not handle negative values.

Sadly the term:

(( (a0%z0) + ((a1%z0)*(256%z0)) )/z0);

in its current state requires also 16 bit division (not full though as result is not arbitrary instead a composite of two mod z0 values). I managed to avoid long division by few (for 16bit:8bit is the worst case 7) iterations. However my guts are telling me it should be computed simpler using some modular math identity I do not know or cant think of right now. This makes this division relatively slow.

`(a0/z0) + ((a1<<64)/z0)` this won't work, because dividing the most significant byte by a constant might deliver an irrational number. Let's say `a1=1` and `z0=3`, then `a1/z0 = 0.333333333---` , in this case no matter how many times you shift this number, you will always end up with the wrong answer. Check my answer below — H_squared, Oct 18 '13 at 10:54
My approach is 100% working and tested ... for unsigned integers. No issues at all... if it should handle fractional parts then it should be mentioned somewhere (for example floating/fixed point division and not that values are 128-bit and in 64-bit registers which usually implies integer arithmetics) — Spektre, Oct 18 '13 at 19:34
have it tested on 8+8/8 bit , 16+16/16 bit all combinations without any errors,... and it uses only 3 half bits divisions and 2 full bits (or 4 half bits) additions no loops as you can see — Spektre, Oct 18 '13 at 19:40
Fully tested, really? With a 8-bits implementation, I get 2 instead of 4 for 0x0200 / 0x80 (and 0x20000000ULL / 0x80000000UL, for the 32-bits implementation), and 6 instead of 8 for 0x0304 / 0x56. But lot of others combinations are correct (including your 0x123456789ABCDEF0ULL / 0x23, which returns 0xC297AE99UL for d0), so, my C implementation of your code seems correct. — VTiTux, Mar 06 '21 at 20:34
@VTiTux +1 nice catch not sure what happened back then but most likely my tests (have tested all 256^3 combinations) had some false positives (did happen to me before for fast SQRT without multiplication due naming conflicts with programing environment buildints). Can not find original project anymore ... I already found some discrepancy ... will update once resolved (almost there). From what I found out if `z0` is power of 2 the result of `(0xFFFFFFFF/z0)` must be incremented however there is still some discrepancy left — Spektre, Mar 07 '21 at 07:20
When I look at `((a1<<64)/z0)`, I see the initial entry: a 128-bits/64-bits, with only 64-bits registers. Isn't the probem there? — VTiTux, Mar 07 '21 at 12:07
@VTiTux no I just finished debugging. I was missing one modulo remainder therm ... and most likely got the same issue with conflicting names with hidden function inside my IDE calling different function then written in code causing false positive results. Will edit shortly ... however computing the missing term is horribly slow ... probably can be speed up a lot by some modular identity I do not see. and also there was the mising increment for the `256/z0` therms computed by `255/z0` ... — Spektre, Mar 07 '21 at 12:52

H_squared · Accepted Answer · 2013-10-18T11:58:40.907

The right way to solve such a problem, is by returning to the basics:

divide the most significant register by the denominator
calculate the quotient Q and the rest R
define a new temporary register preferrably with the same length as the other 2
the rest should occupy the most significant bits in the temporary register
shift the lesser significant register to the right by the same amount of bits contained iR and add to the result to the temporary register.
go back to step 1

after the division, the resulting rest must be casted to double, divided by the denominator then added to the quotient.

efficient way to divide a very large number stored in 2 registers by a constant

2 Answers2

Linked