I was curious about the differences in performance when dividing two values with varying numbers of value bits but a constant operand (register) width. I expected the performance to depend on the highest set bit of the dividend, since I assume an iterative subtract-and-shift "algorithm" inside the CPU that may be doing multiple iterations in one clock cycle.

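For reference, here is the kind of shift-subtract long division I had in mind, as a minimal C++ sketch (purely illustrative, one quotient bit per iteration; a real divider retires more bits per cycle):

#include <bit>        // std::countl_zero (C++20)
#include <cstdint>

// Restoring shift-subtract division: the iteration count depends on the
// position of the highest set bit of the dividend.
uint64_t shiftSubDiv( uint64_t dividend, uint64_t divisor )
{
    uint64_t quotient = 0, remainder = 0;
    for( int bit = 63 - std::countl_zero( dividend | 1 ); bit >= 0; --bit )
    {
        remainder = remainder << 1 | (dividend >> bit & 1);
        if( remainder >= divisor )
        {
            remainder -= divisor;
            quotient |= uint64_t(1) << bit;
        }
    }
    return quotient;
}
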
So I wrote a little C++20 program that tests this:

#include <iostream>
#include <climits>
#include <type_traits>
#include <cstdint>
#include <random>
#include <limits>
#include <atomic>
#include <chrono>
#include <vector>
#include <sstream>

using namespace std;
using namespace chrono;

int main()
{
    constexpr size_t ROUNDS = 100'000'000;
    auto ssp = []<typename TOp, typename TValue>() -> double
        requires is_unsigned_v<TOp> && is_unsigned_v<TValue> && (sizeof(TOp) >= sizeof(TValue))
    {
        constexpr size_t N = 0x1000;
        vector<TOp> vDividends( N ), vDivisors( N );
        mt19937_64 mt;
        uniform_int_distribution<unsigned> uidBits( 0, sizeof(TValue) * CHAR_BIT - 1 );
        uniform_int_distribution<uint64_t> uidValue( 0, numeric_limits<TValue>::max() );
        auto getMaxBitsValue = [&]() -> TOp
        {
            for( TValue value; ; )
                if( (make_signed_t<TValue>)(value = (TValue)uidValue( mt )) < 0 )
                    return value;
        };
        auto getDividend = [&]() -> TOp
        {
            return getMaxBitsValue() >> uidBits( mt );
        };
        auto getDivisor = [&]( TOp dividend ) -> TOp
        {
            for( TOp divisor; ; )
                if( (divisor = getMaxBitsValue() >> uidBits( mt )) <= dividend )
                    return divisor;
        };
        for( size_t i = 0; i != N; ++i )
            vDivisors[i] = getDivisor( (vDividends[i] = getDividend()) );
        TOp sum = 0;
        atomic<TOp> aSum( 0 );
        auto start = high_resolution_clock::now();
        for( size_t r = ROUNDS; r--; )
            sum += vDividends[r % N] / vDivisors[r % N];
        double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)ROUNDS;
        aSum = sum;   // sink so the compiler can't discard the loop
        return ns;
    };
    auto results = []( double ns, double nsRef )
    {
        ostringstream oss;
        oss << ns;
        if( nsRef > 0.0 )
            oss << " (" << (int)(ns / nsRef * 100.0 + 0.5) << "%)";
        return oss.str();
    };
    double nsRef;
    cout << "8v, 8op: " << results( (nsRef = ssp.template operator ()<uint8_t, uint8_t>()), 0.0 ) << endl;
    cout << "8v, 16op: " << results( ssp.template operator ()<uint16_t, uint8_t>(), nsRef ) << endl;
    cout << "8v, 32op: " << results( ssp.template operator ()<uint32_t, uint8_t>(), nsRef ) << endl;
    cout << "8v, 64op: " << results( ssp.template operator ()<uint64_t, uint8_t>(), nsRef ) << endl;
    cout << "16v, 16op: " << results( ssp.template operator ()<uint16_t, uint16_t>(), nsRef ) << endl;
    cout << "16v, 32op: " << results( ssp.template operator ()<uint32_t, uint16_t>(), nsRef ) << endl;
    cout << "16v, 64op: " << results( ssp.template operator ()<uint64_t, uint16_t>(), nsRef ) << endl;
    cout << "32v, 32op: " << results( ssp.template operator ()<uint32_t, uint32_t>(), nsRef ) << endl;
    cout << "32v, 64op: " << results( ssp.template operator ()<uint64_t, uint32_t>(), nsRef ) << endl;
    cout << "64v, 64op: " << results( ssp.template operator ()<uint64_t, uint64_t>(), nsRef ) << endl;
}

Apart from slight measurement errors, the results are the same for a given number of value bits, regardless of the operand (register) width. But what is the CPU-architectural reason why dividing a 64-bit value by another 64-bit value takes only about 50% more time than dividing an 8-bit value by another 8-bit value on my computer (Ryzen Threadripper 3990X)? Even on my Linux computer equipped with a Ryzen 7 1800X, the ratios are roughly the same.

I post this under "x86", "x86-64" and "assembly" and not under C++ since I'm not asking about a C++ issue but about a machine-level issue.

[EDIT]: This is a newer version of the code for MSVC++ with the division loop serialized in assembly (MASM):

#include <iostream>
#include <climits>
#include <type_traits>
#include <cstdint>
#include <random>
#include <limits>
#include <atomic>
#include <chrono>
#include <vector>
#include <sstream>

using namespace std;
using namespace chrono;

uint64_t divLoop( uint64_t *dividends, uint64_t *divisors, size_t n, size_t rounds );
uint32_t divLoop( uint32_t *dividends, uint32_t *divisors, size_t n, size_t rounds );
uint16_t divLoop( uint16_t *dividends, uint16_t *divisors, size_t n, size_t rounds );
uint8_t divLoop( uint8_t *dividends, uint8_t *divisors, size_t n, size_t rounds );

int main()
{
    constexpr size_t ROUNDS = 100'000'000;
    auto ssp = []<typename TOp, typename TValue>( TOp const &, TValue const & ) -> double
        requires is_unsigned_v<TOp> && is_unsigned_v<TValue> && (sizeof(TOp) >= sizeof(TValue))
    {
        constexpr size_t N = 0x1000;
        vector<TOp> vDividends( N ), vDivisors( N );
        mt19937_64 mt;
        uniform_int_distribution<unsigned> uidBits( 0, sizeof(TValue) * CHAR_BIT - 1 );
        uniform_int_distribution<uint64_t> uidValue( 0, numeric_limits<TValue>::max() );
        auto getMaxBitsValue = [&]() -> TOp
        {
            for( TValue value; ; )
                if( (make_signed_t<TValue>)(value = (TValue)uidValue( mt )) < 0 )
                    return value;
        };
        auto getDividend = [&]() -> TOp
        {
            return getMaxBitsValue() >> uidBits( mt );
        };
        auto getDivisor = [&]( TOp dividend ) -> TOp
        {
            for( TOp divisor; ; )
                if( (divisor = getMaxBitsValue() >> uidBits( mt )) <= dividend )
                    return divisor;
        };
        for( size_t i = 0; i != N; ++i )
            vDivisors[i] = getDivisor( (vDividends[i] = getDividend()) );
        atomic<TOp> aSum( 0 );
        auto start = high_resolution_clock::now();
        TOp sum = divLoop( &vDividends[0], &vDivisors[0], N, ROUNDS );
        double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)ROUNDS;
        aSum = sum;   // sink so the timed call can't be optimized away
        return ns;
    };
    auto results = []( double ns, double nsRef )
    {
        ostringstream oss;
        oss << ns;
        if( nsRef > 0.0 )
            oss << " (" << (int)(ns / nsRef * 100.0 + 0.5) << "%)";
        return oss.str();
    };
    double nsRef;
    cout << "8v, 8op: " << results( (nsRef = ssp( uint8_t(), uint8_t() )), 0.0 ) << endl;
    cout << "8v, 16op: " << results( ssp( uint16_t(), uint8_t() ), nsRef ) << endl;
    cout << "8v, 32op: " << results( ssp( uint32_t(), uint8_t() ), nsRef ) << endl;
    cout << "8v, 64op: " << results( ssp( uint64_t(), uint8_t() ), nsRef ) << endl;
    cout << "16v, 16op: " << results( ssp( uint16_t(), uint16_t() ), nsRef ) << endl;
    cout << "16v, 32op: " << results( ssp( uint32_t(), uint16_t() ), nsRef ) << endl;
    cout << "16v, 64op: " << results( ssp( uint64_t(), uint16_t() ), nsRef ) << endl;
    cout << "32v, 32op: " << results( ssp( uint32_t(), uint32_t() ), nsRef ) << endl;
    cout << "32v, 64op: " << results( ssp( uint64_t(), uint32_t() ), nsRef ) << endl;
    cout << "64v, 64op: " << results( ssp( uint64_t(), uint64_t()  ), nsRef ) << endl;
}

MASM:

; uint64_t divLoop( uint64_t *dividends, uint64_t *divisors, size_t n, size_t rounds );
PUBLIC ?divLoop@@YA_KPEA_K0_K1@Z
; uint32_t divLoop( uint32_t *dividends, uint32_t *divisors, size_t n, size_t rounds );
PUBLIC ?divLoop@@YAIPEAI0_K1@Z
; uint16_t divLoop( uint16_t *dividends, uint16_t *divisors, size_t n, size_t rounds );
PUBLIC ?divLoop@@YAGPEAG0_K1@Z
; uint8_t divLoop( uint8_t *dividends, uint8_t *divisors, size_t n, size_t rounds );
PUBLIC ?divLoop@@YAEPEAE0_K1@Z

_TEXT SEGMENT

; rcx = dividends
; rdx = divisors
; r8 = n
; r9 = rounds

?divLoop@@YA_KPEA_K0_K1@Z PROC
    push    r12
    push    r13
    mov     r10, rdx
    sub     r13, r13
outerLoop:
    mov     r11, r8
    cmp     r8, r9
    cmova   r11, r9
    sub     r11, 1
    jc      byebye
innerLoop:
    mov     rax, [rcx + r11 * 8]    ; low half of the dividend
    mov     r12, [r10 + r11 * 8]    ; divisor
    mov     rdx, r13                ; high half = 0, but data-dependent on the previous quotient
    div     r12
    mov     r13, rax                ; copy the quotient
    and     r13, 0                  ; r13 = 0 again, but AND keeps the dependency -> divisions serialized
    sub     r11, 1
    jnc     innerLoop
    sub     r9, r8
    ja      outerLoop
byebye:
    pop     r13
    pop     r12
    ret
?divLoop@@YA_KPEA_K0_K1@Z ENDP

?divLoop@@YAIPEAI0_K1@Z PROC
    push    r12
    push    r13
    mov     r10, rdx
    sub     r13, r13
outerLoop:
    mov     r11, r8
    cmp     r8, r9
    cmova   r11, r9
    sub     r11, 1
    jc      byebye
innerLoop:
    mov     eax, [rcx + r11 * 4]
    mov     r12d, [r10 + r11 * 4]
    mov     edx, r13d
    div     r12d
    mov     r13d, eax
    and     r13d, 0
    sub     r11, 1
    jnc     innerLoop
    sub     r9, r8
    ja      outerLoop
byebye:
    pop     r13
    pop     r12
    ret
?divLoop@@YAIPEAI0_K1@Z ENDP

?divLoop@@YAGPEAG0_K1@Z PROC
    push    r12
    push    r13
    mov     r10, rdx
    sub     r13, r13
outerLoop:
    mov     r11, r8
    cmp     r8, r9
    cmova   r11, r9
    sub     r11, 1
    jc      byebye
innerLoop:
    mov     ax, [rcx + r11 * 2]
    mov     r12w, [r10 + r11 * 2]
    mov     dx, r13w
    div     r12w
    mov     r13w, ax
    and     r13w, 0
    sub     r11, 1
    jnc     innerLoop
    sub     r9, r8
    ja      outerLoop
byebye:
    pop     r13
    pop     r12
    ret
?divLoop@@YAGPEAG0_K1@Z ENDP

?divLoop@@YAEPEAE0_K1@Z PROC
    push    r12
    push    r13
    mov     r10, rdx
    sub     r13, r13
outerLoop:
    mov     r11, r8
    cmp     r8, r9
    cmova   r11, r9
    sub     r11, 1
    jc      byebye
innerLoop:
    mov     al, [rcx + r11]
    mov     r12b, [r10 + r11]
    mov     dl, r13b
    mov     ah, dl
    div     r12b
    mov     r13b, al
    and     r13b, 0
    sub     r11, 1
    jnc     innerLoop
    sub     r9, r8
    ja      outerLoop
byebye:
    pop     r13
    pop     r12
    ret
?divLoop@@YAEPEAE0_K1@Z ENDP

_TEXT ENDS

END

Now this code nicely demonstrates that division time doesn't depend on the register width but on the number of significant bits of the values. But I'm still asking myself why a 64-by-64 division (operand and register width) is only about a third slower than an 8-by-8 division.

  • How does this maze of templates and lambdas and parameterized types actually compile with the compiler + options you used? I assume it compiles to `div r64` in a loop, but I wasted about 5 or 10 minutes trying to follow this before just looking at it on Godbolt. https://godbolt.org/z/13YnGKPq9 - clang -O3 optimizes away the loop to an empty timed region, but GCC -O3 does seem to use different operand-sizes of `div` inside a loop that's not too bloated, e.g. the `%N` compiles down to just an AND, in front of `movzx eax, BYTE PTR [rdi+rcx]` ; `div BYTE PTR [rsi+rcx]` for the u8 case. – Peter Cordes Mar 10 '22 at 09:18

1 Answer

I assume your builds were similar to GCC11.2 -O3 -std=gnu++20 https://godbolt.org/z/13YnGKPq9, making a loop without much work other than load and div. So the results are realistic, not a bigger difference hidden behind loop overhead (div is slow enough to probably not bottleneck on memory).


You're only measuring div throughput, not latency. Latency varies with the size of something (maybe the dividend?) much more than throughput does, even on recent CPUs.

Modern CPUs often have near-constant throughput even when latency is still variable.
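
To measure latency instead, you'd make each division depend on the previous one, e.g. by feeding the previous quotient back into the next dividend. A minimal sketch against the timed loop of your first version (note this perturbs the dividends slightly, so the value-bit distribution is no longer exactly what you generated):

TOp chain = 0;
for( size_t r = ROUNDS; r--; )
    // the next dividend isn't ready until the previous quotient is,
    // so the divisions form one serial dependency chain
    chain = (TOp)((vDividends[r % N] | chain) / vDivisors[r % N]);
sum += chain;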

See also discussion in comments on What is integer division heavily used for? about improvements to HW dividers over time, although it's mostly about use-cases for division and the fact that modern CPUs have so high a transistor budget that they might as well throw some of it at a divide unit.


Your result is also expected from instruction-throughput measurements others have done.

  • https://uops.info/ tested 8-bit div r8 on Zen+ with both ax=0 / bl=1 and with ax=0x3301 / bl=123 and found no difference in throughput, always about 13 or 14 cycles. (Only latency differences, from 8 to 25 cycles; or for div r64, from 8 to 41 cycles of latency, depending on the values and on which input an output gets coupled back into.) Latency lower than throughput is only observable when other instructions in the dependency chain connect outputs back to inputs, and it still seems strange to me.

  • Agner Fog reports latency for div r64 of 14-46 and throughput of 14-45. uops.info confirms that, but for some reason didn't get the varying throughput into the main table. The 0 / 1 "fast division" did run at one per 14 cycles, but slow division of 0x343a9ed744556677:0000000000000085 / 0x75e6e44fccddeeff takes 43 cycles; a reproduction sketch follows this list. (The worst case with RDX=0 on input may be less bad; current C++ compilers never use div or idiv with an upper half that isn't zero-extended, except as part of extended-precision math, like unsigned __int128.)

    For AMD K10, he comments that div/idiv performance: "Depends on number of significant bits in absolute value of dividend. See AMD software optimization guide." I'm unsure if that's still how divide units scale in modern AMD and Intel CPUs.

  • InstLatx64 also tests a few different data inputs, and which input to couple an output back into for latency, e.g. for Zen1 Ryzen 7 1800X they found a 14c to 46c throughput range for div r64.
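
For reference, a minimal GNU C sketch of the kind of latency-chain test those sites run (my own naming, untested): the quotient feeds the next division's low half, while RDX and the divisor are reloaded with chosen values every iteration.

#include <cstddef>
#include <cstdint>

uint64_t divChainLatency( uint64_t hi, uint64_t lo, uint64_t divisor, size_t iters )
{
    uint64_t rax = lo;
    for( size_t i = 0; i != iters; ++i )
    {
        uint64_t rdx = hi;   // must stay below the divisor, or div faults (#DE)
        asm volatile( "div %2" : "+a"(rax), "+d"(rdx) : "r"(divisor) );
    }
    return rax;   // time this with fast inputs (0:1 / 1) vs. the slow values above
}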

  • The latency numbers for integer division on Zen2 *suggest* use of a bog-standard radix-4 SRT divider, as they are approximately 14 cycles of general overhead plus 1 cycle for every two bits of quotient: thus 16-bit `DIV` up to 22 cycles, 32-bit `DIV` up to 30 cycles, 64-bit `DIV` up to 46 cycles. For the improved integer division in Zen3, things are less clear. A higher-radix SRT divider is possible, but that would suggest throughput scales inversely with latency. If throughput improves more than latency does, that *may* suggest a switch to an iterative method. Can't find anything published. – njuffa Mar 11 '22 at 19:39
  • @njuffa: You're probably more familiar than I am with the details of designing an integer divide unit, and which parts might be pipelined, and how a lookup table for an initial guess fits in. There isn't really a good answer with a high-level overview of that all in one place; I've just seen hints of that in various answers and comments, other than [Why does hardware division take much longer than multiplication?](https://electronics.stackexchange.com/a/280709) which doesn't go into much detail about perf implications. Maybe you'd want to write such an answer here or a linked question. – Peter Cordes Mar 11 '22 at 19:46
  • 1
    SRT division is a fancier way of doing long-hand division like we all learned in school. Innovation there in recent years is mostly use of ever higher radices. Typically state machine, no pipelining. LUT needed (for guessing the next quotient 'digit') is small. Why large overhead? Possibly for shipping data to an FP divider and back? Don't know about microarchitecture of Zen2. Iterative methods would be using Newton-Raphson or Goldschmidt for computing reciprocal; may use fairly substantial LUTs for starting approximation (few cycles for lookup). Ops making up iteration pipelined. – njuffa Mar 11 '22 at 19:58
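
A hypothetical sketch of the latency model from the first comment above (the function name and the constants-as-model are my own framing, not published AMD figures):

// Zen2 model per the comment: a radix-4 SRT divider retires 2 quotient bits
// per cycle, plus roughly 14 cycles of fixed overhead.
int divLatencyModel( int operandBits )
{
    return 14 + operandBits / 2;   // 16 -> 22, 32 -> 30, 64 -> 46 cycles
}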