
I am playing with code from https://github.com/torvalds/linux/blob/master/lib/math/div64.c using https://godbolt.org/, targeting ARM Cortex-M7 microcontroller.

clang generates much more code (208 lines of assembly) compared to gcc (69 lines). I am using -O3, so some unrolling/expansion is expected, but I am not sure it is beneficial in this case.

I don't know whether it is helpful or not, but:
1) removing either the 1st or the 2nd loop makes clang's output similar to gcc's;
2) clang 5.0.0 generates much smaller code.

Here is the code (below there is also the link to configured example):

unsigned int div64_32(unsigned long long * n, unsigned int base)
{
    unsigned long long rem = *n;
    unsigned long long b = base;
    unsigned long long res, d = 1;
    unsigned int high = rem >> 32;

    /* Reduce the thing a bit first */
    res = 0;
    if (high >= base) {
        high /= base;
        res = (unsigned long long) high << 32;
        rem -= (unsigned long long) (high*base) << 32;
    }

    /* Shift b (and d) left until b is just below rem,
       stopping before the sign bit of b would be set */
    while ((long long)b > 0 && b < rem) {
        b = b+b;
        d = d+d;
    }

    /* Classic shift-and-subtract long division */
    do {
        if (rem >= b) {
            rem -= b;
            res += d;
        }
        b >>= 1;
        d >>= 1;
    } while (d);

    *n = res;
    return rem;
}

Options for both compilers:
ARM GCC 7.2.1 (none): -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=vfpv4
x86-x64 clang (trunk): -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=vfpv4 -target armv7m-none-eabi

And finally the link to example: link

Update
I ran tests on an NXP Kinetis KV58 at 240 MHz, with all code relocated to I-TCM.
Results, measured by oscilloscope:

        run 1     run 2
iar:   27.2 ms   17.6 ms
clang: 29.6 ms   19.2 ms
gcc:   22.1 ms   14.0 ms

run 1 used n *= 123 (see code below);
run 2 used n *= 127.

I didn't compile the clang and gcc code myself; I just copy&pasted the assembly from the website. I also had to change two instructions in the GCC output, because the assembler did not recognize them (see also the comments to the question):
ldrd r0, [r0] (line 9) => ldrd r0, r1, [r0]
strd r6, [ip] (line 55) => strd r6, r7, [r12]

void test()
{
    debug_pin(0, 0); // test pin 0: set LOW
    for (int i = 0; i < 1000; i++);

    debug_pin(0, 1); // test pin 0: set HIGH

    unsigned long long n = 54321, base = 123;
    for (int i = 0; i < 10000; i++)
    {
        base = div64_32_iar(&n, base);

        n += 65432;
        n *= 127; // 1st test was with 123
        base += 123;
    }
    debug_pin(0, 0);

    for (int i = 0; i < 1000; i++);

    debug_pin(0, 1);
    n = 54321, base = 123;
    for (int i = 0; i < 10000; i++)
    {
        base = div64_32_clang(&n, base);

        n += 65432;
        n *= 127; // 1st test was with 123
        base += 123;
    }
    debug_pin(0, 0);

    for (int i = 0; i < 1000; i++);

    debug_pin(0, 1);
    n = 54321, base = 123;
    for (int i = 0; i < 10000; i++)
    {
        base = div64_32_gcc(&n, base);

        n += 65432;
        n *= 127; // 1st test was with 123
        base += 123;
    }
    debug_pin(0, 0);
}
Mishka
  • `-Os` reduces it to 77 lines (clang) versus 61 lines (gcc). – Ian Abbott Jul 12 '19 at 13:35
    Iar is Gcc underneath afair. I think you are just seeing compiler version improvements. Clang and Gcc often track as many people contribute to both. I think you can use ‘ldrd r0, r1, [r0]’ on the m7. – artless noise Jul 12 '19 at 15:06
  • BTW GCC 7 is an old compiler. The current one is GCC 9 – Basile Starynkevitch Jul 12 '19 at 15:38
  • lines and performance are not necessarily directly related, you can have twice as much code run faster than half as much, depends on the instruction and implementation. need to do more work if interested in performance. – old_timer Jul 13 '19 at 13:43
  • [div64.S](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/lib/div64.S?h=v5.2) and [div64.h](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/include/asm/div64.h?h=v5.2) are used on ARM. Also your CPU has `udiv`, so [64/32 on CPU with 32/16](https://stackoverflow.com/questions/4771823/64-32-bit-division-on-a-processor-with-32-16-bit-division) might help. Are you benchmarking compilers or are you actually concerned with this algorithm? I would say that the consensus of what you found is that this is expected clang performance. – artless noise Jul 14 '19 at 18:48
  • I am benchmarking compilers (IAR is good but not free...) So I tested this function and clang's result seems strange to me. – Mishka Jul 15 '19 at 05:11
  • @artlessnoise IAR has their own compiler, they are not using GCC. – Johan Jul 15 '19 at 06:28
  • @Johan thanks for the correction. They must have 'C' extensions to mimic GCC and maybe even GCC version numbers? Or perhaps I was mixed up with some other group that had an ARM compiler. A major pain in using commercial compilers (Diab for instance) is getting open source build systems/projects to work. Newer versions of GCC (ver8 on godbolt) seem to give even better code for this sample. Yet a programmer could alter the code to easily outperform any of the compilers by using `udiv` or at least `clz`. – artless noise Jul 15 '19 at 13:37

0 Answers