Fastest way to calculate a digit-sum for a large number (as a decimal string)

Question

I use gmplib to get big number and I calculate the numeric value (sum of the digits: 123 -> 6, 74 -> 11 -> 2)

Here is what I did :

unsigned short getnumericvalue(const char *in_str)
{
    unsigned long number = 0;
    const char *ptr = in_str;
     
     do {
         if (*ptr != '9') number += (*ptr - '0'); // Exclude '9'
         ptr++;
     } while (*ptr != 0);
     
     unsigned short reduced = number % 9;
    
     return reduced == 0 ? 9 : reduced;
}

It works well but is there a faster way to do it on a Xeon w-3235?

Did you try something with SSE? You can use [this](https://stackoverflow.com/q/36998538/555045) for part of it — harold, Sep 01 '20 at 09:14
How do you know you need to make your code faster? You have profiled it and found that this conversion is a performance bottleneck? — Andrew Henle, Sep 01 '20 at 09:21
Converting the bignum to a string will be much slower than summing the digits anyway, so this optimization seems pointless. — interjay, Sep 01 '20 at 09:22
Note that code is a lot easier to read if you use character literals instead of hard-coding ASCII values. — fuz, Sep 01 '20 at 09:32
What architecture and instruction set extensions are you programming for? This code looks like it is very easy to vectorise. Note that your attempt to exclude nines is likely going to make the code slower than just leaving them in. — fuz, Sep 01 '20 at 09:33
Why not directly compute `N % 9` (without converting your number to a string first) and replacing `0` by `9`, unless `N==0`? Or is your input a string? — chtz, Sep 01 '20 at 09:33
interjay, chtz I don't have a number , i have a string (const char *) — Stephane, Sep 01 '20 at 09:37
@Stephane Are you allowed to use all instruction set extensions this chip has? If yes, I need to know the exact model of your CPU, not just the microarchitecture. If no, please let me know what instruction set extensions I can use. Also, do you prefer a solution in assembly or one in intrinsics? — fuz, Sep 01 '20 at 09:39
fuz, i prefer assembly , intel xeon w-3235 ( https://www.intel.com/content/www/us/en/products/processors/xeon/w-processors/w-3235.html ) — Stephane, Sep 01 '20 at 09:42
@Stephane Okay. Note that I do not have macOS at hand, so some copy-editing will be required to make the code run on macOS. — fuz, Sep 01 '20 at 09:43
fuz, excluding the number 9 allows you to have a smaller number to reduce afterwards — Stephane, Sep 01 '20 at 09:54
@Stephane And how does this matter? The division is going to take a few cycles more in the worst case, but it only runs once. The conditional move runs every iteration of the loop. That's not fast. — fuz, Sep 01 '20 at 09:55
Note: code has undefined behavior with `getnumericvalue("");`. Suggest using a `while (*ptr != 0) { }`. loop. — chux - Reinstate Monica, Sep 01 '20 at 12:37
"but is there a fastest way to do it ?" --> step 1, `unsigned long number` --> `unsigned number`. No need for `long` here unless strings are longer than `UINT_MAX/10` — chux - Reinstate Monica, Sep 01 '20 at 12:43
@Stephane: your fix for `getnumericvalue("9")` breaks `getnumericvalue("0")`, which should return `0` IMHO. Also avoid patching the code in the question as it makes the comments and answers inconsistent. A simpler convention would be `return number % 9;`, ie: both `getnumericvalue("0")` and `getnumericvalue("9") produce `0`. — chqrlie, Sep 02 '20 at 14:47

fuz · Accepted Answer · 2020-09-04T09:51:45.983

You can use code like the following. The general idea of the algorithm is:

process data bytewise until we reach cacheline alignment
read one cacheline at a time, check for end of string, and add digits to 8 accumulators
reduce 8 accumulators to one and add counts from head
process remainder bytewise

Note that the code below has not been tested.

        // getnumericvalue(ptr)
        .section .text
        .type getnumericvalue, @function
        .globl getnumericvalue
getnumericvalue:
        xor %eax, %eax          // digit counter

        // process string until we reach cache-line alignment
        test $64-1, %dil        // is ptr aligned to 64 byte?
        jz 0f

1:      movzbl (%rdi), %edx     // load a byte from the string
        inc %rdi                // advance pointer
        test %edx, %edx         // is this the NUL byte?
        jz .Lend                // if yes, finish this function
        sub $'0', %edx          // turn ASCII character into digit
        add %edx, %eax          // and add to counter
        test $64-1, %dil        // is ptr aligned to 64 byte?
        jnz 1b                  // if not, process more data

        // process data in cache line increments until the end
        // of the string is found somewhere
0:      vpbroadcastd zero(%rip), %zmm1  // mask of '0' characters
        vpxor %xmm3, %xmm3, %xmm3       // vectorised digit counter

        vmovdqa32 (%rdi), %zmm0         // load one cache line from the string
        vptestmb %zmm0, %zmm0, %k0      // clear k0 bits if any byte is NUL
        kortestq %k0, %k0               // clear CF if a NUL byte is found
        jnc 0f                          // skip loop if a NUL byte is found

        .balign 16
1:      add $64, %rdi                   // advance pointer
        vpsadbw %zmm1, %zmm0, %zmm0     // sum groups of 8 bytes into 8 words
                                        // also subtracts '0' from each byte
        vpaddq %zmm3, %zmm0, %zmm3      // add to counters
        vmovdqa32 (%rdi), %zmm0         // load one cache line from the string
        vptestmb %zmm0, %zmm0, %k0      // clear k0 bits if any byte is NUL
        kortestq %k0, %k0               // clear CF if a NUL byte is found
        jc 1b                           // go on unless a NUL byte was found

        // reduce 8 vectorised counters into rdx
0:      vextracti64x4 $1, %zmm3, %ymm2  // extract high 4 words
        vpaddq %ymm2, %ymm3, %ymm3      // and add them to the low words
        vextracti128 $1, %ymm3, %xmm2   // extract high 2 words
        vpaddq %xmm2, %xmm3, %xmm3      // and add them to the low words
        vpshufd $0x4e, %xmm3, %xmm2     // swap qwords into xmm2
        vpaddq %xmm2, %xmm3, %xmm3      // and add to xmm0
        vmovq %xmm3, %rdx               // move digit counter back to rdx
        add %rdx, %rax                  // and add to counts from scalar head

        // process tail
1:      movzbl (%rdi), %edx     // load a byte from the string
        inc %rdi                // advance pointer
        test %edx, %edx         // is this the NUL byte?
        jz .Lend                // if yes, finish this function
        sub $'0', %edx          // turn ASCII character into digit
        add %rdx, %rax          // and add to counter
        jnz 1b                  // if not, process more data

.Lend:  xor %edx, %edx          // zero-extend RAX into RDX:RAX
        mov $9, %ecx            // divide by 9
        div %rcx                // perform division
        mov %edx, %eax          // move remainder to result register
        test %eax, %eax         // is the remainder zero?
        cmovz %ecx, %eax        // if yes, set remainder to 9
        vzeroupper              // restore SSE performance
        ret                     // and return
        .size getnumericvalue, .-getnumericvalue

        // constants
        .section .rodata
        .balign 4
zero:   .byte '0', '0', '0', '0'

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/220891/discussion-on-answer-by-fuz-fastest-way-to-calculate-a-digit-sum-for-a-large-num). — Samuel Liew, Sep 03 '20 at 02:20
As @Stephane pointed out in chat, `.align 16` means `.p2align 16` on Apple clang, degrading performance by introducing a *huge* amount of NOPs. Never use `.align`, always `.balign` or `.p2align` to avoid this ambiguity. — Peter Cordes, Sep 04 '20 at 09:33

chqrlie · Answer 2 · 2020-09-02T14:19:52.210

Here is a portable solution:

it handles the first few digits naively until ptr is properly aligned.
then it loops reading 8 digits at a time and sums the digits pairwise into an accumulator. Up to 28 such operations can be performed before splitting the 64-bit accumulator into number.
the test for termination verifies that all digits in the pack have high nibbles equal to 3.
the remaining digits are handled one by one.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned getnumericvalue_simple(const char *in_str) {
    unsigned long number = 0;
    const char *ptr = in_str;

    do {
        if (*ptr != '9') number += (*ptr - '0'); // Exclude '9'
        ptr++;
    } while (*ptr != 0);

    return number <= 9 ? number : ((number - 1) % 9) + 1;
}

unsigned getnumericvalue_naive(const char *ptr) {
    unsigned long number = 0;

    while (*ptr) {
        number += *ptr++ - '0';
    }
    return number ? 1 + (number - 1) % 9 : 0;
}

unsigned getnumericvalue_parallel(const char *ptr) {
    unsigned long long number = 0;
    unsigned long long pack1, pack2;

    /* align source on ull boundary */
    while ((uintptr_t)ptr & (sizeof(unsigned long long) - 1)) {
        if (*ptr == '\0')
            return number ? 1 + (number - 1) % 9 : 0;
        number += *ptr++ - '0';
    }

    /* scan 8 bytes at a time */
    for (;;) {
        pack1 = 0;
#define REP8(x) x;x;x;x;x;x;x;x
#define REP28(x) x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x
        REP28(pack2 = *(const unsigned long long *)(const void *)ptr;
              pack2 -= 0x3030303030303030;
              if (pack2 & 0xf0f0f0f0f0f0f0f0)
                  break;
              ptr += sizeof(unsigned long long);
              pack1 += pack2);
        REP8(number += pack1 & 0xFF; pack1 >>= 8);
    }
    REP8(number += pack1 & 0xFF; pack1 >>= 8);

    /* finish trailing bytes */
    while (*ptr) {
        number += *ptr++ - '0';
    }
    return number ? 1 + (number - 1) % 9 : 0;
}

int main(int argc, char *argv[]) {
    clock_t start;
    unsigned naive_result, simple_result, parallel_result;
    double naive_time, simple_time, parallel_time;
    int digits = argc < 2 ? 1000000 : strtol(argv[1], NULL, 0);
    char *p = malloc(digits + 1);
    for (int i = 0; i < digits; i++)
        p[i] = "0123456789123456"[i & 15];
    p[digits] = '\0';

    start = clock();
    simple_result = getnumericvalue_simple(p);
    simple_time = (clock() - start) * 1000.0 / CLOCKS_PER_SEC;

    start = clock();
    naive_result = getnumericvalue_naive(p);
    naive_time = (clock() - start) * 1000.0 / CLOCKS_PER_SEC;

    start = clock();
    parallel_result = getnumericvalue_parallel(p);
    parallel_time = (clock() - start) * 1000.0 / CLOCKS_PER_SEC;

    printf("simple:   %d digits -> %u, %7.3f msec\n", digits, simple_result, simple_time);
    printf("naive:    %d digits -> %u, %7.3f msec\n", digits, naive_result, naive_time);
    printf("parallel: %d digits -> %u, %7.3f msec\n", digits, parallel_result, parallel_time);

    return 0;
}

The timings for 100 million digits:

simple:   100000000 digits -> 3, 100.380 msec
naive:    100000000 digits -> 3,  98.128 msec
parallel: 100000000 digits -> 3,   7.848 msec

Note that the extra test in the posted version is incorrect as getnumericvalue("9") should produce 9, not 0.

The parallel version is 12 times faster than the simple version.

More performance might be obtained using AVX instructions via compiler intrinsics or even assembly language, but for very large arrays, memory bandwidth seems to be the limiting factor.

`if (pack2 < 0x3030303030303030) break` doesn't that depend on the bytes following the terminating `0` also being `0`? It's probably not strict-aliasing UB if the only other way you read/write some dynamically-allocated memory is through `char*`, but it's still deserves at least a comment to do that kind of pointer-casting. And it is UB if used on a `char[]` *array*; the may-alias thing only applies to access through `char*` accessing any object, not to `char` *objects* accessed with different pointer types. — Peter Cordes, Sep 02 '20 at 13:33
Also, the "simple" way branching on `'9'` looks better than it should when there's a regular pattern to the digits. (A `'9'` every 16 elements). Modern Intel branch predictors may handle that efficiently, if the compiler didn't make branchless code. — Peter Cordes, Sep 02 '20 at 13:36
@PeterCordes: Sorry, the test for string termination was bogus. I fixed it. Indeed I was surprised to see such a small degradation with the test for `'9'`. I should look at the assembly. — chqrlie, Sep 02 '20 at 14:00
@chqrlie "getnumericvalue("9") should produce 9, not 0." yes i fixed it. — Stephane, Sep 02 '20 at 14:07
Forgot to add earlier re: memory bandwidth limits: a modern x86 desktop / laptop can usually sustain 8 bytes from RAM per core clock cycle (or more, haven't done the math recently). Your unrolled loop can maybe come close to that, at 4 uops per qword not counting the pointer increment (thanks to unrolling). If data is hot in L2 or even L3 cache (e.g. from a `read()` system call with a reasonable block size to copy from the pagecache, if you don't use mmap), AVX2 or AVX512 can go significantly faster. — Peter Cordes, Sep 02 '20 at 14:09
Here is the result i have with the simd version : simple: 100000000 digits -> 3, 54.105 msec naive: 100000000 digits -> 3, 68.657 msec parallel: 100000000 digits -> 3, 7.709 msec simd: 100000000 digits -> 3, 6.297 msec — Stephane, Sep 02 '20 at 14:19
@Stephane: the substantial difference between `simple` and `naive` it is somewhat surprising. Your CPU is much faster than mine (core i5 2,7 GHz) for these two and gives the same timing for `parallel`. `simd` is only marginally faster. Did you do multiple runs? — chqrlie, Sep 02 '20 at 14:27
@chq Yes multiple run. its clang who do that see the chat with my latest results (with gcc and clang) — Stephane, Sep 02 '20 at 17:15

Fastest way to calculate a digit-sum for a large number (as a decimal string)

2 Answers2