8

I have code in C that performs addition the same way a human does: given two arrays A[0..n-1] and B[0..n-1], it computes C[0]=A[0]+B[0], C[1]=A[1]+B[1], and so on.

I need help making this function faster, even if the solution uses intrinsics.

My main problem is a serious dependency: iteration i+1 depends on the carry from iteration i, since I work in base 10. So if A[0]=6 and B[0]=5, C[0] must be 1 and I have a carry of 1 for the next addition.

The fastest code I could come up with is this:

void LongNumAddition1(unsigned char *Vin1, unsigned char *Vin2,
                      unsigned char *Vout, unsigned N) {
    for (int i = 0; i < N; i++) {
        Vout[i] = Vin1[i] + Vin2[i];
    } 

    unsigned char carry = 0;

    for (int i = 0; i < N; i++) {
        Vout[i] += carry;
        carry = Vout[i] / 10;
        Vout[i] = Vout[i] % 10;
    }
}

But I also tried these approaches, which turned out to be slower:

void LongNumAddition1(unsigned char *Vin1, unsigned char *Vin2,
                      unsigned char *Vout, unsigned N) {
    unsigned char CARRY = 0;
    for (int i = 0; i < N; i++) {
        unsigned char R = Vin1[i] + Vin2[i] + CARRY;
        Vout[i] = R % 10; CARRY = R / 10;
    }
}

void LongNumAddition1(char *Vin1, char *Vin2, char *Vout, unsigned N) {
    char CARRY = 0;
    for (int i = 0; i < N; i++) {
        char R = Vin1[i] + Vin2[i] + CARRY;
        if (R <= 9) {
            Vout[i] = R;
            CARRY = 0;
        } else {
            Vout[i] = R - 10;
            CARRY = 1;
        }
    }
}

I've been researching on Google and found some pseudocode similar to what I have implemented; GeeksforGeeks also has an implementation of this problem, but it is slower too.

Can you please help me?

chqrlie
  • Well, what kind of performance gain do you expect? All those algorithms are `O(n)`. The first one looks like a carry-lookahead adder, which would be the fastest hardware implementation. I think the only chance of getting this faster is using vector addition. But if you use `-O3`, the compiler is most likely doing this already. – Ackdari Apr 16 '20 at 12:19
  • @Ackdari as you said, I'm already using -O3 and -funroll-loops, but perf, as well as valgrind and qcachegrind, says that this function consumes 99.89% of the total time; that's why I need to improve it somehow... Can I change the structure in some way to achieve that? Or intrinsics or something? – Jonathan Sánchez Apr 16 '20 at 12:33
  • division (and modulus), compared with addition (and subtraction), performs very poorly. `if (a+b > 9) carry = 1; else carry = 0;` – pmg Apr 16 '20 at 12:40
  • A point which could definitely improve performance would be to use more of the value range of the array elements. At the moment you use 10 values out of 256 possible ones. I would advise you to also use a different data type like `uint_fast32_t`, because that way you would need fewer _real_ additions to do your addition. You might want to [check out](https://github.com/dotnet/runtime/blob/2eef5a3dc0f9afeb07a1aada1c5312fc013b7871/src/libraries/System.Runtime.Numerics/src/System/Numerics/BigIntegerCalculator.AddSub.cs#L59) how it is implemented in .NET Core – Ackdari Apr 16 '20 at 12:46
  • or try one of the libraries in [this](https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software) list so you don't have to bother with it yourself – Ackdari Apr 16 '20 at 12:47
  • @pmg Thank you for editing the answer below, I will try that too! – Jonathan Sánchez Apr 16 '20 at 13:00
  • @JonathanSánchez pmg did not edit the answer. The answer simply provided credit for the idea. – chux - Reinstate Monica Apr 16 '20 at 13:05
  • @Ackdari I was thinking about that too, but I don't know how to achieve it (or whether it will bring performance); in fact if my array is A=[9,6,5,3,2] and B=[3,2,1,4,5] I would need to add them like this: 23569+54123. And the result must be reversed too. That's (thinking quickly) a loop for reversing the first time, 4 multiplications by 10 to get the whole number into an integer, the sum, and the reversing again. Is that what you meant, or were you talking about something else? – Jonathan Sánchez Apr 16 '20 at 13:07
  • @chux-ReinstateMonica Oh, okay! Thank you too! :) – Jonathan Sánchez Apr 16 '20 at 13:09
  • What do you need this for? There might be a better solution for that. – Thorbjørn Ravn Andersen Apr 16 '20 at 13:11
  • @JonathanSánchez I was talking about not storing the number as a base-10 number but as, for example, a base-128 number if you want to use `uint_fast8_t`, or as a base-(2^16) number if you want to use `uint_fast32_t` as the element type for the arrays. – Ackdari Apr 16 '20 at 13:33
  • You then obviously need to convert your base-10 number to a base-128 number, but if you do multiple additions on the same data, that should speed things up at least a little – Ackdari Apr 16 '20 at 13:34
  • If you want to do this effectively, you should view numbers as polynomials and use polynomial arithmetic, with a final pass for carries. – Michaël Roy Apr 16 '20 at 13:36
  • @ThorbjørnRavnAndersen I'm writing code to work with big integer numbers, and I read a post [link](http://faculty.cse.tamu.edu/djimenez/ut/utsa/cs3343/lecture20.html) where this was presented as a good way to go, but I needed a fast way to do sums between two arrays. – Jonathan Sánchez Apr 16 '20 at 13:37
  • You should have read the whole article: _"A practical way to get more out of this algorithm is to increase `BASE`"_. This is the way to go if you really want to improve the performance of this algorithm – Ackdari Apr 16 '20 at 13:45
  • @Ackdari maybe it sounds really "newbie" (because I am one in C), but I don't see how converting my numbers from base 10 to base 128 will help performance. I mean, if I use base 128 they will remain the same as with base 10; the maximum number I get is 18 (9+9), right? I've read the whole article but I still don't see how... – Jonathan Sánchez Apr 16 '20 at 13:47
  • @MichaëlRoy do you have any post to read from it? Thank you! – Jonathan Sánchez Apr 16 '20 at 13:49
  • @JonathanSánchez it would be better because you have to do fewer iterations of the for-loop. It is the same reason why it is easier to do addition on paper with base-10 numbers than with binary numbers – Ackdari Apr 16 '20 at 13:50
  • @Ackdari Thank you for the explanation, that seems good. So then, if I understood correctly, my next step should be to convert every position of my array from base 10 to base 128, and work with these new numbers, right? – Jonathan Sánchez Apr 16 '20 at 14:04
  • @Ackdari *a base-128 number* Might as well go all the way to base-256 in each `uint_fast8_t`, and pack 8 of them into a `uint_fast64_t`. I strongly suspect that would be able to be simplified to something that might border on being amazingly trivial and blazingly fast.... ;-) – Andrew Henle Apr 16 '20 at 14:17
  • @JonathanSánchez What you are doing here is called ["binary coded decimal"](https://en.wikipedia.org/wiki/Binary-coded_decimal). It's a data format that's been around a ***long*** time as it allows precise calculations on, for example, financial data. Imagine how useful that would be on CPUs without floating point processing that are based on 8- or 16-bit instructions. As others have alluded to, there are quite a few ways to [pack more than one BCD decimal into larger data elements](https://en.wikipedia.org/wiki/Binary-coded_decimal#Packed_BCD) such as `uint32_t`. – Andrew Henle Apr 16 '20 at 14:29
  • @JonathanSánchez Polynomial arithmetic is a well-known field. There are dozens of math and programming sites with pages on the subject. Addition is the easiest to implement, but if you want to implement more complex operations, look up synthetic division, which can be used to do most arithmetic operations. Any good algebra book would be a good source of inspiration as well. Your CPU uses polynomial arithmetic to do its binary operations; it's that widespread. – Michaël Roy Apr 17 '20 at 01:57
  • You need to use base 2^64 (or base 10^19 which is 19 times faster than your method). See [Why to use higher base for implementing BigInt?](https://stackoverflow.com/q/10178055/995714), [How are extremely large floating-point numbers represented in memory?](https://stackoverflow.com/q/23840565/995714) – phuclv Apr 19 '20 at 13:31

5 Answers

6

If you don't want to change the format of the data, you can try SIMD.

#include <stdint.h>
#include <stddef.h>

typedef uint8_t u8x16 __attribute__((vector_size(16)));

void add_digits(uint8_t *const lhs, uint8_t *const rhs, uint8_t *out, size_t n) {
    uint8_t carry = 0;
    for (size_t i = 0; i + 15 < n; i += 16) {
        u8x16 digits = *(u8x16 *)&lhs[i] + *(u8x16 *)&rhs[i] + (u8x16){carry};

        // Get carries and almost-carries
        u8x16 carries = digits >= 10; // true is -1
        u8x16 full = digits == 9;

        // Shift carries
        carry = carries[15] & 1;
        __uint128_t carries_i = ((__uint128_t)carries) << 8;
        carry |= __builtin_add_overflow((__uint128_t)full, carries_i, &carries_i);

        // Add to carry chains and wrap
        digits += (((u8x16)carries_i) ^ full) & 1;
        // faster: digits = (u8x16)_mm_min_epu8((__m128i)digits, (__m128i)(digits - 10));
        digits -= (digits >= 10) & 10;

        *(u8x16 *)&out[i] = digits;
    }
}

This is ~2 instructions per digit. You'll need to add code to handle the tail-end.
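
For completeness, here is one way the scalar tail could look, appended inside add_digits so it reuses the carry left over from the vector loop (this is my sketch, not part of the original answer):

    // Hypothetical scalar tail for add_digits: handle the remaining n % 16 digits.
    for (size_t i = n & ~(size_t)15; i < n; i++) {
        uint8_t d = lhs[i] + rhs[i] + carry; // at most 9 + 9 + 1 = 19
        carry = d >= 10;                     // 1 if this digit overflowed
        out[i] = d - carry * 10;             // wrap back into 0..9
    }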


Here's a run-through of the algorithm.

First, we add our digits with our carry from the last iteration:

lhs           7   3   5   9   9   2
rhs           2   4   4   9   9   7
carry                             1
         + -------------------------
digits        9   7   9  18  18  10

We calculate which digits will produce carries (≥10), and which would propagate them (=9). With SIMD vector comparisons, true is -1 (all bits set).

carries       0   0   0  -1  -1  -1
full         -1   0  -1   0   0   0

We convert carries to an integer and shift it over, and also convert full to an integer.

              _   _   _   _   _   _
carries_i  000000001111111111110000
full       111100001111000000000000

Now we can add these together to propagate carries. Note that only the lowest bit of each byte is correct.

              _   _   _   _   _   _
carries_i  111100011110111111110000
(relevant) ___1___1___0___1___1___0

There are two indicators to look out for:

  1. carries_i has its lowest bit set, and digit ≠ 9. There has been a carry into this square.

  2. carries_i has its lowest bit unset, and digit = 9. There has been a carry over this square, resetting the bit.

We calculate this with (((u8x16)carries_i) ^ full) & 1, and add it to digits.

(c^f) & 1     0   1   1   1   1   0
digits        9   7   9  18  18  10
         + -------------------------
digits        9   8  10  19  19  10

Then we remove the 10s, which have all been carried already.

digits        9   8  10  19  19  10
(d≥10)&10     0   0  10  10  10  10
         - -------------------------
digits        9   8   0   9   9   0

We also keep track of carries out, which can happen in two places.

Veedrac
  • I've implemented it in my code and it made the code 4x faster. That's a huge difference! That is the power of SIMD vectorization. And you made it work. I've been trying to understand this code for some hours now and I still can't figure out how you manage the bits. I understand everything except this line: `digits += (((u8x16)carries_i) ^ full) & 1;` Can you give me a brief explanation? After your answer I will close this topic with this question as the valid one! Again, thank you! You're amazing! – Jonathan Sánchez Apr 17 '20 at 18:25
  • @JonathanSánchez Updated – Veedrac Apr 17 '20 at 19:05
  • Thank you! It really helped me out :) – Jonathan Sánchez Apr 17 '20 at 20:15
  • How could this sum code of two arrays uint8_t * be adapted to use uint32_t *? @Veedrac – Marc Apr 22 '20 at 12:53
  • @Marc It shouldn't be that different, just use a u32 SIMD array and adjust the comparisons. – Veedrac Apr 22 '20 at 16:35
  • I have tried adjusting the comparisons and executing the function with my uint32_t * arrays, but I still do not understand how to adjust the comparisons. How could I do it? @Veedrac – Marc Apr 22 '20 at 16:43
  • @Marc Figure out what values each bundle of digits overflows. You'll need to check something like 10⁹ and 10⁹-1. But maybe just ask another question on Stack Overflow. – Veedrac Apr 23 '20 at 06:29
  • I took into account the verification of 10^9 and 10^9-1. However, the result I get is not correct. Would any other verification be missing? @Veedrac – Marc Apr 23 '20 at 17:38
  • @Marc idk, maybe you forgot to change the shift amount. You're the one with the code. – Veedrac Apr 23 '20 at 23:55
  • Thanks, however I would need to change the type of my vector if I use uint32_t*, no? @Veedrac – Marc Apr 24 '20 at 11:37
4

Candidates for speed improvement:

Optimizations

Be sure you have enabled your compiler's speed optimization settings.

restrict

The compiler does not know that changing Vout[] does not affect Vin1[] and Vin2[], and is thus limited in certain optimizations.

Use restrict to indicate Vin1[], Vin2[] are not affected by writing to Vout[].

// void LongNumAddition1(unsigned char  *Vin1, unsigned char *Vin2, unsigned char *Vout, unsigned N)
void LongNumAddition1(unsigned char * restrict Vin1, unsigned char * restrict Vin2,
   unsigned char * restrict Vout, unsigned N)

Note: this restricts the caller from calling the function with a Vout that overlaps Vin1, Vin2.

const

Also use const to aid optimizations. const also allows const arrays to be passed as Vin1, Vin2.

// void LongNumAddition1(unsigned char * restrict Vin1, unsigned char * restrict Vin2,
//                       unsigned char * restrict Vout, unsigned N)
void LongNumAddition1(const unsigned char * restrict Vin1,
                      const unsigned char * restrict Vin2,
                      unsigned char * restrict Vout,
                      unsigned N)

unsigned

unsigned/int are the "go-to" types to use for integer math. Rather than unsigned char CARRY or char CARRY, use unsigned or uint_fast8_t from <stdint.h>.

% alternative

sum = a+b+carry; if (sum >= 10) { sum -= 10; carry = 1; } else carry = 0; (credit: @pmg) or the like.


Note: I would expect LongNumAddition1() to return the final carry.
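
Putting the pieces together, a sketch of what the combined function might look like (my assembly of the points above, under the assumption that the caller never passes overlapping arrays):

#include <stdint.h>

// restrict + const, no / or %, and the final carry is returned.
unsigned LongNumAddition1(const unsigned char * restrict Vin1,
                          const unsigned char * restrict Vin2,
                          unsigned char * restrict Vout, unsigned N) {
    unsigned carry = 0;
    for (unsigned i = 0; i < N; i++) {
        unsigned sum = Vin1[i] + Vin2[i] + carry;
        if (sum >= 10) { sum -= 10; carry = 1; } else { carry = 0; }
        Vout[i] = (unsigned char)sum;
    }
    return carry; // 1 if the result needs one more digit
}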

chux - Reinstate Monica
  • Won't adding `restrict` to `Vin1` and `Vin2` break adding a number to itself? – 0x5453 Apr 16 '20 at 12:36
  • @0x5453 Yes, `restrict` would break that - good observation. Yet if OP is looking for optimal speed in general, then it would be better to also provide an `+=` function to handle that special case. – chux - Reinstate Monica Apr 16 '20 at 12:39
  • In fact I did know about restrict and const, but I didn't realize I could use them here; really appreciate that! Also, about uint_fast8_t, which is the one I would need: why would it speed things up? Shall I change only the carry to this new type, or can I change the vectors too to get even more optimization? – Jonathan Sánchez Apr 16 '20 at 12:41
  • @JonathanSánchez Code could go either way. Recommend to start with just the internals of `LongNumAddition1()` and use `uint_fast8_t carry;` For very large arrays, `uint_fast8_t` could slow things down due to 2x or 4x memory size. For the one `carry`, this size difference is not important as much as using the processor's favorite size. – chux - Reinstate Monica Apr 16 '20 at 12:47
  • @JonathanSánchez I suspect your best approach is to write a test harness that exercises this code as used. That harness would report the time and others could run it. Perhaps this is all better suited to [codereview](https://codereview.stackexchange.com/)? IAC, this kind of optimization (micro-optimization) is often not as efficient (measured in your code development time) as higher-level work. Good luck. – chux - Reinstate Monica Apr 16 '20 at 12:51
  • Okay, I will try it for the carry then. Since my array is over 10000 positions, maybe it's not the best option to change it too. Also, what did you mean in the comment above about restrict? That I won't be able to do Vin1+Vin2 if I add the restrict keyword, or that I won't be able to do Vin1[i]+=1? Apart from that, to clarify your note in the answer: the original version did return the carry, but since I don't use it for anything I just dropped it. – Jonathan Sánchez Apr 16 '20 at 12:54
  • @JonathanSánchez `restrict` is a contract between the caller and the function. Oversimplified: caller promises reading/writing to `Vout` will not change/collide with `Vin1, Vin2`. Calling `LongNumAddition1foo(a, b, a, 42);` breaks that promise. `LongNumAddition1(a,b,c,42)` is still OK. `c` can't overlap `a,b`. – chux - Reinstate Monica Apr 16 '20 at 13:01
  • The biggest optimization here by far is merging the two loops into one. I fiddled around a bit with restrict and int size etc too but it didn't help the machine code much at all. – Lundin Apr 16 '20 at 13:32
  • Don't use `uint_fast8_t`. “fast” is a misnomer; it's just an integer where you don't know the size. You shouldn't use “unsigned” either, since it's a waste of space. `uint8_t` is good. – Veedrac Apr 16 '20 at 15:12
  • @Veedrac If space were the issue, I'd agree. Speed performance is the issue and I do not need the size. Processor instruction sets favor certain types for speed - hence the reason for and usefulness of `fast` types. Scant reason to think a `uint_fast8_t` would be slower than `uint8_t`, whereas the reverse is certainly true or equal. To be clear, I am not advocating `unsigned/uint_fast8_t` for the digits (where the size impact is multiplied by `n`, and the digit count is in the 10000s), just for `carry`, where its size is inconsequential. – chux - Reinstate Monica Apr 16 '20 at 15:55
  • If you're just changing `carry` there's zero difference between the three; codegen is identical. “Fast” is still a misnomer in the general case; larger types are significantly more likely to obstruct vectorization, for example, and the idea that processors only work efficiently on register-size values hasn't been true for ages. – Veedrac Apr 16 '20 at 16:11
  • The `_fastN` types make some sense if you're programming for oddball architectures with wack-o register sizes and such. For x86 it just adds inter-compiler variability. – Veedrac Apr 16 '20 at 16:15
2

To improve the speed of your bignum addition, you should pack more decimal digits into array elements. For example: you can use uint32_t instead of unsigned char and store 9 digits at a time.

Another trick to improve performance is to avoid branches.

Here is a modified version of your code without tests:

void LongNumAddition1(const char *Vin1, const char *Vin2, char *Vout, unsigned N) {
    char carry = 0;
    for (unsigned i = 0; i < N; i++) {
        char r = Vin1[i] + Vin2[i] + carry;
        carry = (r >= 10);
        Vout[i] = r - carry * 10;
    }
}

Here is a modified version dealing with 9 digits at a time:

#include <stdint.h>

void LongNumAddition1(const uint32_t *Vin1, const uint32_t *Vin2, uint32_t *Vout, unsigned N) {
    uint32_t carry = 0;
    for (unsigned i = 0; i < N; i++) {
        uint32_t r = Vin1[i] + Vin2[i] + carry;
        carry = (r >= 1000000000);
        Vout[i] = r - carry * 1000000000;
    }
}

You can look at the code generated by gcc and clang on GodBolt's Compiler Explorer.

Here is a small test program:

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int LongNumConvert(const char *s, uint32_t *Vout, unsigned N) {
    unsigned i, len = strlen(s);
    uint32_t num = 0;
    if (len > N * 9)
        return -1;
    while (N * 9 > len + 8)
        Vout[--N] = 0;
    for (i = 0; i < len; i++) {
        num = num * 10 + (s[i] - '0');
        if ((len - i) % 9 == 1) {
            Vout[--N] = num;
            num = 0;
        }
    }
    return 0;
}

int LongNumPrint(FILE *fp, const uint32_t *Vout, unsigned N, const char *suff) {
    int len;
    while (N > 1 && Vout[N - 1] == 0)
        N--;
    len = fprintf(fp, "%"PRIu32"", Vout[--N]);
    while (N > 0)
        len += fprintf(fp, "%09"PRIu32"", Vout[--N]);
    if (suff)
        len += fprintf(fp, "%s", suff);
    return len;
}

void LongNumAddition(const uint32_t *Vin1, const uint32_t *Vin2,
                     uint32_t *Vout, unsigned N) {
    uint32_t carry = 0;
    for (unsigned i = 0; i < N; i++) {
        uint32_t r = Vin1[i] + Vin2[i] + carry;
        carry = (r >= 1000000000);
        Vout[i] = r - carry * 1000000000;
    }
}

int main(int argc, char *argv[]) {
    const char *sa = argc > 1 ? argv[1] : "123456890123456890123456890";
    const char *sb = argc > 2 ? argv[2] : "2035864230956204598237409822324";
#define NUMSIZE  111  // handle up to 999 digits
    uint32_t a[NUMSIZE], b[NUMSIZE], c[NUMSIZE];
    LongNumConvert(sa, a, NUMSIZE);
    LongNumConvert(sb, b, NUMSIZE);
    LongNumAddition(a, b, c, NUMSIZE);
    LongNumPrint(stdout, a, NUMSIZE, " + ");
    LongNumPrint(stdout, b, NUMSIZE, " = ");
    LongNumPrint(stdout, c, NUMSIZE, "\n");
    return 0;
}
chqrlie
  • `carry = (r >= 10)` *is* a test. It might help the compiler avoid emitting a branch instruction, opting for a conditional move instead. Whether it's better than the `carry = r/10` in the OP depends on the exact CPU type. On some CPUs with hardware division but no conditional move, the division will be faster, on other CPUs without hardware division but conditional move instructions, the comparison will be faster. – cmaster - reinstate monica Apr 16 '20 at 14:19
  • I really like this answer. Anyway, I've tried the code and, as you can see [here](https://imgur.com/a/B7dcTgH), it has a really strange output. The two operands are ~25 positions each, and I suppose they convert to 100 when doing the LongNumConvert. After that the addition is done, and it results in a number of ~230 positions; do I have to convert this number, or what happened? – Jonathan Sánchez Apr 16 '20 at 14:31
  • @cmaster-reinstatemonica: `(c >= 10)` is a comparison. Depending on the CPU, it might generate a branch, but on current architectures it does not. r/10 usually does not compile to a hardware division, but to a multiplication and a shift. But as you say, it depends on the CPU. – chqrlie Apr 16 '20 at 14:57
  • @JonathanSánchez: sorry, my bad, big bug in `LongNumConvert`: I used the same index `i` for characters in the string and elements of the output array... – chqrlie Apr 16 '20 at 15:05
  • Wow, it's impressive, it does work now!! I want to ask you a final question: if A and B are both of length N, is there any change needed to LongNumConvert? – Jonathan Sánchez Apr 16 '20 at 15:52
  • @JonathanSánchez: this implementation is simplistic; it assumes that `a`, `b` and `c` are arrays of size `N`, which allows for numbers of up to `9*N` decimal digits. – chqrlie Apr 16 '20 at 15:58
  • Hi again, I've been doing some more tests and I've seen that it's doing the additions in a different order. So for example: `567854513218631683125312535 + 1` should be `667854513218631683125312535`. And the actual output is: `567854513218631683225312535`, where, as you can see, it has added the digit to the last group of 9: 12 turned into 22. How can this be solved? Thank you!! – Jonathan Sánchez Apr 16 '20 at 17:02
  • @JonathanSánchez: sorry, I was not thinking clearly... the string was converted in a bogus order. I updated the answer with a corrected and simplified version. – chqrlie Apr 16 '20 at 20:04
2

It is always rather pointless to discuss manual optimizations without a specific system in mind. If we assume you have some sort of mainstream 32-bitter with data cache, instruction cache and branch prediction, then:

  • Avoid the multiple loops. You should be able to merge them into one and thereby get a major performance boost. That way you don't have to touch the same memory area multiple times and you reduce the total number of branches. Every i < N must be checked by the program, so reducing the number of checks should give better performance. Also, this could improve data caching.

  • Do all operations on the largest aligned word size supported. If you have a 32-bitter, you should be able to have this algorithm work on 4 bytes at a time, rather than byte by byte. This means swapping out the byte-by-byte assignments for a memcpy somehow, doing 4 bytes at a time. That's how library-quality code does it.

  • Qualify the parameters properly. You really ought to be familiar with the term const correctness. Vin1 and Vin2 aren't changed, so they should be const, not just for the sake of performance, but for the sake of program safety and readability/maintainability.

  • Similarly, if you can vouch that the parameters are not pointing at overlapping memory regions, you can restrict qualify all the pointers.

  • Division is an expensive operation on many CPUs, so if it is possible to change the algorithm to get rid of / and %, then do that. If the algorithm works byte by byte, then you can sacrifice 256 bytes of memory to hold a look-up table (see the sketch after this list).

    (This assuming that you can allocate such a look-up table in ROM without introducing wait state dependencies etc).

  • Changing carry to a 32-bit type may give better code on some systems, worse on others. When I tried this out on x86_64, it gave slightly worse code by one instruction (a very minor difference).
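
Regarding the look-up table point, here is a minimal sketch (my illustration, with hypothetical table names; since a+b+carry is at most 9+9+1 = 19, a 20-entry table suffices, though you can pad to 256 entries to index with any byte value):

// Hypothetical tables replacing / 10 and % 10 for sums in 0..19.
static const unsigned char CARRY_LUT[20] =
    {0,0,0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1};
static const unsigned char DIGIT_LUT[20] =
    {0,1,2,3,4,5,6,7,8,9, 0,1,2,3,4,5,6,7,8,9};

void LongNumAdditionLUT(const unsigned char *Vin1, const unsigned char *Vin2,
                        unsigned char *Vout, unsigned N) {
    unsigned char carry = 0;
    for (unsigned i = 0; i < N; i++) {
        unsigned char sum = Vin1[i] + Vin2[i] + carry; // at most 19
        carry   = CARRY_LUT[sum];
        Vout[i] = DIGIT_LUT[sum];
    }
}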

Lundin
2

The first loop

for (int i = 0; i < N; i++) {
    Vout[i] = Vin1[i] + Vin2[i];
} 

is auto-vectorized by the compiler. But the next loop

for (int i = 0; i < N; i++) {
    Vout[i] += carry;
    carry = Vout[i] / 10;
    Vout[i] = Vout[i] % 10;
}

contains a loop-carried dependence, which essentially serializes the entire loop (consider adding 1 to 99999999999999999 - it can only be computed sequentially, 1 digit at a time). Loop-carried dependence is one of the biggest headaches in modern computer science.

So that's why the first version is faster - it is partially vectorized. This is not the case with any other version.

How can the loop-carried dependence be avoided?

Computers, being base-2 devices, are notoriously bad with base-10 arithmetic. Not only does it waste space, it also creates artificial carry dependencies between every digit.

If you can turn your data from base-10 to base-2 representation, then it will become easier for the machine to add the two arrays, because the machine can easily perform binary addition of many bits in a single iteration. A well-performing representation could be, for example, uint64_t on a 64-bit machine. Note that streaming addition with carry is still problematic for SSE, but some options exist there as well.
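
To illustrate, a minimal sketch of base-2^32 limb addition in portable C (the function name and layout are mine: least-significant limb first, one uint32_t per limb):

#include <stdint.h>
#include <stddef.h>

uint32_t add_limbs(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry; // cannot overflow 64 bits
        out[i] = (uint32_t)sum;                       // low 32 bits of the sum
        carry = (uint32_t)(sum >> 32);                // 0 or 1
    }
    return carry;
}

Each iteration adds 32 bits (roughly 9.6 decimal digits) instead of a single digit.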

Unfortunately, it is still hard for C compilers to generate efficient loops with carry propagation. For this reason, libgmp, for example, implements bignum addition not in C but in assembly language, using the ADC (add with carry) instruction. By the way, libgmp could be a direct drop-in replacement for a lot of the bignum arithmetic functions in your project.
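
If you go that route, here is a minimal usage sketch of GMP's low-level mpn layer (assuming GMP is installed and 64-bit limbs for the comment values; link with -lgmp):

#include <gmp.h>
#include <stdio.h>

int main(void) {
    // Two 2-limb numbers, least-significant limb first.
    mp_limb_t a[2] = { ~(mp_limb_t)0, 0 };   // 2^64 - 1
    mp_limb_t b[2] = { 1, 0 };               // 1
    mp_limb_t r[2];
    mp_limb_t carry = mpn_add_n(r, a, b, 2); // r = a + b; returns the carry out
    printf("r[0]=%lu r[1]=%lu carry=%lu\n",
           (unsigned long)r[0], (unsigned long)r[1], (unsigned long)carry);
    return 0;
}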

rustyx
  • As you said, I tried to vectorize the code by splitting it into two fors... In fact, now I found that I can split it into three and vectorize the third loop too. The change is to move the last Vout[i]=Vout[i]%10 to another for instead of the second one. My main problem now remains in the second loop, where I add the carry and calculate the next one. I've read the post you linked about SSE problems, but I cannot extract how to perform the partial-word arithmetic. Also I've seen libgmp's code for [adding with carry](https://github.com/haiku/buildtools/blob/master/gcc/gmp/mpn/cray/add_n.c). – Jonathan Sánchez Apr 16 '20 at 15:28
  • And as I'm a newbie in terms of C, I didn't understand the low-level operations they do well enough to replace my code with them... Can you please help me with the paragraph about the base-2 representation? I've understood that I must change my digits from base 10 to base 2, but I don't understand the uint64_t part... Are you talking about combining 8 "digits" into a uint64? Thank you. – Jonathan Sánchez Apr 16 '20 at 15:30
  • That's a Cray version. It simply carries over the 64th bit, and will not perform on x86, which has a hardware carry flag. x86 version is [here](https://github.com/alisw/GMP/blob/master/mpn/x86_64/aors_n.asm). I don't see a way to meaningfully improve your second loop without changing the data representation. You could try unrolling, i.e. computing 4..8 digits in a single iteration, but that's about it I guess... By data representation I mean binary representation like `123` represented as `1*10^2 + 2*10 + 3 = binary 01111011` – rustyx Apr 16 '20 at 15:39
  • My processor has the x86-64 ISA, so I will go for the new link you posted in this reply. But anyway, what I don't fully understand is what I have to change in my data representation... As you said, 123 would be represented as 01111011 in binary; may I take 3 or 4 positions of my array, pack them into binary and save them in a uint64 variable? That's what I don't get... And sorry for bothering – Jonathan Sánchez Apr 16 '20 at 15:46
  • Yes you can pack 9 digits in 32 bits or 19 digits in 64 bits and easily add 9 or 19 digits per iteration, but that won't be truly base-2. In base-2 you pack all the digits in as many bits as needed. libgmp has a function [mpn_set_str](https://github.com/alisw/GMP/blob/master/mpn/generic/set_str.c) for that. – rustyx Apr 16 '20 at 16:04