
I have code in C that multiplies each element of an array by a single-digit number (0-9), producing a series of base-10 digits.

My problem is that this function takes longer to run than I expected; I need it to be faster. I know that the main obstacle to optimizing my function is the dependency on the carry. How could this code be modified to work around that problem and run faster? It's fine for the solution to use intrinsics or other specialized techniques.

My fastest version so far is this:

void ConstMult( uint8_t *V, size_t N, uint8_t digit )
{
  uint8_t CARRY = 0;
  for ( size_t i=0; i< N; ++i )
  {
    V[i] = V[i] * digit + CARRY;                       // at most 9*9 + 8 = 89, fits in uint8_t
    CARRY = ((uint32_t)V[i] * (uint32_t)0xCCCD) >> 19; // fixed-point division by 10
    V[i] -= (CARRY << 3) + (CARRY << 1);               // subtract CARRY * 10 (8*CARRY + 2*CARRY)
  }
}
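
For reference, the magic constant works because 0xCCCD / 2^19 is slightly above 1/10, and the rounding error is small enough that the shifted product equals x / 10 for every value the loop can produce (at most 89). A small self-check (a hypothetical helper, not part of the hot path):

static void check_div10_trick(void)
{
    // Verify that the fixed-point trick matches integer division by 10
    // for every value the inner loop can produce (0..89, checked up to 255).
    for (uint32_t x = 0; x < 256; ++x)
        assert(((x * 0xCCCDu) >> 19) == x / 10u);
}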

But I also tried these approaches which were slower:

uint8_t ConstMult( uint8_t *V, size_t N, uint8_t digit )
{
  uint8_t CARRY = 0;
  for ( int i=0; i< N; i++ ) 
  {
    char R = V[i] * digit + CARRY;
    CARRY = R / 10;
    R = R - CARRY*10;
    V[i] = R;
  }
  return CARRY; // may be from 0 to 9
}

uint8_t ConstMult(uint8_t *V, size_t N, uint8_t digit)
{
  uint8_t CARRY = 0;
  uint8_t ja = 0;
  for (size_t i = 0; i < N; ++i) {
    uint8_t aux = V[i] * digit;
    uint8_t R = aux + CARRY;
    CARRY = ((uint32_t)R*(uint32_t)0xCCCD) >> 19;
    ja = (CARRY << 3) + 2*CARRY;
    R -= ja;
    V[i] = R;
  }
  return CARRY;
}
Des
  • You'll get better answers if you ask questions clearly. What's in the input array (0-9 only, or other values)? What environment are you working in (processor, compiler, options you've tried, OS)? What are the constraints (is a GPU available)? What do you need to achieve (a few per cent, twice as fast, ...)? – Gene Apr 18 '20 at 23:27
  • Looks a lot like [this question](https://stackoverflow.com/questions/61295979/multiply-large-numbers-of-50000-digits) – Tom Karzes Apr 18 '20 at 23:31
  • And [this question](https://stackoverflow.com/questions/61277166) and [this question](https://stackoverflow.com/questions/61249942). – user3386109 Apr 18 '20 at 23:35
  • One observation I'll make is that `CARRY = R / 10;` followed by `R = R - CARRY*10;` is an anti-optimization. Division instructions typically produce both the quotient and the remainder, so if you write that as `CARRY = R / 10;` and `R = R % 10;` a decent compiler will translate it to a single division and use both results (see the sketch after these comments). – user3386109 Apr 18 '20 at 23:39
  • @user3386109 You're essentially right, but reputable compilers these days won't use a division instruction there at all. They'll multiply by an appropriate inverse like the poster did in the by-hand optimized code. – Gene Apr 18 '20 at 23:46
  • My input array is made up of a series of random numbers. I'm compiling with gcc (-xc -Ofast -msse2 -flax-vector-conversions -ffast-math -funroll-all-loops --param max-unroll-times=50 -ftree-vectorize -fopt-info-vec-missed). I need to get this operation done as fast as possible. My processor is a Core i7 950. @Gene – Des Apr 18 '20 at 23:47
  • @Des This info belongs in the question. What is the range of the random digits? – Gene Apr 18 '20 at 23:52
  • Can you use vector math on that 2009-vintage CPU? SIMD can help here. A GPU could also crush through this in near zero time with CUDA or OpenGL/OpenCL. – tadman Apr 18 '20 at 23:58
  • Do you need to keep the digits separate, or can you convert to/from big integers? Because there are C libraries which can handle arbitrary length integer arithmetic. While most libraries are for C++, arbitrary length arithmetic is essential for cryptography, so you might be able to use the optimised math parts of a cryptography library like libgcrypt or OpenSSL (Wikipedia has a list of such libraries). – John Bayko Apr 19 '20 at 00:17
  • What is the range of `digit`? What is the range of `V[]`? – chux - Reinstate Monica Apr 19 '20 at 01:01
  • @user3386109 I think we have a contest between Des, cbas444, Jonathan Sánchez and maybe others to see who can work Stackoverflow the best. – chux - Reinstate Monica Apr 19 '20 at 01:05
  • @user3386109 ... and [lever](https://codereview.stackexchange.com/q/240749/29485) and [john](https://codereview.stackexchange.com/q/240732/29485) – chux - Reinstate Monica Apr 19 '20 at 01:18
  • I got plenty rep, I don't mind down/close voting all these timewasters. – Martin James Apr 19 '20 at 01:22
  • The range of digit is 0 to 9, and V[] has N elements; in my case N is 10000. @chux-ReinstateMonica – Des Apr 19 '20 at 08:06
  • @JohnBayko The problem is that I can't use libraries. – Des Apr 19 '20 at 08:08
  • @tadman How could SIMD instructions help me in this case? Could you give me some examples of how they might allow me to improve my function? – Des Apr 19 '20 at 08:11
  • Des, try table-lookup: [example](https://en.wikipedia.org/wiki/Lookup_table#Counting_bits_in_a_series_of_bytes) – chux - Reinstate Monica Apr 19 '20 at 10:55
  • @Des SIMD won't help you in this case because there's a carry dependency on the previous cycle. You need to change to a bigger base instead: [Why to use higher base for implementing BigInt?](https://stackoverflow.com/q/10178055/995714), [How are extremely large floating-point numbers represented in memory?](https://stackoverflow.com/q/23840565/995714), [practical BigNum AVX/SSE possible?](https://stackoverflow.com/q/27923192/995714) – phuclv Apr 19 '20 at 13:26

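To make user3386109's observation concrete, here is a sketch of the second version's inner loop written with a matched `/ 10` and `% 10` pair, which a decent compiler can turn into a single division (or a single multiply-by-inverse sequence). The function name is hypothetical and this variant was not benchmarked in the question:

uint8_t ConstMultDivMod(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;
    for (size_t i = 0; i < N; ++i)
    {
        int R = V[i] * digit + CARRY;
        CARRY = (uint8_t)(R / 10);   // quotient ...
        V[i]  = (uint8_t)(R % 10);   // ... and remainder from the same division
    }
    return CARRY;
}
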
2 Answers


Here is a function that handles the block 2 bytes at a time without divisions, using an ancillary table:

uint8_t ConstMult3(uint8_t *V, size_t N, uint8_t digit) {
#define TABLE_SIZE  ((9 * 256 + 9) * 9 + 9 + 1)
    static uint32_t table[TABLE_SIZE];
    if (!table[1]) {
        for (uint32_t x = 0; x < TABLE_SIZE; x++) {
            uint32_t u = x % 256 % 10;
            uint32_t d = (x / 256 + x % 256 / 10) % 10;
            uint32_t c = (x / 256 + x % 256 / 10) / 10;
            //table[x] = u | (d << 8) | (c << 16);
            // modified following Jerome Richard's comment
            table[x] = c | (u << 8) | (d << 16);
        }
    }
    if (N == 0 || digit <= 1) {
        if (digit == 0)
            memset(V, 0, N);
        return 0;
    } else {
        size_t CARRY = 0;

        if ((uintptr_t)V & 1) {  // V is misaligned
            int R = V[0] * digit + (uint8_t)CARRY;
            CARRY = (uint8_t)(R / 10);
            V[0] = (uint8_t)(R - CARRY * 10);
            V++;
            N--;
        }
        {   // handle aligned block 2 bytes at a time
            uint16_t *V2 = (uint16_t *)(void *)V;
            size_t N2 = N / 2;
            for (size_t i = 0; i < N2; i++) {
                uint32_t x = table[V2[i] * digit + CARRY];
                //V2[i] = (uint16_t)x;
                //CARRY = x >> 16;
                // modified following Jerome Richard's comment
                V2[i] = (uint16_t)(x >> 8);
                CARRY = (uint8_t)x;
            }
        }
        if (N & 1) {    // handle last byte
            int R = V[N - 1] * digit + (uint8_t)CARRY;
            CARRY = (uint8_t)(R / 10);
            V[N - 1] = (uint8_t)(R - CARRY * 10);
        }
        return (uint8_t)CARRY;
    }
#undef TABLE_SIZE
}
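
For reference, a minimal driver along these lines can be used to exercise `ConstMult3` (a hypothetical harness, not the exact one that produced the timings below):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t N = 1000000;
    uint8_t *V = malloc(N);
    if (!V) return 1;
    for (size_t i = 0; i < N; i++)
        V[i] = rand() % 10;              // random decimal digits, least significant first

    uint8_t carry = ConstMult3(V, N, 7); // multiply the whole number by 7

    unsigned long sum = 0;               // checksum of the resulting digits
    for (size_t i = 0; i < N; i++)
        sum += V[i];
    printf("carry=%u sum=%lu\n", (unsigned)carry, sum);
    free(V);
    return 0;
}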

On my slow laptop, using clang 9.0 in 64-bit mode, I get these timings, where ConstMult0, ConstMult1 and ConstMult2 are the functions posted in the question:

ConstMult0(1000000): 15.816ms sum0=4495507, sum=4501418
ConstMult1(1000000): 16.464ms sum0=4495507, sum=4501418
ConstMult2(1000000): 16.483ms sum0=4495507, sum=4501418
ConstMult3(1000000): 9.644ms sum0=4495507, sum=4501418

EDIT: following Jérôme Richard's comment, a small change in the table contents squeezes an extra 11% performance improvement:

ConstMult0(1000000): 15.837ms sum0=4500384, sum=4495487
ConstMult1(1000000): 16.494ms sum0=4500384, sum=4495487
ConstMult2(1000000): 16.482ms sum0=4500384, sum=4495487
ConstMult3(1000000): 8.537ms sum0=4500384, sum=4495487
chqrlie
  • The computation is limited by the instructions on the critical path manipulating `CARRY` (a sequential dependency chain). Your solution is impressive, but it can be optimized by reducing the critical path further. I propose you use the line `table[x] = c | (u << 8) | (d << 16);` to fill the table and the two lines `V2[i] = x >> 8;` `CARRY = (uint8_t)x;` in the main computation loop. This is about 10% faster on my machine and gives the same results. – Jérôme Richard Apr 19 '20 at 22:11
  • @JérômeRichard: Excellent suggestion! I actually pondered on this alternative and thought it would not make much of a difference, without testing. Such a humbling experience. When optimizing, assume nothing, benchmark carefully and let the clock be the judge. – chqrlie Apr 20 '20 at 07:49
  • Thank you very much for your explanation; I find this alternative way of carrying out the operation very effective and interesting. @JérômeRichard – Des Apr 20 '20 at 09:38
  • Thank you very much for your explanation; I find this alternative way of carrying out the operation very effective and interesting. @chqrlieforyellowblockquotes – Des Apr 20 '20 at 09:39
  • If I don't misunderstand, divisions by 10 and multiplications by 10 could also be optimized using expressions like `(R << 3) + (R << 1)` (for the multiplication). No? – Des Apr 20 '20 at 10:32
  • @Des The multiplication by 10 is usually implemented with shifts and additions, with `LEA` on x86 architectures; it is surprising that you get better timings doing it explicitly than letting the compiler optimise. Note that changing `char R` to `int R` in your second function improved performance by more than 10% on my Mac. – chqrlie Apr 20 '20 at 10:48
  • Thanks, I've been testing the code on godbolt, and on the line `uint32_t x = table[V2[i] * digit + CARRY];` I get the message "missed not vectorized not suitable for gather load". If I don't misunderstand, this is because it loads from non-consecutive memory positions, right? Could you fix this or optimize it in any way? – Des Apr 20 '20 at 11:05
  • @Des: I don't know how to interpret this message. The real path to efficiency is in changing the internal representation of the big number. Using a byte for each decimal digit is painfully slow. Packing decimal digits in groups of 9 per 32-bit word would be much better, and using a binary representation would further improve overall efficiency if there are multiple steps before the final result is computed. If `ConstMult` is used to implement the inner loop in a bignum multiplication, you don't actually need to propagate the carry until you add all the intermediary steps to get the final result. – chqrlie Apr 20 '20 at 11:29
  • @Des: not propagating the carry should allow the compiler to unroll the loop and vectorize the code. If I remove the carry propagation and write `V[i] *= digit;` in the inner loop, the timing drops to `0.583ms`. Using 16-bit intermediary storage would allow you to have `V1[i] += V[i] * digit;` as an inner operation without carry propagation, and a final reduction to single decimal digits only on the final result (see the sketch after this comment thread): a 10x to 20x overall performance improvement. Packing the digits would add 5x to 10x on top of that. What is the greater picture? What is this function used for? – chqrlie Apr 20 '20 at 11:38
  • But not propagating the carry would produce a different result, wouldn't it? The idea of packing the digits to improve performance also seems a very interesting concept to me. The function is mainly used to multiply an array by the digit at a particular position of another array. – Des Apr 20 '20 at 12:41
  • @Des: multiplying the array by an individual digit of another array is the elementary step in classic multiplication. If the goal is to complete the multiplication of 2 arrays, using a different intermediary format is highly recommended. Other algorithms exist for bignum multiplication that are more efficient. See: [Arbitrary-precision arithmetic](https://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic), [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm) and [Schönhage–Strassen algorithm](https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm). – chqrlie Apr 20 '20 at 13:03
  • I have changed the array type from `uint8_t *` to `uint32_t *`; how could this function be modified to work correctly for that case? Could you give me an example, please? – Des Apr 24 '20 at 12:03
  • Changing the type is one thing, but what are the new semantics? What do these `uint32_t` contain, what is the purpose of this function, and what is the general picture of the project? I'm afraid this discussion is off topic here. – chqrlie Apr 24 '20 at 12:38
  • I changed the data type since I thought that if I used a larger array type the program could be more efficient when executing this function. My arrays are made up of N values, in my case 10000 values. So I thought that changing the type of my arrays from uint8_t* to uint32_t* could help the performance of the function. However, when changing my type of arrays I run into the problem of how to modify the function so that it works for my new data type. – Des Apr 24 '20 at 13:53
  • As long as you store a single decimal digit in each array element, `uint8_t` seems the correct type for the array element. You can test if you get better performance with a larger type by changing it also in the rest of the program, but the array will be 4 times larger, so the program will read and write 4 times more RAM. Sometimes results go against intuition, benchmarking is the only way to know and only for the current machine. – chqrlie Apr 24 '20 at 15:31
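
To illustrate the deferred-carry idea discussed in the comments above (a sketch with hypothetical function names, assuming a separate 16-bit accumulator array): partial products are accumulated with no carry chain at all, which lets the compiler vectorize freely, and the carries are propagated only once at the end.

// Accumulate V * digit into a 16-bit accumulator without propagating carries.
// Safe as long as each acc[i] stays below 65536 between normalizations
// (each accumulation adds at most 9 * 9 = 81, so roughly 800 accumulations fit).
static void AccumulateMult(uint16_t *acc, const uint8_t *V, size_t N, uint8_t digit)
{
    for (size_t i = 0; i < N; ++i)
        acc[i] += (uint16_t)(V[i] * digit);   // no carry chain: vectorizes easily
}

// Propagate the carries once, reducing back to single decimal digits.
// In a full multiplication the accumulator would be sized so that the
// returned carry is zero.
static uint32_t NormalizeDigits(uint16_t *acc, uint8_t *out, size_t N)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < N; ++i) {
        uint32_t t = acc[i] + carry;
        out[i] = (uint8_t)(t % 10);
        carry = t / 10;
    }
    return carry;
}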

Here is another implementation (much faster than the others):

void ConstMult4(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;

    const uint32_t coef7  = digit * 10000000;
    const uint32_t coef6  = digit * 1000000;
    const uint32_t coef5  = digit * 100000;
    const uint32_t coef4  = digit * 10000;
    const uint32_t coef3  = digit * 1000;
    const uint32_t coef2  = digit * 100;
    const uint32_t coef1  = digit * 10;
    const uint32_t coef0  = digit;

    static uint8_t table[10000][4];
    static int init = 1;

    if(init)
    {
        for(int i=0 ; i<10000 ; ++i)
        {
            table[i][0] = (i / 1) % 10;
            table[i][1] = (i / 10) % 10;
            table[i][2] = (i / 100) % 10;
            table[i][3] = (i / 1000) % 10;
        }

        init = 0;
    }

    for(size_t i=0 ; i<N/8*8 ; i+=8)
    {
        const uint32_t val = V[i+7]*coef7 + V[i+6]*coef6 + V[i+5]*coef5 + V[i+4]*coef4 + V[i+3]*coef3 + V[i+2]*coef2 + V[i+1]*coef1 + V[i+0]*coef0 + CARRY;

        CARRY = val / 100000000;

        const uint32_t loVal = val % 10000;
        const uint32_t hiVal = val / 10000 - CARRY * 10000;
        const uint8_t* loTablePtr = &table[loVal][0];
        const uint8_t* hiTablePtr = &table[hiVal][0];

        // Assume the compiler optimizes the two following calls
        // (otherwise the performance could be quite bad).
        // memcpy is used to prevent performance issues due to pointer aliasing.
        memcpy(V+i, loTablePtr, 4);
        memcpy(V+i+4, hiTablePtr, 4);
    }

    for(size_t i=N/8*8 ; i<N ; ++i)
    {
        V[i] = V[i] * digit + CARRY;
        CARRY = V[i] / 10;
        V[i] -= CARRY * 10;
    }
}

This implementation assumes that the values in V and digit are actual decimal digits (0-9). It is significantly faster than the other methods by:

  • working with a bigger base internally, as proposed by @phuclv: each block of 8 decimal digits is combined into a single 32-bit value (base 10^8), which shortens the critical path and introduces more parallelism;
  • using a lookup table, as proposed by @chqrlieforyellowblockquotes: it enables very fast computation of the division/modulus operations needed to split the result back into digits.

This code can even be improved further by using SSE 4.1 intrinsics (SIMD instructions), at the cost of less portable code (although it will work on most modern x86_64 processors). Here is the implementation:

void ConstMult5(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;

    static uint8_t table[10000][4];
    static int init = 1;

    if(init)
    {
        for(int i=0 ; i<10000 ; ++i)
        {
            table[i][0] = (i / 1) % 10;
            table[i][1] = (i / 10) % 10;
            table[i][2] = (i / 100) % 10;
            table[i][3] = (i / 1000) % 10;
        }

        init = 0;
    }

    __m128i coefs1 = _mm_set_epi16(1000, 100, 10, 1, 1000, 100, 10, 1);
    __m128i coefs2 = _mm_set_epi32(10000*digit, 10000*digit, digit, digit);

    for(size_t i=0 ; i<N/16*16 ; i+=8)
    {
        // Require SSE 4.1 (thus smmintrin.h need to be included)
        const __m128i vBlock = _mm_loadu_si128((const __m128i*)&V[i]); // load 16 x uint8_t values (only half is used)
        const __m128i v = _mm_cvtepu8_epi16(vBlock); // Convert the block to 8 x int16_t values
        const __m128i tmp1 = _mm_madd_epi16(v, coefs1); // Compute the sum of adjacent pairs of v * coefs1 and put this in 4 x int32_t values
        const __m128i tmp2 = _mm_add_epi32(tmp1, _mm_shuffle_epi32(tmp1, 0b10110001)); // Horizontal partial sum of 4 x int32_t values
        const __m128i tmp3 = _mm_mul_epu32(tmp2, coefs2); // Compute tmp2 * coefs2 and put this in 2 x int64_t values
        const uint32_t val = _mm_extract_epi64(tmp3, 1) + _mm_extract_epi64(tmp3, 0) + CARRY; // Final horizontal sum with CARRY

        CARRY = val / 100000000;

        const uint32_t loVal = val % 10000;
        const uint32_t hiVal = val / 10000 - CARRY * 10000;
        const uint8_t* loTablePtr = &table[loVal][0];
        const uint8_t* hiTablePtr = &table[hiVal][0];

        // See the memcpy remark in the code above (alternative version).
        memcpy(V+i, loTablePtr, 4);
        memcpy(V+i+4, hiTablePtr, 4);
    }

    for(size_t i=N/16*16 ; i<N ; ++i)
    {
        V[i] = V[i] * digit + CARRY;
        CARRY = V[i] / 10;
        V[i] -= CARRY * 10;
    }
}

Here are the performance results (repeated and averaged over 1000 runs using random inputs) on my machine (with an i7-9700KF processor):

ConstMult0(10000): 11.702 us
ConstMult3(10000): 6.768 us (last optimized version)
ConstMult4(10000): 3.569 us
ConstMult5(10000): 2.552 us

The final SSE-based version is 4.6 times faster than your original implementation!
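
As a sanity check (a hypothetical helper, not part of the benchmark above), the SSE version can be compared against a straightforward scalar reference on random digit arrays:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Reference: the simple carry-propagating loop.
static void ConstMultRef(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t carry = 0;
    for (size_t i = 0; i < N; ++i) {
        unsigned r = V[i] * digit + carry;
        carry = (uint8_t)(r / 10);
        V[i]  = (uint8_t)(r % 10);
    }
}

static void check_against_reference(size_t N, uint8_t digit)
{
    uint8_t *a = malloc(N), *b = malloc(N);
    if (!a || !b) { free(a); free(b); return; }
    for (size_t i = 0; i < N; ++i)
        a[i] = b[i] = rand() % 10;       // same random digits in both copies
    ConstMultRef(a, N, digit);
    ConstMult5(b, N, digit);             // ConstMult5 as defined above
    assert(memcmp(a, b, N) == 0);        // results must match digit for digit
    free(a);
    free(b);
}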

Jérôme Richard
  • Thanks for your response; it is a very interesting way to optimize. However, when compiling as C I don't have a bool type nor the option to use std::copy. I understand it's a C++ version, but how could I convert it to C code? – Des Apr 21 '20 at 13:24
  • How could you convert the lines `std::copy(loTablePtr, loTablePtr+4, V+i);` and `std::copy(hiTablePtr, hiTablePtr+4, V+i+4);` so that the code can run in C? – Des Apr 21 '20 at 15:23
  • Sorry, I missed that you used C and not C++. I have translated the C++ parts to C. – Jérôme Richard Apr 21 '20 at 17:16
  • Thanks, your optimization was very interesting. – Des Apr 21 '20 at 19:00
  • I've been thinking about changing my array type from uint8_t* to uint32_t*. How would you modify this function to be able to use it for an array uint32_t*? – Des Apr 22 '20 at 12:04
  • Do you plan to use a bigger base? If not, you can just replace the `memcpy` calls with plain loops in the scalar version (the SSE version does not support that). If so, I advise you to use a power of two, as the code will be much faster that way (probably even the original code, if the base is big enough); in that case you do not need the lookup table anymore. If you need to use a base that is not a power of two and is quite big, the table lookup optimization does not work anymore (I do not see any way to optimize the code in that case). – Jérôme Richard Apr 22 '20 at 21:06
  • @JérômeRichard: Congratulations! Did you benchmark with clang or gcc? I get consistent results with clang 9.0.0 on iOS, but gcc generates SIMD instructions which I have not timed. [Godbolt](https://godbolt.org/z/2eJkZL) shows that both gcc and clang ignore the `coef` precomputations and prefer 6 multiplications by explicit constants and an additional multiplication by digit. – chqrlie Apr 26 '20 at 16:18
  • @chqrlieforyellowblockquotes Thank you! I used GCC (I remember Clang was a bit slower, but still fast). The GCC code mostly uses the `coef` variables (only one seems to be replaced by `lea` instructions); GCC starts by precomputing them. The `coef` variables are not compile-time constants, since `digit` is not either. Clang, yes, recomputes the products by compile-time constants, which is surprising. The `imul` instructions are not so bad here since most can be computed in parallel; on my CPU, `imul` has a 1-cycle throughput and a 3-cycle latency, and `add`/`lea` are likely to be computed in parallel too. – Jérôme Richard Apr 26 '20 at 18:52
  • @JérômeRichard: indeed, only clang ignores the coef precomputations, gcc precomputes the `coef*` variables and produces unfathomable and lengthy SIMD code. – chqrlie Apr 26 '20 at 19:11
  • @JérômeRichard: You are working on R++? Very impressive! I would like to meet you, after the lockdown of course :) – chqrlie Apr 26 '20 at 19:13
  • Yes, with pleasure :)! We can talk privately if you like. – Jérôme Richard Apr 26 '20 at 23:23