
I'm using C and I have two non-negative integers n and m (both >= 0, n < 500). I need to compute the value

n*(n+1)/2 + m

and this will be needed hundreds of millions of times so I want to optimize this as much as possible. My current implementation is:

inline int func(const int n, const int m) { return ( (n*(n+1) >> 1) + m); }

using inline and the >> 1 to do the divide by 2. Is there any other way to speed up this calculation?

vibe

7 Answers


Given that n will be less than 500, you can precompute all possible values of n*(n+1)/2 and put them in a table, then use that table to perform the computation:

int n_sum[500];

// call this once at program start
void init_sum()
{
    int i;
    for (i = 0; i < 500; i++) {
        n_sum[i] = i*(i+1)/2;
    }
}

inline int func(const int n, const int m)
{ 
    return n_sum[n] + m;
}
dbush
  • Absolutely do not do this! This is a terrible de-optimisation, and far worse than the original code! The original code will vectorise into about 6 ops per 8 or 16 ints, depending on AVX2/AVX512 (i.e. <1 op per integer). This 'optimisation' requires 4 ops per integer. LUTs are only an optimisation when the amount of work to calculate a value is large. This is not a complex operation, so do not cache it in a LUT. – robthebloke Aug 30 '19 at 06:09
  • it's not certain that accessing the lookup table `n_sum` at a given index will be significantly faster than computing the value `i*(i+1)/2` – Guillaume Petitjean Aug 30 '19 at 07:44
  • I did a quick and dirty try on ARM bare-metal: https://godbolt.org/z/Qqyirx. Not sure there is any benefit (same number of asm instructions; hard to deduce the computing time due to pipelining, but probably similar) – Guillaume Petitjean Aug 30 '19 at 07:48
  • It’s a clever idea, and applicable to many other inner loops, but you really need to profile to see if it’s a win in practice. This will certainly be slower on any modern CPU unless the lookup table is in the CPU cache, and maybe even then, because it only saves one or two instructions here. – Davislor Aug 30 '19 at 14:44
  • If this refactoring suppresses auto-vectorization of the inline function in a loop, it’s pessimal. – Davislor Aug 30 '19 at 14:59

In practice, what you want to do is write a loop that the compiler can easily and efficiently vectorize and parallelize. If you have two arrays n[i] and m[i], any modern compiler can probably figure out how to optimize n[i]*(n[i]+1)/2 + m[i] if given the proper flags. Trying to force the compiler to do optimizations on one word at a time will, in general, be counterproductive. Modern hardware is fastest when you parallelize your critical loops. If you don’t want to use non-portable intrinsics or libraries designed for the purpose, you can best achieve that by minimizing data dependencies and writing code that is easy to statically analyze.

You might or might not be able to improve the generated code with (n*n + n)/2 + m, that is, expanding the polynomial instead of keeping it in the factored form n*(n+1). This is efficient because it enables the code generator to use just one vector register as the accumulator, maximizing the number available for SIMD. You should use restrict and alignas as appropriate to enable maximum optimization.

(Edit: Right-shift of a negative number is implementation-defined, because it might be logical or arithmetic. The code I wrote does unsigned math, which lets the compiler optimize /2 to >>1 for you. In a comment, robthebloke brings up that, if you use signed rather than unsigned variables, and you know that they will always be non-negative, the compiler might not be able to statically deduce this and therefore might not optimize /2 to >>1. In that case, you can either write >>1 or cast (uint32_t)n[i] to do better-defined unsigned math. An unsafe-math optimization flag might also re-enable this.)

This kind of vectorization is likely to be faster than individual table lookups on each element.

The n*(n+1)/2 term alone can reach 499·500/2 = 124,750, so the result is too large for an unsigned short, and the smallest standard type that could hold it is int32_t or uint32_t. (Or uint_least32_t if you prefer.) Using an array of the smallest workable type allows maximum vectorization.

If you want to help the optimizer out, you can enable OpenMP and add a #pragma omp simd to tell the compiler explicitly to vectorize this loop. You could also use OpenMP to enable multi-threading.

In C++, you have the options of std::valarray<uint32_t> or expression templates, which are very elegant ways to express an embarrassingly-parallel computation such as this one.

The following program compiles to vectorized code on GCC, Clang or ICC when given the appropriate optimization flags. Clang compiles to a loop that computes 256 elements per iteration.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)
typedef uint_least32_t elem_t;

const elem_t n[N];
const elem_t m[N];
elem_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }

    return EXIT_SUCCESS;
}

You can attempt to add alignas specifiers to the arrays, but this will not actually cause GCC, Clang or ICC to perform aligned loads or stores. (There is a GCC extension to enable this optimization.)
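For example, a variant of the declarations above with alignment requests might look like the following sketch; the 64-byte figure is my assumption (it covers a typical cache line and a full AVX-512 vector), so pick your target's width:

#include <stdalign.h>
#include <stdint.h>

#define N (1L<<20)
typedef uint_least32_t elem_t;

/* 64-byte alignment is an assumption, not a measured recommendation. */
alignas(64) static const elem_t n[N];
alignas(64) static const elem_t m[N];
alignas(64) static elem_t a[N];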

If you enable OpenMP (-fopenmp in GCC or Clang), you can add the line

#pragma omp parallel for simd

immediately before the for loop (a plain #pragma omp for only distributes work across threads, and only inside an existing parallel region), and get a loop that is both multithreaded and vectorized, as sketched below. If there’s a way to significantly improve on that with standard, portable C, I’d love to know about it myself.
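For instance, a minimal sketch of that pragma applied to a batch function (the name batch is mine, not part of the original program):

#include <stddef.h>
#include <stdint.h>

typedef uint_least32_t elem_t;

/* With -fopenmp, the iterations are divided among threads, and each
   thread's chunk is a candidate for SIMD code generation. */
void batch(ptrdiff_t nelems, const elem_t *n, const elem_t *m, elem_t *a)
{
    #pragma omp parallel for simd
    for (ptrdiff_t i = 0; i < nelems; ++i)
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
}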

I wrote my MWE to be simple. In real-world code, you likely want to move the entire loop, of which this inner loop is part, out of main() and into a function such as

elem_t* func( const ptrdiff_t nelems,
              const elem_t n[nelems],
              const elem_t m[nelems],
              elem_t a[nelems]
            )
{
    for (ptrdiff_t i = 0; i < nelems; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }

    return a;
}

If you compare the generated assembly, you will see that it is not as efficient unless you inline it, largely because the compiler no longer knows the number of iterations at compile-time or has any information about the alignment of n, m or a.

You could also save some memory, but probably not computation time, by storing the input elements as uint16_t. The input arrays use half as much memory, but the loop cannot process more elements per vector than before, because the computation still widens each element to 32 bits. Be careful to cast the temporary values you use for the calculation to a type that will not overflow!

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)

const uint16_t n[N];
const uint16_t m[N];
uint32_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = ((uint32_t)n[i]*n[i] + n[i])/2 + m[i];
    }

    return EXIT_SUCCESS;
}
Davislor
  • > There is no reason to specify transformations such as >>1 for /2; the compiler is smart enough to do them for you. Not true. Change your definition for elem_t so that it is signed (and not unsigned). – robthebloke Aug 30 '19 at 06:21
  • That may well be the case, but that's the data type the OP is using. – robthebloke Aug 30 '19 at 06:24
  • With signed types, the result of right-shifting a negative value is implementation-defined. (It might be arithmetic or logical shift). Since the OP specified that the values are non-negative, it is safe in this case to use unsigned values, or an explicit right-shift if the compiler can’t figure it out. – Davislor Aug 30 '19 at 06:26
  • Are you really sure it's faster to go through a vector on any platform? I mean, did you test it? – Guillaume Petitjean Aug 30 '19 at 07:29
  • Also, I may be missing the point, but the OP didn't ask for a loop computing every value, but for a function computing a given value efficiently. What you are proposing is basically to compute every possible value in advance in array `a`. But if that is the chosen solution, it does not really matter to optimise a loop that runs only once at startup – Guillaume Petitjean Aug 30 '19 at 07:33
  • @GuillaumePetitjean I haven’t benchmarked it. I’m not sure what the use case is, either. If, for example, these will be generated sporadically and consumed immediately, it might not be suitable to buffer a large number of SIMD operations in arrays. However, if the problem can be expressed this way, vectorization will be faster on any modern hardware. – Davislor Aug 30 '19 at 07:33
  • @GuillaumePetitjean This solution would require refactoring the program to perform the operation in large batches. This need not be once at startup; the input arrays could be populated at runtime. If the problem does not allow that, the approach is unsuitable for this program—but possibly suitable for the next person who reads these answers. – Davislor Aug 30 '19 at 07:34
  • @robthebloke "There is no reason to specify transformations such as >>1 for /2; the compiler is smart enough" Well the quick test I did (see my answer) shows the contrary – Guillaume Petitjean Aug 30 '19 at 07:58
  • @GuillaumePetitjean I deleted that line and added a note about it. If you use unsigned variables, as I did, there is no need to specify it. If you use signed variables and the compiler is unable to tell that they are always non-negative, you might need to give it a hint. – Davislor Aug 30 '19 at 08:01
  • @GuillaumePetitjean If you were my downvoter, you’re welcome to your opinion, but I ask you to reconsider. Vectorization and parallelization are important techniques in HPC that will be useful to anyone else who finds this question later. – Davislor Aug 30 '19 at 08:08

In the end the question is: can you really do better than the straightforward implementation you already have?

Here is a quick test using arm-none-eabi-gcc at the -O2 optimisation level:

int func(int n, int m) 
{ 
    return ( (n*(n+1) >> 1) + m); 
}

compiles in:

func(int, int):
        mla     r3, r0, r0, r0
        add     r0, r1, r3, asr #1
        bx      lr

So two assembly instructions (excluding the bx lr that will disappear with inlining). I don't see how you can do a quicker implementation.

EDIT: just for fun, if you compile at -O0, here is what you get:

func(int, int):
        str     fp, [sp, #-4]!
        add     fp, sp, #0
        sub     sp, sp, #12
        str     r0, [fp, #-8]
        str     r1, [fp, #-12]
        ldr     r3, [fp, #-8]
        add     r3, r3, #1
        ldr     r2, [fp, #-8]
        mul     r3, r2, r3
        mov     r2, r3, asr #1
        ldr     r3, [fp, #-12]
        add     r3, r2, r3
        mov     r0, r3
        sub     sp, fp, #0
        ldr     fp, [sp], #4
        bx      lr

GCC can be very clever, you only have to tell it to be ;)

Guillaume Petitjean

I think a better approach is to ask whether you really need to compute this so many times. For example, if n is constant in the inner loop, could you calculate n*(n+1)/2 outside it? (Though it is possible that an optimising compiler will do this anyway.) Alternatively, if you are incrementing n in the inner loop, perhaps you could use

(n+1)*(n+2)/2 = n*(n+1)/2 + n + 1

to update n*(n+1)/2 instead of calculating it afresh each time.
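A sketch of what that update looks like in a loop; n0, n_end and use() are hypothetical placeholders for the real loop bounds and consumer:

/* tri tracks n*(n+1)/2 as n counts upward, so no multiply is needed
   inside the loop. */
int tri = n0 * (n0 + 1) / 2;       /* triangular number for the starting n */
for (int n = n0; n < n_end; ++n) {
    use(tri + m);                  /* tri == n*(n+1)/2 at this point */
    tri += n + 1;                  /* advance to (n+1)*(n+2)/2 */
}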

dmuir

You can use assembly instructions directly. In VC++ you can use the __asm keyword to open an assembly section inside a regular function, and call that function normally. For GCC-based compilers you can use asm() (extended inline assembly).
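For example, with GCC or Clang on x86-64, a sketch using extended asm might look like the following; it just mirrors what the compiler already emits, so treat it as an illustration of the mechanism rather than a measured optimisation:

static inline int func_asm(int n, int m)
{
    int result = n + 1;
    __asm__ ("imull %[n], %[res]\n\t"   /* res = n * (n + 1)       */
             "sarl  $1, %[res]\n\t"     /* res >>= 1 (divide by 2) */
             "addl  %[m], %[res]"       /* res += m                */
             : [res] "+r" (result)
             : [n] "r" (n), [m] "r" (m)
             : "cc");
    return result;
}

Bear in mind that an asm block is opaque to the optimizer, so it usually prevents the vectorization and constant propagation the other answers rely on.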

XPD

You say "this will be needed hundreds of millions of times", as if that's a lot. But these days, hundreds of millions of times is nothing.

I just wrote an obvious little program to perform n*(n+1)/2 + m 100,000,000 times. I did absolutely nothing fancy to try to make it "efficient". On an ordinary consumer-grade laptop, it ran in about half a second -- which is too fast to even time it accurately. So then I tried it for 100 times as long: 10,000,000,000 times. In that case it took about 52 seconds, which works out to about 5.2 nanoseconds per computation. (And there was some overhead involved, so the actual time per computation was even less.)
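A sketch along the lines of that program (the modulo-based input pattern is an arbitrary choice, not the exact code that produced the numbers above):

#include <stdio.h>
#include <time.h>

int main(void)
{
    const long long iters = 100000000LL;   /* hundreds of millions */
    volatile int sink = 0;                 /* defeat dead-code elimination */
    clock_t start = clock();
    for (long long i = 0; i < iters; i++) {
        int n = (int)(i % 500);            /* n < 500, per the question */
        int m = (int)(i % 1000);           /* arbitrary non-negative m */
        sink = n * (n + 1) / 2 + m;
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%lld calls in %.2f s = %.2f ns/call (sink=%d)\n",
           iters, secs, secs * 1e9 / iters, sink);
    return 0;
}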

Let's say you spend one hour trying to speed this function up. (You've probably spent almost that much time just posting your question to Stack Overflow and reading responses, not to mention the time we all spent replying.) Let's say you manage to speed it up by 50% (that is, make it twice as fast). Based on my result, you would have to run the function something like 1.4e12 times (that's over a trillion times) before you got your hour back.

So if you're going to be running this computation many trillions of times (not just hundreds of millions of times), then maybe (maybe!) spend some time trying to speed it up. Otherwise -- sorry to be a downer about this -- just don't bother.

See also this answer to a somewhat similar question.

(I'm not trying to suggest that efficiency is never important, but it's also important to keep your actual situation in perspective.)

Steve Summit
  • On an embedded device, with a basic MCU, running a single operation 100 million times is not negligible at all, and anyway even half a second might make the difference between a working and a non-working (or not user-friendly) application. It is pretty common to spend hours (or even days or weeks) to save a few cycles per loop. – Guillaume Petitjean Aug 30 '19 at 12:36
  • If 1 million people download/install your software and each person (on average) runs the software once per day for 3 years; then it's going to be executed 1 billion times; assuming that the calculation is only done once at process startup (each time the process is started). If the calculation is needed hundreds of millions of times each time the process is started; then we'd be looking at a total cost of "hundreds of millions of billions". – Brendan Aug 30 '19 at 13:19
  • @Brendan Sure. And that's why I was careful to say, "I'm not trying to suggest that efficiency is never important". But the OP did *not* say he was running on an embedded device. He did *not* say he expected his software to be downloaded billions of times and that each copy would execute the instructions in question hundreds of millions of times per copy. He said merely that he had to execute it hundreds of millions of times, and if that's true, I don't think he needs to worry. – Steve Summit Aug 30 '19 at 14:06
  • @GuillaumePetitjean *It is pretty common to spend hours... to save a few cycles per loop.* I understand. It is also pretty common to spend similar amounts of time microoptimizing code that truly doesn't need it, achieving nothing in the end other than wasting time and making the code buggier and harder to maintain. So it's a question of balance. – Steve Summit Aug 30 '19 at 14:08

You can use a recursive algorithm that multiplies two integers without using the multiplication operation at all, and that uses a minimal number of other arithmetic operations.

Note that the conventional way of multiplying two numbers digit by digit has complexity O(M*N), but this approach multiplies in O(log(N)) additions and shifts, where N is the smaller number.
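The code itself isn't reproduced in this answer; a minimal sketch of such a shift-and-add (Russian peasant) multiplication, my reconstruction rather than the original, might be:

/* Multiply x and y using only shifts, adds and comparisons.
   Recursion depth is O(log y), so pass the smaller operand as y. */
unsigned mult(unsigned x, unsigned y)
{
    if (y == 0)
        return 0;
    unsigned half = mult(x, y >> 1);       /* x * (y/2) */
    if (y & 1)
        return half + half + x;            /* odd y:  x*y = 2*half + x */
    else
        return half + half;                /* even y: x*y = 2*half */
}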

There is another algorithm for multiplying two integers, the Karatsuba algorithm, but I don't think you will need it, as it is better suited to very large numbers.

Hansraj Das
  • Your function is unlikely to be faster than what the OP has. If there is a multiply instruction on the target machine, it will almost surely be faster than multiple branches and recursions. If there weren't, the compiler would substitute a good one for the architecture. The asymptotic complexity is irrelevant here because the OP's numbers are bounded and can be multiplied in one op. – walnut Sep 21 '19 at 10:38