-6

I've got an assignment where I must take a program and make it more efficient in terms of time. The original code is:

#include <stdio.h>
#include <stdlib.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    long int help;
    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...

        int     j;

        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
            help++;
            }

        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.

        }
    // You can add some final code between this comment ...

    // ... and this one.

    return 0;
}

I almost exclusively modified the second for loop by changing it to

double *j = array;
double *p = array + ARRAY_SIZE;

for (; j < p; j += 10) {
    sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
}

This on its own was able to reduce the time enough to meet the criteria. It already seems to work, but are there any mistakes I'm not seeing?

Aj Mallison
  • The most obvious optimisation is to ditch the outer loop and change the inner loop to `sum += array[j] * N_TIMES;`, but that seems to not be allowed. (Of course, since the array is full of zeroes, you can replace both loops with `sum = 0;`...) – molbdnilo Aug 10 '15 at 12:05
  • Your optimization should also be done by the compiler; which compiler/settings do you use? I would suggest multithreading and "divide et impera". – flotto Aug 10 '15 at 12:08
  • Did your code change work? Looks to me that it can't even compile. – Support Ukraine Aug 10 '15 at 12:17
  • @flotto this question has exactly nothing to do with "compilers". – The Paramagnetic Croissant Aug 10 '15 at 12:17
  • You could try a [Duff's device](https://en.wikipedia.org/wiki/Duff%27s_device) (or other [loop unrolling](https://en.wikipedia.org/wiki/Loop_unrolling)) but it may well run slower. – Kninnug Aug 10 '15 at 12:18
  • @TheParamagneticCroissant But the question is about optimization, and optimization is very compiler-related, because you should know how a compiler works to do some micro-optimization. For example: the thread starter did some loop unrolling, but the compiler should do the same automatically. – flotto Aug 10 '15 at 12:23
  • @flotto non sequitur. The task is to perform optimizations manually. If the question was about compiler optimizations, then OP would presumably have no task, since his micro-optimizations would surely be done by any reasonably modern compiler. – The Paramagnetic Croissant Aug 10 '15 at 12:26
  • @TheParamagneticCroissant However, the question is not a code review. But without any information about the toolchain and target system it's hard to give good advice. – flotto Aug 10 '15 at 12:34
  • It's already running fast enough; the only change I've made is the one you see in front of you, though I did declare j and p outside of the outer loop. Also, I just realized a typo: it should say array+ARRAY_SIZE instead of array_, sorry :S – Aj Mallison Aug 10 '15 at 12:59
  • it should be j+=10, not j+10 – samgak Aug 10 '15 at 13:04
  • oh crap, typo, i'm sorry – Aj Mallison Aug 10 '15 at 13:05
  • A problem with your solution is that ARRAY_SIZE must be a multiple of 10 (i.e. 10 * N); otherwise your code will read outside the array. – Support Ukraine Aug 10 '15 at 14:37
  • @Kninnug: You don't need Duff's device to jump into an unrolled loop at an arbitrary point. The iteration count is a compile-time constant, so simple unrolling is sufficient. (Of course, unrolling with multiple accumulators is *much* better, especially for FP math, because of the longer instruction latencies.) – Peter Cordes Aug 10 '15 at 19:04
  • but... ARRAY_SIZE is a multiple of 10... so it still works no? @StillLearning – Aj Mallison Aug 10 '15 at 22:23
  • @AjMallison: Since array-size is a multiple of your unroll, you didn't need any cleanup code at the end to handle the last few elements. If you were writing a function that had to work on arbitrary input sizes, you would have needed some. – Peter Cordes Aug 11 '15 at 05:02
  • 1
    @AjMallison - True - currently ARRAY_SIZE is a multiple of 10 but ... When a piece of code contains a #define of some size, it is usually expected that the value of the #define can be changed to (any) other value without breaking the intended functionality. That is best practice. As an alternative you should add code checking that ARRAY_SIZE is indeed a multiple of 10 and give an error (compile time) if that isn't the case. – Support Ukraine Aug 11 '15 at 06:55
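A minimal sketch of that compile-time check, assuming a C11 compiler (_Static_assert is C11; older compilers would need a workaround such as a negative-size array typedef):

// Fail the build, not the run, if the unroll assumption ever breaks.
_Static_assert(ARRAY_SIZE % 10 == 0,
               "ARRAY_SIZE must be a multiple of the unroll factor (10)");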

3 Answers

8

I posted an improved version of this answer on a duplicate of this: C loop optimization help for final assignment. It was originally just a repost, but then I made some changes to answer the differences in that question. I forget what's different, but you should probably read that one instead. Maybe I should just delete this one.

See also other optimization guides in the tag wiki.


First of all, it's a really crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.

Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
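For example, the usual structure is something like this (a sketch; the file and function names are hypothetical):

// sum_array.c - compiled separately from main(), so the compiler can't
// fold the known array contents or iteration counts into this function.
#include <stddef.h>

double sum_array(const double *array, size_t n)
{
    double sum = 0;
    for (size_t j = 0; j < n; j++)
        sum += array[j];
    return sum;
}

main() in another file then calls sum_array(array, ARRAY_SIZE) inside its repeat-loop, and that's what you time.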

Also:

gcc -Wall -O3 -march=native  fast-loop-cs201.c -o fl
fast-loop-cs201.c: In function ‘main’:
fast-loop-cs201.c:17:14: warning: ‘help’ is used uninitialized in this function [-Wuninitialized]
     long int help; 

I have to agree with EOF's disparaging remarks about your prof. Giving out code that optimizes away to nothing, and with uninitialized variables, is utter nonsense.

Some people are saying in comments that "the compiler doesn't matter", and that you're supposed to optimize your C source for the CPU microarchitecture, rather than letting the compiler do it. This is crap: for good performance, you have to be aware of what compilers can do, and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing something.

I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.

Besides multithreading with OpenMP, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
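For instance, a minimal OpenMP version of the repeat-loop could look like this (a sketch; compile with -fopenmp, and note that the reduction changes the FP rounding order slightly):

// Each thread accumulates a private partial sum over its share of the
// outer iterations; OpenMP combines the partial sums at the end.
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N_TIMES; i++) {
    for (int j = 0; j < ARRAY_SIZE; j++)
        sum += array[j];
}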

Further essential reading: Agner Fog's optimization manuals (referenced again below), plus the other links in the tag wiki.

Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. However, usually the difference in rounding error is small. So the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it. This may have been what your unroll-by-10 allowed.

Instead of just unrolling, keeping multiple accumulators which you only add up at the end can keep the floating point execution units saturated, because FP instructions have latency != throughput. If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
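(As a rule of thumb, the minimum number of independent accumulators needed to saturate a unit is latency times per-clock throughput: 3 * 1 = 3 for FP add on Sandybridge, 4 * 2 = 8 on Skylake.)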

In this case, there's also basic stuff like pulling loop-invariant work out of the loop, e.g. replacing the inner-loop help++ with a single help += ARRAY_SIZE, like this:
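// A sketch against the question's original loops: same value of help
// at the end, but one add per outer iteration instead of one per element.
for (i = 0; i < N_TIMES; i++) {
    for (j = 0; j < ARRAY_SIZE; j++)
        sum += array[j];
    help += ARRAY_SIZE;   // hoisted out of the inner loop
}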

compiler options

I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):

  • gcc -O0 fast-loop-cs201.c -o fl: 16.43s. Performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
  • -O1: 4.87s
  • -O2: 4.89s
  • -O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
  • -O3 -ffast-math -funroll-loops: 2.439s
  • -O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
  • -Ofast ...: no gain
  • -O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns 4 threads total, but the inner loop is too short for it to be a win (because it collects the sums every time, instead of giving one thread the first 1/4 of the outer loop iterations).
  • -Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
    -Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s

  • clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang doesn't support -march=sandybridge).

gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.

clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.

vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4

It's actually slower when I tell it that the array is aligned. (With a stupid hack like array = (double*)((ptrdiff_t)array & ~31);, which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) I think the way the tight loop of 4x vaddpd mem,%ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. uop throughput shouldn't be an issue, though, since this code is only getting 0.65 insns per cycle (and 0.93 uops / cycle), according to perf.

Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. I guess it is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. The compiler is making a good choice here.

Source level changes

As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:

for (j = 0; j < ARRAY_SIZE; j+=4) {  // unroll 4 times
    sum0 += array[j];
    sum1 += array[j+1];
    sum2 += array[j+2];
    sum3 += array[j+3];
}

and then don't collect the 4 accumulators into one until after the end of the outer loop, like this:
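// After the outer loop (a sketch, using the sum0..sum3 from above):
// pairing the adds keeps the final reduction tree shallow.
sum += (sum0 + sum1) + (sum2 + sum3);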

Your source change of

sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];

actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain: order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and find the parallelism to keep those medium-latency, high-throughput FP execution units fed.

With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. (Technically, at every "sequence point", as the C standard calls it.) Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.


Using 4 accumulators and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which is what would happen as written for other sizes). It's really easy to get something wrong and read past the end of the array when doing this; see the sketch below.
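For reference, cleanup for the general case might look like this (a sketch; ARRAY_SIZE = 10000 doesn't need it):

int j;
for (j = 0; j + 3 < ARRAY_SIZE; j += 4) {   // unrolled body: groups of 4
    sum0 += array[j];
    sum1 += array[j+1];
    sum2 += array[j+2];
    sum3 += array[j+3];
}
for (; j < ARRAY_SIZE; j++)   // cleanup loop: the last 0..3 elements
    sum0 += array[j];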

gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.


Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:

// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    long int help = 0;

    typedef double v4df __attribute__ ((vector_size (8*4)));
    v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};

    const size_t array_bytes = ARRAY_SIZE*sizeof(double);
    double *aligned_array = NULL;

    // this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
    if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
        exit (1);
    }
    memcpy(aligned_array, array, array_bytes);  // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop

    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...
    /*
    #if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
        array = __builtin_assume_aligned(array, 32);
    #else
        // force-align for other compilers.  This loop-invariant will be done outside the loop.
        array = (double*) ((ptrdiff_t)array & ~31);
    #endif
    */

        assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) );  // We don't have a cleanup loop to handle where the array size isn't a multiple of 16


        // incrementing pointers can be more efficient than indexing arrays
        // esp. on recent Intel where micro-fusion only works with one-register addressing modes
        // of course, the compiler can always generate pointer-incrementing asm from array-indexing source
        const double *start = aligned_array;

        while ( (ptrdiff_t)start & 31 ) {
            // annoying loops like this are the reason people use aligned buffers
            sum += *start++;        // scalar until we reach 32B alignment
            // in practice, this loop doesn't run, because we copy into an aligned buffer
            // This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
        }

        const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
        for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
            sum0 += p[0];   // p+=4 increments the pointer by 4 * 4 * 8 bytes
            sum1 += p[1];       // make sure you keep track of what you're incrementing
            sum2 += p[2];
            sum3 += p[3];

        }

        // the compiler might be smart enough to pull this out of the inner loop
        // in fact, gcc turns this into a 64bit movabs outside of both loops :P
        help+= ARRAY_SIZE;

            // ... and this one. But your inner loop must do the same
            // number of additions as this one does.

        /* You could argue legalese and say that
         if (i == 0) {
             for (j ...)
                 sum += array[j];
             sum *= N_TIMES;
         }
         * still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
         */
    }

    // You can add some final code between this comment ...
    sum0 = (sum0 + sum1) + (sum2 + sum3);
    sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
    printf("sum = %g; help=%ld\n", sum, help);  // defeat the compiler.

    free (aligned_array);
    free (array);  // not strictly necessary, because this is the end of main().  Leaving it out for this special case is a bad example for a CS class, though.
    // ... and this one.

    return 0;
}

The inner loop compiles to:

  4007c0:       c5 e5 58 19             vaddpd (%rcx),%ymm3,%ymm3
  4007c4:       48 83 e9 80             sub    $0xffffffffffffff80,%rcx   # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
  4007c8:       c5 f5 58 49 a0          vaddpd -0x60(%rcx),%ymm1,%ymm1   # one-register addressing mode can micro-fuse
  4007cd:       c5 ed 58 51 c0          vaddpd -0x40(%rcx),%ymm2,%ymm2
  4007d2:       c5 fd 58 41 e0          vaddpd -0x20(%rcx),%ymm0,%ymm0
  4007d7:       4c 39 c1                cmp    %r8,%rcx  # compare p with end
  4007da:       75 e4                   jne    4007c0 <main+0xb0>

(For more, see online compiler output at godbolt. Note I had to cast the return value of calloc, because godbolt uses C++ compilers, not C compilers. The inner loop is from .L3 to jne .L3. See https://stackoverflow.com/tags/x86/info for x86 asm links. See also Micro fusion and addressing modes, because this Sandybridge change hasn't made it into Agner Fog's manuals yet.).

performance:

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec 
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000

 Performance counter stats for './fl3-vec':

       1086.571078      task-clock (msec)         #    1.000 CPUs utilized          
     4,072,679,849      cycles                    #    3.748 GHz                    
     2,629,419,883      instructions              #    0.65  insns per cycle        
                                                  #    1.27  stalled cycles per insn
     4,028,715,968      r1b1                      # 3707.733 M/sec  # unfused uops
     2,257,875,023      r10e                      # 2077.982 M/sec  # fused uops.  lower than insns because of macro-fusion
     3,328,275,626      stalled-cycles-frontend   #   81.72% frontend cycles idle   
     1,648,011,059      stalled-cycles-backend    #   40.47% backend  cycles idle   
       751,736,741      L1-dcache-load-misses     #  691.843 M/sec                  
            18,772      cache-misses              #    0.017 M/sec                  

       1.086925466 seconds time elapsed

I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfer per cycle, which should keep up with the one 32B FP vector add per cycle.

32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).

Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.

0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/

update: I guess L2 latency was the problem after all. Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) IDK why the HW prefetcher can't get ahead after one stall, and then stay ahead. Possibly software prefetch could help; see the sketch below. Maybe somehow avoid having the HW prefetcher run past the array, and instead start prefetching the start of the array again. (Loop tiling is a much better solution, see below.)
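If you wanted to experiment with that, a prefetch hint could be dropped into the vector loop like this (a sketch using the SSE prefetch intrinsic; the 512-byte prefetch distance is a guess that would need tuning):

#include <xmmintrin.h>   // for _mm_prefetch

for (const v4df *p = (v4df *)start; p+3 < end; p += 4) {
    _mm_prefetch((const char *)p + 512, _MM_HINT_T0);  // hint future lines into L1
    sum0 += p[0];
    sum1 += p[1];
    sum2 += p[2];
    sum3 += p[3];
}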

Intel CPUs only have 32kiB each of L1-data and L1-instruction cache. I think your array would just barely fit in the L1 on an AMD CPU.

Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.

Also, it's pretty silly: N_TIMES is just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do that would be to increment i in the part of the code we're allowed to modify:

for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j] + array[j] + array[j] + array[j];
}
i += 3;  // The inner loop does 4 total iterations of the outer loop

More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
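An interchanged version would be something like this (a sketch; it changes the order of the FP additions, which is why a compiler needs -ffast-math to do it for you):

// Stream through memory once: each element is loaded one time and
// stays in a register while it's re-added N_TIMES times.
for (j = 0; j < ARRAY_SIZE; j++) {
    for (i = 0; i < N_TIMES; i++)
        sum += array[j];
}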

A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do several stages of work on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
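Applied here, a tiled version might look like this (a sketch; BLOCK is a hypothetical tile size chosen to fit in L1D):

#define BLOCK 1024   // 1024 doubles = 8kiB, comfortably inside a 32kiB L1D

// Run all N_TIMES repeats over one cache-sized block before moving on,
// so each block is pulled in from memory once instead of N_TIMES times.
for (int block = 0; block < ARRAY_SIZE; block += BLOCK) {
    int block_end = (block + BLOCK < ARRAY_SIZE) ? block + BLOCK : ARRAY_SIZE;
    for (int i = 0; i < N_TIMES; i++)
        for (int j = block; j < block_end; j++)
            sum += array[j];
}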

You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.

Peter Cordes
  • That's nice but... I forgot to mention I'm not allowed to use compiler options to make it more efficient. – Aj Mallison Aug 10 '15 at 22:32
  • @AjMallison: That's insane. I expect the main reason your `sum += a[0] + a[1] + a[2] + ... a[9]` source change was faster with `-O0` was doing more in one expression, leading to the compiler doing 10 adds in registers without storing to RAM. It does also change the order of operations from one long loop-carried dependency chain to a series of independent dep chains (of groups of 10), with enough work to hide the latency of adding it to `sum`. Part of my answer explained why this was helpful. The usual technique is to use multiple accumulators, but this does work. – Peter Cordes Aug 11 '15 at 03:55
  • I already explained in my answer why benchmarking / performance-tuning code at `-O0` was silly and not always related to performance when compiling for real. Some of my answer does focus on source-level changes that might help even at `-O0`, but I'm not going to add anything about writing bigger expressions so the compiler doesn't store to RAM as often! That's not a useful thing to learn. – Peter Cordes Aug 11 '15 at 04:03
  • 1
    This foolish question just got asked again. It seems to be a programming assignment for some course in Portland Community College (CS201), and it's the most ridiculous "optimization" assignment I have ever witnessed. – vgru Aug 22 '17 at 22:03
  • 1
    @Groo: Thanks, found it and dup-hammered it. – Peter Cordes Aug 22 '17 at 22:51
2

I would try this for the inner loop:

    double* tmp = array;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += *tmp;  // Use a pointer
        tmp++;        // because it is faster to increment the pointer
                      // than it is to recalculate array+j every time
        help++;
    }

or better:

double* tmp = array;
double* end = array + ARRAY_SIZE; // Get rid of variable j by calculating 
                                  // the end criteria and
while (tmp != end) {              // just compare if the end is reached
    sum += *tmp;
    tmp++;
    help++;
}
Support Ukraine
  • Would you care to explain why your alternatives are better? – Mast Aug 10 '15 at 12:35
  • But... wouldn't that be slightly less efficient, since now j is an integer instead of a pointer? – Aj Mallison Aug 10 '15 at 13:01
  • I did a quick check on my system (using another N_TIMES value). The original code took 28 sec. My first proposal 23 sec and my second proposal was 20 sec. BTW - I don't understand your question about j as a pointer - j has never been a pointer. – Support Ukraine Aug 10 '15 at 13:23
  • The measurements in my comment above were done with -O0, but if I change that to -O3 (and add a print of sum at the end of the program to avoid everything being optimized away) the execution time is 10 sec. – Support Ukraine Aug 10 '15 at 14:01
  • @Mast - For the first proposal my idea was that it was faster to increment a pointer instead of always recalculating the new address (i.e. calculating array + j). For the second proposal my idea was to get rid of j so I didn't need to spend time on incrementing j. – Support Ukraine Aug 10 '15 at 14:18
  • I also checked the original code compiled with -O3 and compared it to my second proposal compiled with -O3. The result was in both cases four instructions, i.e. `faddl, add, cmp, jne`, and the execution time was the same (of course). – Support Ukraine Aug 10 '15 at 14:41
0

I think you should read about a multithreading library, if you're allowed to use one. But this is such a simple example that I think it otherwise can't be optimized much.

One thing that is certain is that you don't need to declare i and j before the for loop. This would do:

for (int i = 0; i < N_TIMES; i++)
frogatto
  • I was actually considering threading, but I wasn't sure if that might be against the rules, since it would mean messing with how the variable i is incremented, even if all the additions happen anyway. Either way, once I realized the code was already fast enough, I felt like I solved the problem too easily. – Aj Mallison Aug 10 '15 at 12:52