
I'm studying for an exam and am trying to understand the following problem: I have this C code to do some array initialisation:

int i, n = 61440;
double x[n];
for(i=0; i < n; i++) {
  x[i] = 1;
}

But the following runs faster (0.5s difference in 1000 iterations):

int i, n = 61440;
double x[n];
for(i=n-1; i >= 0; i--) {
  x[i] = 1;
}

I first thought that it was due to the loop accessing the n variable, and thus having to do more reads (as suggested here, for example: Why is iterating through an array backwards faster than forwards). But even if I change the n in the first loop to a hard-coded value, or conversely move the 0 in the bottom loop into a variable, the performance stays the same. I also tried to change the loops to only do half the work (going from 0 to < 30720, or from n-1 to >= 30720), to eliminate any special treatment of the 0 value, but the bottom loop is still faster (see the sketch below).
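
For reference, the half-work variants look like this (same setup as the snippets above):

int i, n = 61440;
double x[n];

/* forward over the lower half: 0 .. 30719 */
for(i = 0; i < 30720; i++) {
  x[i] = 1;
}

/* backward over the upper half: 61439 .. 30720 - still faster */
for(i = n - 1; i >= 30720; i--) {
  x[i] = 1;
}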

I assume it is because of some compiler optimisation? But everything I look up about the generated machine code suggests that `<` and `>=` ought to be equal.
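
One way to check is to compile both versions to assembly with the same flags as the Makefile below and diff the output (the file names here are hypothetical; -S tells gcc to stop after generating assembly):

gcc -std=c99 -O2 -fopenmp -S sblas_forward.c -o sblas_forward.s
gcc -std=c99 -O2 -fopenmp -S sblas_backward.c -o sblas_backward.s
diff sblas_forward.s sblas_backward.s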

Thankful for any hints or advice!

Edit: Makefile, for compiler details (this is part of a multi-threading exercise, hence the OpenMP, though in this case it's all running on 1 core, without any OpenMP instructions in the code):

#CC = gcc

CC = /opt/rh/devtoolset-2/root/usr/bin/gcc
OMP_FLAG = -fopenmp
CFLAGS = -std=c99 -O2 -c ${OMP_FLAG}
LFLAGS = -lm

.SUFFIXES : .o .c

.c.o:
    ${CC} ${CFLAGS} -o $@ $*.c

sblas:sblas.o
    ${CC} ${OMP_FLAG} -o $@ $@.o ${LFLAGS}

Edit2: I redid the experiment with n * 100 and got the same results: forward ~170s, backward ~120s. Similar to the previous values of 1.7s and 1.2s, just times 100.

Edit3: Minimal example - the changes described above were all localized to the vector update method. This is the default forward version, which takes longer than the backward version `for(i = limit - 1; i >= 0; i--)`:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

void vector_update(double a[], double b[], double x[], int limit);

/* SBLAS code */

int main(void) {
    int n = 1024*60;
    int nsteps = 1000;
    int k;

    double a[n], b[n], x[n];

    double vec_update_start;
    double vec_update_time = 0; 

    // Loop over the whole program to get a reasonable execution time
    // (simulates a time-stepping code)
    for(k = 0; k < nsteps; k++) {
        vec_update_start = omp_get_wtime();
        vector_update(a, b, x, n);
        vec_update_time = vec_update_time + (omp_get_wtime() - vec_update_start);
    }

    printf( "vector update time = %f seconds \n \n", vec_update_time);
}

void vector_update(double a[], double b[], double x[], int limit) {
    int i;
    for (i = 0; i < limit; i++) {
        x[i] = 0.0;
        a[i] = 3.142;
        b[i] = 3.142;
    }
}
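
For comparison, the faster backward version only changes the loop header (a sketch; the separate function name is mine, not from the original code):

void vector_update_backward(double a[], double b[], double x[], int limit) {
    int i;
    /* identical body, iterating from limit - 1 down to 0 */
    for (i = limit - 1; i >= 0; i--) {
        x[i] = 0.0;
        a[i] = 3.142;
        b[i] = 3.142;
    }
}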

Edit4: The CPU is an AMD quad-core Opteron 8378. The machine uses 4 of those, but I'm using only one core on the main processor (core ID 0 in the AMD architecture).

CrankMuffler

2 Answers


It's not the backward iteration itself but the comparison with zero that makes the loop in the second case run faster.

for(i=n-1; i >= 0; i--) {

On many architectures a comparison with zero comes for free: the decrement instruction itself sets the processor flags, so the loop can branch on them directly, whereas counting upwards needs a separate compare against the limit on every iteration.
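
As a rough sketch of what compilers typically emit on such architectures (not the actual output of the asker's gcc), the per-iteration loop overhead differs like this:

int i, n = 61440;
double x[n];

/* counting down: the decrement itself sets the flags,
   so the loop branch needs no separate compare */
for(i = n - 1; i >= 0; i--) {
  x[i] = 1;
}

/* counting up: the increment is followed by an explicit
   compare of i against n before the branch */
for(i = 0; i < n; i++) {
  x[i] = 1;
}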

Sandy
  • I have tried to have the loops do half the work, so it does not check against 0, and it is still faster: `for(i = n-1; i >= 30720; i--)` is still faster than `for(i = 0; i < 30720; i++)` – CrankMuffler Oct 16 '18 at 11:24
  • @CrankMuffler By "half the work" do you mean loop from n to n/2 instead of 0? – Sandy Oct 16 '18 at 11:27
  • These are 1980s/1990s optimizations, nowadays obsolete and "premature". Compilers should nowadays be able to generate the most efficient code without the programmer having to write the check against zero explicitly. So this doesn't explain anything. – Lundin Oct 16 '18 at 11:31

The main reason is that your compiler isn't very good at optimising. In theory there's no reason that a better compiler couldn't have converted both versions of your code into the exact same machine code instead of letting one be slower.

Everything beyond that depends on what the resulting machine code is and what it's running on. This can include differences in RAM and/or CPU speeds, differences in cache behaviour, differences in hardware prefetching (and the number of prefetchers), differences in instruction costs and instruction pipelining, differences in speculation, etc. Note that, in theory, this doesn't exclude the possibility that on most computers (but not on yours) the machine code your compiler generates for the forward loop is faster than the machine code it generates for the backward loop: your sample size of one machine isn't large enough to be statistically significant, unless you're working on embedded systems or game consoles where all computers that run the code are identical.
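
For example, you could disassemble the compiled object on the machine in question and compare the two loops (assuming GNU binutils is available; sblas.o is the object file built by the Makefile above):

objdump -d sblas.o > sblas.dump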

Brendan
  • I'm working on a machine that guarantees that once the program starts, that core will not be interrupted till it's done – CrankMuffler Oct 16 '18 at 13:35
  • @CrankMuffler: Erm, that's nice. Is it a modern 80x86 CPU where the hardware prefetcher is able to detect and prefetch from 3 streams in either direction (in addition to having "TLB hardware prefetch"); or a slightly older 80x86 CPU or slightly more exotic 80x86 CPU (e.g. Xeon Phi) or not 80x86 at all or...? – Brendan Oct 16 '18 at 13:42
  • 2
    @CrankMuffler: My point here is that to understand the low-level differences you'd need to look at a disassembly and you'd need to know which CPU (which model from which manufacturer). By choosing to use a high level language you've chosen to delegate all "micro-optimisation decisions" to a third party (the compiler), and to investigate what the third party is doing we need more than the high level source code. – Brendan Oct 16 '18 at 13:50
  • Yeah, I get it - just wanted to say that it's verifiable and not caused by random tasks interrupting the process – CrankMuffler Oct 16 '18 at 13:54