I'm studying for an exam and am trying to follow this problem: I have the following C code to do some array initialisation:
int i, n = 61440;
double x[n];
for(i=0; i < n; i++) {
x[i] = 1;
}
But the following runs faster (0.5s difference in 1000 iterations):
int i, n = 61440;
double x[n];
for(i=n-1; i >= 0; i--) {
x[i] = 1;
}
I first thought that it was due to the loop accessing the n variable, thus having to do more reads (as suggested here for example: Why is iterating through an array backwards faster than forwards). But even if I change the n in the first loop to a hard coded value, or vice versa move the 0 in the bottom loop to a variable, the performance remains the same. I also tried to change the loops to only do half the work (go from 0 to < 30720, or from n-1 to >= 30720), to eliminate any special treatment of the 0 value, but the bottom loop is still faster
I assume it is because of some compiler optimisations? But everything I look up for the generated machine code suggests, that < and >= ought to be equal.
Thankful for any hints or advice! Thank you!
Edit: Makefile, for compiler details (this is part of a multi threading exercise, hence the OpenMP, though for this case it's all running on 1 core, without any OpenMP instructions in the code)
#CC = gcc
CC = /opt/rh/devtoolset-2/root/usr/bin/gcc
OMP_FLAG = -fopenmp
CFLAGS = -std=c99 -O2 -c ${OMP_FLAG}
LFLAGS = -lm
.SUFFIXES : .o .c
.c.o:
${CC} ${CFLAGS} -o $@ $*.c
sblas:sblas.o
${CC} ${OMP_FLAG} -o $@ $@.o ${LFLAGS}
Edit2: I redid the experiment with n * 100, getting the same results: Forward: ~170s Backward: ~120s Similar to the previous values of 1.7s and 1.2s, just times 100
Edit3: Minimal Example - changes described above where all localized to the vector update method. This is the default forward version, which takes longer than the backwards version for(i = limit - 1; i >= 0; i--)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
void vector_update(double a[], double b[], double x[], int limit);
/* SBLAS code */
void *main() {
int n = 1024*60;
int nsteps = 1000;
int k;
double a[n], b[n], x[n];
double vec_update_start;
double vec_update_time = 0;
for(k = 0; k < nsteps; k++) {
// Loop over whole program to get reasonable execution time
// (simulates a time-steping code)
vec_update_start = omp_get_wtime();
vector_update(a, b, x, n);
vec_update_time = vec_update_time + (omp_get_wtime() - vec_update_start);
}
printf( "vector update time = %f seconds \n \n", vec_update_time);
}
void vector_update(double a[], double b[], double x[] ,int limit) {
int i;
for (i = 0; i < limit; i++ ) {
x[i] = 0.0;
a[i] = 3.142;
b[i] = 3.142;
}
}
Edit4: the CPU is AMD quad-core Opteron 8378. The machine uses 4 of those, but I'm using only one on the main processor (core ID 0 in the AMD architecture)