I don't know how closely your sample code matches your application, but if you are looping over rows like that, you are almost certainly running into cache problems. If I code your loops in row-major and column-major order, I see drastic performance differences.
With nrow=1000000 and ncol=1000, if I use array[i][0], I get a runtime of about 1.9 s. If I use array[0][i], it drops to 0.05 s.
If it's possible for you to transpose your data in this way, you should see a large performance boost.
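#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int nrow = 1000000, ncol = 1000, i;
    double **array, sum = 0.0;

#ifdef COL_MAJOR
    /* nrow separately allocated rows; only element 0 of each row is used,
       so successive accesses land in different allocations. */
    array = malloc(nrow * sizeof(double *));
    for (i = 0; i < nrow; i++) {
        array[i] = malloc(ncol * sizeof(double));
        array[i][0] = i;
    }
    for (i = 0; i < nrow; i++) {
        sum += array[i][0];
    }
    for (i = 0; i < nrow; i++) {
        array[i][0] /= sum;
    }
#else
    /* Transposed layout: the nrow values sit next to each other in
       array[0], so the loops walk memory sequentially. */
    array = malloc(ncol * sizeof(double *));
    for (i = 0; i < ncol; i++) {
        array[i] = malloc(nrow * sizeof(double));
    }
    for (i = 0; i < nrow; i++) {
        array[0][i] = i;
    }
    for (i = 0; i < nrow; i++) {
        sum += array[0][i];
    }
    for (i = 0; i < nrow; i++) {
        array[0][i] /= sum;
    }
#endif
    printf("%f\n", sum);
    return 0;
}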
$ gcc -DCOL_MAJOR -O2 -o normed normed.c
$ time ./normed
499999500000.000000
real 0m1.904s
user 0m0.325s
sys 0m1.575s
$ time ./normed
499999500000.000000
real 0m1.874s
user 0m0.304s
sys 0m1.567s
$ time ./normed
499999500000.000000
real 0m1.873s
user 0m0.296s
sys 0m1.573s
$ gcc -O2 -o normed normed.c
$ time ./normed
499999500000.000000
real 0m0.051s
user 0m0.017s
sys 0m0.024s
$ time ./normed
499999500000.000000
real 0m0.050s
user 0m0.017s
sys 0m0.023s
$ time ./normed
499999500000.000000
real 0m0.051s
user 0m0.014s
sys 0m0.022s
$