I have an algorithm, written in C, that processes two 2-dimensional input arrays to produce a third. All three arrays contain 32-bit floats and have the same size, Y x X, where Y might be a few tens but X is a million or so.
Unfortunately:
- all the arrays must be in row-major order (scanning through X accesses contiguous memory),
- the algorithm requires the innermost loop to scan over the Y dimension.
Perhaps unsurprisingly, accessing data in this non-contiguous fashion is relatively slow. So...
What can I do to mitigate the performance impact of the non-contiguous memory accesses?
(NB. It was a long shot, but I've tried various patterns of prefetch instructions to bring in upcoming columns, but all to no avail.)
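(For concreteness, the prefetch attempts looked roughly like the following sketch, using GCC's `__builtin_prefetch`. The distance `PF_DIST` and the locality hint are just two of the settings I varied, not known-good values.)

```c
#include <stdlib.h>

#define NX 1000000
#define NY 30

/* Same column-wise loop as in the demo below, with a software prefetch
   of the element PF_DIST columns ahead on each row. */
void process(const float *a, const float *b, float *c)
{
    const size_t PF_DIST = 8;  /* columns ahead; one of the distances tried */
    for (size_t x = 0; x < NX; x++) {
        float v = 1;
        for (size_t y = 0; y < NY; y++) {
            size_t offset = x + NX * y;
            if (x + PF_DIST < NX) {
                /* rw = 0 (read), locality = 0 (no temporal reuse) */
                __builtin_prefetch(&a[offset + PF_DIST], 0, 0);
                __builtin_prefetch(&b[offset + PF_DIST], 0, 0);
            }
            if (a[offset] < 0)
                v = 2;
            c[offset] = v * b[offset];
        }
    }
}
```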
The following (updated) code demonstrates the problem:
#include <stdio.h>
#include <stdlib.h>

#define NX 1000000
#define NY 30

int main(void) {
    /* Contents are left uninitialized: only the access pattern
       matters for the timing. */
    float *a = malloc(sizeof(float) * NY * NX);
    float *b = malloc(sizeof(float) * NY * NX);
    float *c = malloc(sizeof(float) * NY * NX);
    size_t y, x, offset;
    float v;

    for (x = 0; x < NX; x++) {
        v = 1;
        for (y = 0; y < NY; y++) {
            /* Stride of NX floats between consecutive accesses. */
            offset = x + NX * y;
            if (a[offset] < 0)
                v = 2;
            c[offset] = v * b[offset];
        }
    }

    free(a);
    free(b);
    free(c);
    return 0;
}
On a test machine with an E5520 CPU @ 2.27 GHz this takes ~1 s to execute, even though it's only reading ~220 MB and writing ~110 MB (roughly 0.3 GB/s, well below what the memory system can sustain for contiguous access).
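Update: one restructuring that occurred to me is to hoist the per-column state into a scratch buffer, so the loops can be interchanged and the inner loop scans x contiguously. A sketch is below (`vstate` is my own scratch array, not part of the original code); whether this generalises to the real algorithm is part of what I'm unsure about.

```c
#include <stdlib.h>

#define NX 1000000
#define NY 30

/* Loop interchange: carry one running state per column in a scratch
   buffer, so the inner loop scans x (contiguous memory) instead of y. */
void process_interchanged(const float *a, const float *b, float *c)
{
    float *vstate = malloc(sizeof(float) * NX);  /* one state per column */
    for (size_t x = 0; x < NX; x++)
        vstate[x] = 1;
    for (size_t y = 0; y < NY; y++) {
        const size_t row = NX * y;
        for (size_t x = 0; x < NX; x++) {
            if (a[row + x] < 0)
                vstate[x] = 2;
            c[row + x] = vstate[x] * b[row + x];
        }
    }
    free(vstate);
}
```

This computes the same result as the column-wise loop, at the cost of NX floats of extra memory and one extra pass over `vstate`.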