
I have an algorithm, written in C, which processes a couple of 2-dimensional arrays to produce another 2-dimensional array. All three arrays contain 32-bit floats and have the same size, Y x X, where Y might be a few tens but X is a million or so.

Unfortunately:

  • all the arrays must be in row-major order (scanning through X accesses contiguous memory),
  • the algorithm requires the innermost loop to scan over the Y dimension.

Perhaps unsurprisingly, accessing data in this non-contiguous fashion is relatively slow. So...

What can I do to mitigate the performance impact of the non-contiguous memory accesses?

(NB. It was a long shot, but I've tried various patterns of prefetch instructions to bring in upcoming columns, all to no avail; an illustration follows the code below.)

The following (updated) code demonstrates the problem:

#include <stdio.h>
#include <stdlib.h>

#define NX 1000000
#define NY 30

int main() {
    float *a = malloc(sizeof(float) * NY * NX);
    float *b = malloc(sizeof(float) * NY * NX);
    float *c = malloc(sizeof(float) * NY * NX);

    size_t y, x, offset;
    float v;

    /* The inner loop scans down a column: each step jumps NX floats
       (~4 MB) ahead in memory. */
    for(x=0; x<NX; x++) {
        v = 1;
        for(y=0; y<NY; y++) {
            offset = x + NX * y;
            if(a[offset] < 0) {
                v = 2;
            }
            c[offset] = v * b[offset];
        }
    }

    free(a);
    free(b);
    free(c);
}

On a test machine with an E5520 CPU @ 2.27 GHz this takes ~1 s to execute even though it's only reading ~220 MB and writing ~110 MB.
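To illustrate the prefetch attempts mentioned in the NB above: the patterns I tried were along the lines of the following, using GCC's __builtin_prefetch to request upcoming columns ahead of use. The prefetch distance shown is just one example; no combination of distances and rw/locality hints helped.

for(x=0; x<NX; x++) {
    v = 1;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Request the same rows 16 columns ahead; distances and
           rw/locality hints were varied, with no effect. */
        __builtin_prefetch(&a[offset + 16]);
        __builtin_prefetch(&b[offset + 16]);
        if(a[offset] < 0) {
            v = 2;
        }
        c[offset] = v * b[offset];
    }
}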

  • If you can break your data up into chunks/blocks with reasonable dimensions that allow them to coexist in the cache throughout the double loop, that can help. It depends, of course, on how your algorithm (and the problem) can merge the resultant blocks. For something as simple as the addition above, no further work is needed. – hesham_EE Jun 11 '15 at 15:25
  • Depends on what your loop body is, and what dependencies there are between iterations. Switching the loop order would be the best option, but might require some auxiliary storage. A blocked/tiled algorithm would be a good option as well, but you have to be careful about your tile size given the specific parameters of your machine (i.e. a tile size that works well in a 3 MB cache may not work so well when you run on a system with 512 KB of cache). I think we'd need more information on your actual algorithm, not the toy test case above... – twalberg Jun 11 '15 at 17:38
  • I've updated the example code to try to illustrate the column-order nature of the processing. – Colonel Mustard Jun 12 '15 at 14:45

1 Answer

It looks like your access pattern shouldn't be that harmful. This makes me wonder if branch prediction is your real problem.

Normally transposed data access is done in chunks to keep the cache healthy, but your input is so short on the inner-loop axis that the cached read of the first row should still be valid by the time you revisit it in your outer loop.

You have three arrays, each 30 elements high, with a cache line width of maybe 128 bytes (I expect smaller, but things change). That's only about 12 kB of cache needed for the lines touched at one x to still be resident when you revisit them at the next x.

You could try changing v to a small array and proceeding in vertical stripes, though. Even if this didn't actually help your cache utilisation, it would at least give a hint to the compiler that it could be optimised with SIMD.
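A minimal sketch of that idea (assuming, for simplicity, that NX is a multiple of the stripe width):

#define STRIPE 64  /* columns per stripe; tune for your cache */

for(size_t x0 = 0; x0 < NX; x0 += STRIPE) {
    float v[STRIPE];
    for(size_t i = 0; i < STRIPE; i++)
        v[i] = 1;

    for(size_t y = 0; y < NY; y++) {
        size_t row = NX * y + x0;
        /* The innermost loop now walks contiguous memory and each i is
           independent of the others, so the compiler can vectorise it. */
        for(size_t i = 0; i < STRIPE; i++) {
            if(a[row + i] < 0)
                v[i] = 2;
            c[row + i] = v[i] * b[row + i];
        }
    }
}

With these numbers a stripe touches only STRIPE * NY * 3 floats (about 23 kB), so the whole working set for one stripe stays resident across the y sweep.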

You might also try this dangerous optimisation to eliminate branches:

for(x=0; x<NX; x++) {
    uint32_t v = 0;  /* needs <stdint.h> */
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Move the sign bit (bit 31) down to the exponent's LSB (bit 23). */
        v |= (((uint32_t *)a)[offset] & 0x80000000) >> 8;
        /* Adding 1 to the exponent field multiplies the float by 2. */
        ((uint32_t *)c)[offset] = ((uint32_t *)b)[offset] + v;
    }
}

This does the arithmetic in the log domain: the sign bit of the floating-point value is shifted down to the exponent's least-significant bit and added directly to the exponent, on the assumption that the exponent won't overflow. It also assumes the in-memory float format is uint32_t-compatible.
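If the casts bother you (strictly speaking they violate C's aliasing rules), the same trick can be expressed with memcpy, which compilers reduce to ordinary loads and stores. A sketch, needing <stdint.h> and <string.h>:

for(x=0; x<NX; x++) {
    uint32_t v = 0;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        uint32_t ai, bi;
        memcpy(&ai, &a[offset], sizeof ai);  /* well-defined type pun */
        memcpy(&bi, &b[offset], sizeof bi);
        v |= (ai & 0x80000000u) >> 8;        /* sign bit -> exponent LSB */
        bi += v;                             /* exponent + 1 doubles the value */
        memcpy(&c[offset], &bi, sizeof bi);
    }
}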

– sh1