
I'm developing a 2D numerical model in C++, and I would like to speed up a specific member function that is slowing down my code. The function loops over every i,j grid point in the model and performs a double summation over l and m at each point. The function is as follows:

int Class::Function(void) {
    double loadingEta;
    int i,j,l,m;

    //etaLatLen=64, etaLonLen=2*64
    //l_max = 12

    for (i=0; i<etaLatLen; i++) {
        for (j=0; j < etaLonLen; j++) {
            loadingEta = 0.0;
            for (l=0; l<l_max+1; l++) {
                for (m=0; m<=l; m++) {
                    loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
                }
            }
            etaNewArray[i][j] = loadingEta;
        }
    }

    return 1;
}

I've been trying to change the loop order to speed things up, but to no avail. Any help would be much appreciated. Thank you!

EDIT 1:

All five arrays are allocated in the constructor of my class as follows:

etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l=0; l<l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}

SH_C = new double*[l_max+1];
SH_S = new double*[l_max+1];
for (int i=0; i<l_max+1; i++) {
    SH_C[i] = new double[l_max+1]; 
    SH_S[i] = new double[l_max+1];
}

etaCosMLon = new double*[etaLonLen];
etaSinMLon = new double*[etaLonLen];
for (int j=0; j<etaLonLen; j++) {
    etaCosMLon[j] = new double[l_max+1];
    etaSinMLon[j] = new double[l_max+1];
}

Perhaps it would be better if these were 1D arrays instead of multidimensional?

  • Changing the loop order will not reduce the complexity. If you want to really speed things up, you might want to divide the work between multiple processes or threads, but that also has overhead. – JGroven Mar 02 '17 at 19:56
  • How are your arrays defined? You may be able to improve the cache-ability of your data. – user4581301 Mar 02 '17 at 20:08
  • Sounds like you're passing a 2D filter over a 2D grid. So transform into the frequency domain using KissFFT, convolve, and then convert back to the spatial domain. – Malcolm McLean Mar 02 '17 at 20:26
  • Thanks for your response. I've added code for how I allocated the arrays to the post. – planetaryHam Mar 02 '17 at 20:28

2 Answers


Hopping off into X-Y territory here. Rather than speeding up the algorithm, let's try and speed up data access.

etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l=0; l<l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}

Doesn't create a 3D array of doubles. It creates an array of pointers to arrays of pointers to arrays of doubles. Each of those arrays is its own block of memory, and who knows where it will end up sitting in storage. The result is a data structure with what is called "poor spatial locality": the pieces of the structure may be scattered all over the place, and with this simulated 3D array you hop through three different places just to find out where your value lives.

Because the many blocks of storage required to simulate the 3D array may be nowhere near each other, the CPU may not be able to effectively load the cache (high-speed memory) ahead of time, and it has to stop the useful work it's doing and wait on slower storage, probably RAM, far more often. Here is a good, high-level article on how much this can hurt performance.

On the other hand, if the whole array sits in one block of memory, i.e. it is "contiguous", the CPU can read larger chunks of the memory it needs, maybe all of it, into cache at once. Plus, if the compiler knows that all of the memory the program will use is in one big block, it can perform all sorts of groovy optimizations that will make your program even faster.

So how do we get a 3D array that's all one memory block? If the sizes are static, this is easy

double etaLegendreArray[SIZE1][SIZE2][SIZE3];

This doesn't look to be your case, so what you want to do is allocate a 1D array, because it will be one contiguous block of memory.

double *etaLegendreArray = new double[SIZE1*SIZE2*SIZE3];

and do the array indexing math by hand

etaLegendreArray[(x * SIZE2 + y) * SIZE3 + z] = data;

Looks like that ought to be slower with all the extra math, huh? Turns out the compiler is hiding math that looks a lot like that from you every time you use a []. You lose almost nothing, and certainly not as much as you lose with one unnecessary cache miss.
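
To make that concrete, here is roughly what the innermost expression from the question would look like with flattened 1D arrays. This is only a sketch: the *_flat names and the assumption of row-major layout with width l_max+1 are mine, not from the original code.

// Assumed layout: etaLegendreFlat is etaLatLen x (l_max+1) x (l_max+1),
// SH_C_flat/SH_S_flat are (l_max+1) x (l_max+1), etaCosMLon_flat/etaSinMLon_flat
// are etaLonLen x (l_max+1), all stored row-major in single contiguous blocks.
const int L = l_max + 1;
loadingEta += etaLegendreFlat[(i * L + l) * L + m] *
              (SH_C_flat[l * L + m] * etaCosMLon_flat[j * L + m] +
               SH_S_flat[l * L + m] * etaSinMLon_flat[j * L + m]);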

But it is insane to repeat that math all over the place; sooner or later you will screw it up, even if the drain on readability doesn't have you wishing for death first. So you really want to wrap the 1D array in a class that handles the indexing math for you. And once you do that, you might as well have that class handle the allocation and deallocation as well, so you can take advantage of all that RAII goodness. No more for loops of news and deletes all over the place. It's all wrapped up and tied with a bow.

Here is an example of a 2D Matrix class, easily extendable to 3D, that will take care of the basic functionality you probably need in a nice, predictable, and cache-friendly manner.
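
For reference, a minimal sketch of what such a wrapper could look like, extended to 3D and backed by std::vector. The names and details here are illustrative, not taken from the linked class:

#include <cstddef>
#include <vector>

class Array3D {
public:
    Array3D(std::size_t nx, std::size_t ny, std::size_t nz)
        : ny_(ny), nz_(nz), data_(nx * ny * nz) {}

    // Element access does the flat-index math in one place.
    double& operator()(std::size_t x, std::size_t y, std::size_t z) {
        return data_[(x * ny_ + y) * nz_ + z];
    }
    const double& operator()(std::size_t x, std::size_t y, std::size_t z) const {
        return data_[(x * ny_ + y) * nz_ + z];
    }

private:
    std::size_t ny_, nz_;       // nx is only needed at construction time
    std::vector<double> data_;  // one contiguous block, freed automatically (RAII)
};

// Usage (hypothetical): Array3D etaLegendre(etaLatLen, l_max + 1, l_max + 1);
// then etaLegendre(i, l, m) in place of etaLegendreArray[i][l][m].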

user4581301

If the CPU supports it and the compiler is optimizing enough, you might get some small gain out of the C99 fma (fused multiply-add) function, by converting some of your two-step operations (multiply, then add) into one-step operations. It would also improve accuracy, since you only suffer floating point rounding once for the fused operation, not once for the multiplication and again for the addition.

Assuming I'm reading it right, you could change your innermost loop's expression from:

loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);

to (note no use of += now, it's incorporated in fma):

loadingEta = fma(etaLegendreArray[i][l][m], fma(SH_C[l][m], etaCosMLon[j][m], SH_S[l][m]*etaSinMLon[j][m]), loadingEta);

I wouldn't expect anything magical performance-wise, but it might help a little (again, only with optimizations turned up enough for the compiler to inline hardware instructions to do the work; if it's calling a library function, you'll lose any improvement to the function call overhead). And again, it should improve accuracy a bit, by avoiding one of the two rounding steps you were incurring.

Mind you, some compilers, given appropriate compilation flags, will convert your original code to hardware FMA instructions for you; if that's an option, I'd go with it, since (as you can see) the fma function tends to reduce code readability.

Your compiler may offer vectorized versions of floating point instructions as well, which might meaningfully improve performance (see previous link on automatic conversion to FMA).

Most other improvements would require more information about the goal, the nature of the input arrays being used, and so on. Simple threading might gain you something; OpenMP pragmas might be worth a look as a way to simplify parallelizing the loop(s).
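
As an illustration of that last point: since each (i, j) grid point in the original function is computed independently, the two outer loops could in principle be parallelized with a single pragma. This is only a sketch of the idea; it assumes the arrays are read-only during the loop and that you build with OpenMP enabled (e.g. -fopenmp on g++):

// Sketch: same computation as the original function, with the outer loops
// split across threads. loadingEta and the loop indices are declared inside
// the loop so each thread gets its own copy.
#pragma omp parallel for collapse(2)
for (int i = 0; i < etaLatLen; i++) {
    for (int j = 0; j < etaLonLen; j++) {
        double loadingEta = 0.0;
        for (int l = 0; l < l_max + 1; l++) {
            for (int m = 0; m <= l; m++) {
                loadingEta += etaLegendreArray[i][l][m] *
                              (SH_C[l][m] * etaCosMLon[j][m] + SH_S[l][m] * etaSinMLon[j][m]);
            }
        }
        etaNewArray[i][j] = loadingEta;
    }
}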

ShadowRanger
  • Might also be worth mentioning the pitfalls involved in summing floating point numbers (ie: sort so smallest values are summed first). Depending on the architecture (ie: embedded), it might also be worthwhile to use fixed point normalization/summing if integer operations are significantly faster. – Cloud Mar 02 '17 at 20:28
  • Interesting, I wasn't aware of FMA. I'm compiling in g++6, so I'll see if FMA is part of the built-in optimization. Thanks! – planetaryHam Mar 02 '17 at 20:43