Optimizing C loops to get diagonal of array

Question

The Great God Google has not been forthcoming to me with an explanation for some loop optimization issues. So, in sadness that I have insufficient Google-fu, I turn to you StackOverflow.

I'm optimizing a C program for solving a specific system of differential equations. During the course of finding the numerical solution I call a function that sets up a linear system of equations and then a function that solves it.

The solution function originally had a bottleneck during the access of the elements on the diagonal of the array that defines the linear system. So I included a 1-D array that is set during the initialization of the system that holds the values along the diagonal of the array.

For fun I continued to play with the code that initialized the diagonal elements, measuring the time it took and trying to continually improve the code. The versions I tried led to a few questions:

Note: I put all the versions I tried into one function and profiled this function to see where the time was being spent. I'll report execution time for a version in percent of total time in function. The function was evaluated several million times. Smaller number is better.

Relevant declarations of data used in code:

/* quick definitions of the relevant variables, data is a struct */

static const int sp_diag_ind[98] = {2,12,23,76,120,129,137,142,.../* long list */};

double *spJ = &(data->spJ[0]);
/* data has double spJ[908] that represents a sparse matrix stored in triplet
*  form, I grab the pointer because I've found it to be more 
*  efficient than referencing data->spJ[x] each time I need it
*/

int iter,jter;
double *diag_data = NV_DATA_S(data->J_diag);
/* data->J_diag has a content field that has an array double diag_data[150]
*  NV_DATA_S is a macro to return the pointer to the relevant data
*/

My original loop for initializing diag_data. Timing was 16.1% of evaluation (see note).

/* try 1 */
for (iter = 0; iter<3; iter++) {
    diag_data[iter] = 0; 
}
jter = 0;
for (iter = 3; iter<101; iter++) { // unaligned loop start
    diag_data[iter] = spJ[sp_diag_ind[jter]];
    jter++; // heavy line for loop
}

for (iter = 101; iter<150; iter++) {
    diag_data[iter] = 0; 
}

To summarize, we grab the pointer to the diagonal, set the some components to zero (this is not optional based on the algorithm I'm using), then grab the values that reside on the diagonal of the "array" represented in sparse form by spJ. Since spJ is a 1-D array of the 908 non-zeros of a (mostly zero) 150x150 array we must use a lookup to find the positions of the diagonal elements in spJ. This lookup is defined by the 98 element array sp_diag_ind.

I tried to remove the use of jter because it showed up as not free to increment. The middle loop of my second try:

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This improved things a bit. Timing was 15.6% for this version. But when I look at the Shark analysis of this code (tool that comes with XCode on a Mac), it warns me that this is an unaligned loop.

The third attempt to improve was by removing the "zeroing" loops and using memset to zero diag_data:

memset(diag_data, '\0', sizeof(diag_data));

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This was timed at 14.9%. Not being sure what an unaligned loop is, I continued to fiddle. I found an improved, fourth implementation, doing the alignment offset between diag_data and spJ[crazy index] with a pointer:

realtype * diag_mask = &diag_data[3];
for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_mask[iter] = spJ[sp_diag_ind[iter]];
}

Using diag_mask allowed for a small improvement in speed. It came in at 13.1%.

Edit: Turns out this section was sillier than I had originally thought. The use of iter is undefined. Props to @caf and @rlibby for catching it.

Lastly, I then I tried something I thought was silly:

memset(diag_data, '\0', sizeof(diag_data));

for (iter = 0; iter<98;) {
    diag_mask[iter] = spJ[sp_diag_ind[iter++]];
}

This was timed at 10.9%. Also, Shark doesn't issue the unaligned loop warning when I look at the annotated source code. End silly section

So, my questions:

What's an unaligned loop?
Why is the fifth implementation aligned and the fourth not?
Is having an aligned loop responsible for the improvement in execution speed between my fourth and fifth implementations, or is embedding the increment step in the lookup of the value of sp_diag_ind responsible?
Do you see any other improvements I can make?

Thanks for the help.

--Andrew

Jeremiah Willcock · Accepted Answer · 2011-02-26T03:26:03.317

An unaligned loop is one where the first instruction doesn't start on a particular boundary (multiple of 16 or 32). There should be a compiler flag to align loops; it may or may not help performance. Whether a loop is aligned or not without the flag is based on exactly which instructions come before it, so it's not predictable. One other optimization you could try is to mark diag_mask, spJ, and sp_diag_ind as restrict (a C99 feature). That indicates that they do not alias and might help the compiler to optimize the loop better. A count of 98 will probably be too small to see any effect, though.

caf · Answer 2 · 2011-02-26T03:58:37.050

1

Your fifth version is incorrect - it has undefined behaviour, because it both modifies iter and references its value, for a purpose other than calculating the new value, without an intervening sequence point.

Have you tried storing the actual values of the diagonals, rather than their indexes within spJ, at the point that you calculate sp_diag_ind[]? Then you could just copy them directly into diag_data (or, even better, use the vector of diagonals directly).

The relevant part of the C standard is §6.5 Expressions:

'2. Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be read only to determine the value to be stored.

This applies to the object iter in your expression. Violation of a "shall" constraint is undefined behaviour.

gcc (tested with version 4.4.5) even warns about your expression:

x.c:16: warning: operation on ‘iter’ may be undefined

edited Feb 26 '11 at 03:58

answered Feb 26 '11 at 03:43

caf

233,326
40
323
462

Are you sure it's undefined? I was under the impression that (at least in C) ++i used in an expression would increment i then use its value and i++ in an expression would use the value of i then increment it. Also, after playing around with a couple of versions of this, including ones where I would use different combinations of +2 or +3 for the off-set and increment iter at different points in the expression before and after using it (by my notions), gcc-4.2 at least appears to abide by my understanding. – Sevenless Feb 26 '11 at 03:50
@Sevenless, yes it is undefined. The problem is that you use `i` on the left hand side of the assignment expression and `i++` on the right hand side of the assignment expression. C does not define the order in which these two expressions are evaluated. Since the result of the computation depends on the undefined order of the evaluation of the expressions, the result is undefined. – rlibby Feb 26 '11 at 03:56
@Sevenless: It is incremented at some time before the next sequence point - so, apart from anything else, you don't know whether the first `[iter]` will use the old or the new value. But that aside, the entire expression is simply undefined for the reason I've given - you can't modify a variable and access its value for an unrelated reason without an intervening sequence point. I've added the relevant part of the standard to my answer. – caf Feb 26 '11 at 03:58
@rlibby Thanks, I had not realized the assignment didn't proceed evaluation of the expression. Glad to learn that bit. – Sevenless Feb 26 '11 at 04:00
Question is edited to mark the bad code as such. Thanks again. – Sevenless Feb 26 '11 at 04:05

score 1 · Answer 3 · answered Feb 26 '11 at 04:16

1

Do you see any other improvements I can make?

You're tuning the daylights out of something that uses about 11% of the time. Is there nothing in the other 89% that could be optimized?

answered Feb 26 '11 at 04:16

Mike Dunlavey

40,059
14
91
135

The percentages were the times spent in each version I produced. That is to say foo(){ version1()/* Takes 16% of foo*/; version2()/* Takes 12% of foo*/; ...}. Once I'm not comparing versions I use only one version inside foo(), but the time in foo is large relative to other functions I've not gotten to. Also, this loop takes ~85% of foo() when only using one version. Sorry if this was not clear in the Q. – Sevenless Feb 26 '11 at 04:34
@Sevenless: FWIW, I use the random-pausing method which doesn't give precise percents but does pinpoint what code to optimize, namely statements or function call sites that appear on two or more samples. Then I measure overall improvement by top-level timing. When I get down to a point of diminishing returns, I consider the program close to optimal. I might be able to shave it a bit here and there, with more effort, but there is an optimal point, and I'm probably close to it. – Mike Dunlavey Feb 26 '11 at 14:39
@Sevenless: I just did this recently on an ODE system. When using an RK solver, first it was getting a lot of hits in the derivative function. I found it doing things in there that could be moved out or memoized. That gave a good speedup, after which statements in the RK solver itself took a larger percentage (like half). Then going to -O3 in the compiler made a sizeable difference. I let the samples tell me where the most benefit is to be found. – Mike Dunlavey Feb 26 '11 at 14:46

Optimizing C loops to get diagonal of array

3 Answers3