Optimizing C loops

Question

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.

My question is in three parts based on these expensive loops:

Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?

void spam(){
    double J[151][151];    
    /* Other relevant variables declared */
    calcJac(data,J,y);
    /* Use J */
}

static void calcJac(UserData data, double J[151][151],N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter=0; iter<151; iter++) {
        for (jter = 0; jter<151; jter++) {
            J[iter][jter] = 0;
        }
    }
   /* More code to populate J from data and y that runs very quickly */
}

During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the the slow component or is it something else about the loop?
```
for (iter = 1; iter<151; iter++) {
    for(jter = 1; jter<151; jter++){
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
```
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
```
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
    for (jter=1; jter<151; jter++) {
        Jv_scratch += J[iter][jter]*Ith(v,jter);
    }
    Ith(Jv,iter) = Jv_scratch;
    Jv_scratch = 0;
}
```

http://www.gnu.org/software/gsl/manual/html_node/GSL-CBLAS-Library.html, http://stackoverflow.com/questions/1965487/does-the-restrict-keyword-provide-significant-anti-aliasing-benefits-in-gcc-g — Anycorn, Feb 21 '11 at 09:15

score 4 · Accepted Answer · edited Feb 21 '11 at 09:00

4

1) No they're not you can memset the array as follows:

memset( J, 0, sizeof( double ) * 151 * 151 );

or you can use an array initialiser:

double J[151][151] = { 0.0 };

2) Well you are using a fairly complex calculation to calculate the position of P and the position of J.

You may well get better performance. by stepping through as pointers:

for (iter = 1; iter<151; iter++) 
{
    double* pP = (P - 1) + (151 * iter);
    double* pJ = data->J + (151 * iter);

    for(jter = 1; jter<151; jter++, pP++, pJ++ )
    {
         *pP = - gamma * *pJ;
    }
}

This way you move various of the array index calculation outside of the loop.

3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.

edited Feb 21 '11 at 09:00

David Heffernan

601,492
42
1,072
1,490

answered Feb 21 '11 at 08:54

Goz

61,365
24
124
204

@Goz i don't think your second method of inializing array eleements to zero would work double j[151][151]={0.0} would only inialize the j[0][0] to zero. – Algorithmist Feb 21 '11 at 08:56
@Algorithmist: did you try that? My experience is that using an array initializer initializes the complete array with the given value... – eckes Feb 21 '11 at 08:58
@eckes: He may be right. It may just be a compiler specific feature. It works on GCC and Visual Studio though ... I assumed, maybe wrongly, it was a feature of C99 ... – Goz Feb 21 '11 at 08:59
@Goz: I doubt the pointer trick makes any difference at all, as a modern compiler will do this automatically. – Oliver Charlesworth Feb 21 '11 at 09:03
@Goz, C99 §6.7.8/19: "The initialization shall occur in initializer list order, each initializer provided for a particular subobject overriding any previously listed initializer for the same subobject; _all subobjects that are not initialized explicitly shall be initialized implicitly the same as objects that have static storage duration_." (emph. mine) – bdonlan Feb 21 '11 at 09:03
I should also point out that this is typically compiled down to a `memset` call for large objects. – bdonlan Feb 21 '11 at 09:04
It sets the first element explicitly to zero, and the rest of the elements as if they had static allocation (ie zero as well). So you are both correct. – Lundin Feb 21 '11 at 09:12
I'd strongly advocate to use the initializer syntax for 1) since not all platforms might code `0.0` as all zero bits. The is an initializer syntax works platform independent. In most cases this will be realized as something similar to `memset`, but it is better to leave that up to the compiler. – Jens Gustedt Feb 21 '11 at 09:14
@Goz. Ok. Also tested that stuff always with GCC and VS and it works there and assumed it to be a standard behavior. Good to know that this isn't the case! – eckes Feb 21 '11 at 09:16
@Oli, dereferencing the pointer inside the loop may be his performance killer. Since `data` is declared as an opaque pointer, the compiler might well get lost on optimization. Just declaring something like `double (*const myJ)[151] = data->J;` outside the loops might already be enough to cure that. – Jens Gustedt Feb 21 '11 at 09:19
@Jens: You may be right, but I wouldn't understand why! `data` is not an opaque pointer, otherwise he wouldn't be able to dereference it to obtain `data-J`! – Oliver Charlesworth Feb 21 '11 at 09:39
@Jens: Regarding initialisers, `double J[151][151] = { 0.0 }` only sets the *first* element to `0.0`; all the rest are set to the zero bit pattern. – Oliver Charlesworth Feb 21 '11 at 09:42
@Oli: yea, opaque pointer was probably the wrong vocabulary. But it is probably not something that is restrict qualified, so potential aliasing might hinder constant propagation. – Jens Gustedt Feb 21 '11 at 09:45
@Oli: no, the standard explicitly states that all members that are not mentioned explicitly are initialized as if they were with an explicit `0`. The catch all initializer `{ 0 }` really is different from setting all bits to zero or things like that. It is the semantically correct initialization for all data types. – Jens Gustedt Feb 21 '11 at 09:49
@Jens: Yes, you are right about static initializers. I'm still not sure I agree with you about pointer optimization, though! – Oliver Charlesworth Feb 21 '11 at 10:37
@Oli, this is nothing you can agree or not. Only looking at the assembler can tell what a particular compiler is able to do. – Jens Gustedt Feb 21 '11 at 14:32

score 3 · Answer 2 · answered Feb 21 '11 at 08:57

First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.

First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:

static void calcJac(UserData data, double J[151][151],N_Vector y)
{
   memset((void*)J, 0, sizeof(double) * 151 * 151);
   /* More code to populate J from data and y that runs very quickly */
}

memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).

It seems that not all platforms have the value `0.0` implemented as all zero bit values, so `memset` is not as a good idea for floating point values. There is an initializer syntax that is foreseen for that that works platform independent. — Jens Gustedt, Feb 21 '11 at 09:11
Unfortunately, the initializer syntax can only be used at initial declaration though. And although it's not, strictly speaking, portable, most major platforms have `0.0 = 0x0000 0000 0000 0000` — bdonlan, Feb 21 '11 at 09:50

score 1 · Answer 3 · answered Feb 21 '11 at 09:07

Others have already answered some of your questions. On the subject of matrix multiplication; it is difficult to write a fast algorithm for this, unless you know a lot about cache architecture and so on (the slowness will be caused by the order that you access array elements causes thousands of cache misses).

You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.

score 0 · Answer 4 · answered Feb 21 '11 at 09:08

Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?

It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:

double J[151][151] = {0}; /* set first element to zero */

By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.

Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.

In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the the slow component or is it something else about the loop?

You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.

You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.

As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.

Optimizing C loops

4 Answers4