I'm implementing an MPI program in C which does SOR (successive overrelaxation) on a grid. While benchmarking it, I came across something quite unexpected, namely that the address-of operator & appears to be very slow. I can't show the entire code here since it's too long, but the relevant parts are as follows.

double maxdiff, diff;

do {
    maxdiff = 0.0;

    /* inner loops updating maxdiff a lot */

    /* diff is used as a receive buffer here */
    MPI_Allreduce(&maxdiff, &diff, 1, MPI_DOUBLE, MPI_MAX, my_comm);
    maxdiff = diff;
} while(maxdiff > stopdiff);

Here, stopdiff is some magic value. The slow behaviour appears in the MPI_Allreduce() operation. The strange thing is that the operation is very slow even when running on just a single node, even though no communication is needed in that case. When I comment the operation out, the runtime for a particular problem on one node decreases from 290 seconds to just 225 seconds. Also, when I replace the operation with an MPI_Allreduce() call using other, bogus variables, I get 225 seconds as well. So it looks like it is specifically taking the addresses of maxdiff and diff which is causing the slowdown.

I updated the program to use two extra double variables as temporary send/receive buffers, as follows.

send_buf = maxdiff;
MPI_Allreduce(&send_buf, &recv_buf, 1, MPI_DOUBLE, MPI_MAX, my_comm);
maxdiff = recv_buf;

This also made the program run in 225 seconds instead of 290. My question is, obviously, how can this be?

I do have a suspicion: the program is compiled using gcc with optimization level O3, so I suspect that the compiler is doing some optimization which is making the reference operation very slow. For instance, perhaps the variables are kept in CPU registers because they are used so often in the loop, and they have to be flushed back to memory whenever their address is requested. However, I can't seem to find out via googling what kind of optimization might cause this, and I'd like to be sure about the cause. Does anybody have an idea what might be causing this?
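To illustrate the kind of effect I have in mind, here is a toy example (not my actual code; sum_array and external_call are just made-up names):

void external_call(double *p);      /* stand-in for any function taking a pointer */

double sum_array(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];                /* sum can stay in a register during the loop */
    external_call(&sum);            /* taking its address forces it out to a stack slot here */
    return sum;
}

I don't know whether an effect like this could explain a 65-second difference, though, which is why I'm asking.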

Thanks in advance!

I should add some other important information here. The specific problem being run fills up the memory pretty badly: it uses 3GB of memory, and the nodes have a total of 4GB RAM. I also observe that the slowdown gets worse for larger problem sizes, as RAM fills up, so the amount of load on the RAM seems to be a factor in the problem. Also, strangely enough, when I add the MPI_Allreduce() just once after the loop instead of inside the loop, the slowdown is still there in the non-optimized version of the program, and it is still just as bad; the program does not run any faster that way.

As requested below, here is part of the gcc assembly output. Unfortunately, I don't have enough experience with assembly to spot the problem from this. This is the version with the added send and receive buffers, so the one which runs in 225 seconds rather than 290.

    incl    %r13d
    cmpl    $1, %r13d
    jle     .L394
    movl    136(%rsp), %r9d
    fldl    88(%rsp)
    leaq    112(%rsp), %rsi
    leaq    104(%rsp), %rdi
    movl    $100, %r8d
    movl    $11, %ecx
    movl    $1, %edx
    fstpl   104(%rsp)
    call    MPI_Allreduce
    fldl    112(%rsp)
    incl    84(%rsp)
    fstpl   40(%rsp)
    movlpd  40(%rsp), %xmm3
    ucomisd 96(%rsp), %xmm3
    jbe     .L415
    movl    140(%rsp), %ebx
    xorl    %ebp, %ebp
    jmp     .L327

Here is what I believe is the corresponding part in the program without the extra send and receive buffers, so the version which runs in 290 seconds.

    incl    %r13d
    cmpl    $1, %r13d
    jle     .L314
    movl    120(%rsp), %r9d
    leaq    96(%rsp), %rsi
    leaq    88(%rsp), %rdi
    movl    $100, %r8d
    movl    $11, %ecx
    movl    $1, %edx
    call    MPI_Allreduce
    movlpd  96(%rsp), %xmm3
    incl    76(%rsp)
    ucomisd 80(%rsp), %xmm3
    movsd   %xmm3, 88(%rsp)
    jbe     .L381
    movl    124(%rsp), %ebx
    jmp     .L204
Jack Ryan
  • Note that "`&`" is the "address-of" operator, not the "reference" operator. – Oliver Charlesworth Jan 28 '12 at 18:18
  • I thought "address-of" and "reference" operator were the same thing. As opposed to the "dereference" operator `*`. I could be wrong though. – Jack Ryan Jan 28 '12 at 18:29
  • Note: you are setting maxdiff to 0.0 on every iteration. I'd expect to see `maxdiff = 0.0` before the `do {` line. – wildplasser Jan 28 '12 at 18:31
  • That's correct, the maxdiff variable is the maximum grid difference encountered in that iteration of the algorithm, so it should be set to 0.0 on every iteration. – Jack Ryan Jan 28 '12 at 19:13
  • You're guessing. Just get a *[stackshot](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024)*. – Mike Dunlavey Jan 28 '12 at 20:02
  • where are `maxdiff` and `diff` defined? are they local variables or global variables? if they are global, what happens if you make them static? – CAFxX Jan 28 '12 at 20:06
  • The clock count for the function calls is probably somewhere between 10 and 100 ticks. In both versions. What happens *inside* the function is more expensive (sending/receiving messages to *all* slaves), plus waiting until *all of them* have answered. I'd suggest compiling with -pg (if you are using gcc) and running gprof on it. – wildplasser Jan 28 '12 at 20:10
  • What happens if you change `maxdiff = diff;} while(maxdiff > stopdiff);` to `} while (diff > stopdiff);`? – Jonathan Dursi Jan 28 '12 at 23:33

4 Answers

This sounds a bit unlikely to me. Getting the address of some double should actually be rather fast.

If you still suspect it, how about getting the addresses only once?

double maxdiff, diff;
double *pmaxdiff = &maxdiff, *pdiff = &diff;

...

MPI_Allreduce(pmaxdiff, pdiff, 1, MPI_DOUBLE, MPI_MAX, my_comm);

...

Overall, I'd suspect the slowdown happens somewhere else, but give it a try.

Mario
  • I know, it seemed quite unlikely to me too. However, fact is that using the temporary send and receive buffers, the program runs in 225 seconds, and without them it runs in 290 seconds. I didn't change anything else. – Jack Ryan Jan 28 '12 at 18:28

I'd suggest looking at the generated assembly and/or posting it here.

You can get the assembly using `gcc -S <source>`.

Michael Pankov
  • In response to your comment, I did analyze the assembly. I'm no assembly expert, but the main difference I see is that the fast version appears to use specialized floating point instructions, whereas the slow version does not. This made me consider another possibility, namely that the use of the address-of operator was actually preventing the compiler from doing some optimization, which it could do with the added send/receive buffers there. – Jack Ryan Jan 28 '12 at 19:56
  • @JackRyan How are the FP instructions used in address computation? Can you say which ones they are? – Michael Pankov Jan 28 '12 at 22:00

I recommend using a performance analysis tool[1] for MPI programs to get a better understanding of what is happening. I would guess that it is the actual MPI_Allreduce call that takes longer in the different versions of your code, rather than the address computation. As you mentioned, memory is critical here, so PAPI counters (cache misses etc.) might give a hint towards the issue.
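As a quick first check before reaching for a full profiler, you could also time the call itself with MPI_Wtime (just a sketch, reusing the variable names from the question) and compare the accumulated time between the two versions:

double t_allreduce = 0.0;           /* total time spent inside MPI_Allreduce */
double t0;

do {
    maxdiff = 0.0;

    /* inner loops updating maxdiff */

    t0 = MPI_Wtime();
    MPI_Allreduce(&maxdiff, &diff, 1, MPI_DOUBLE, MPI_MAX, my_comm);
    t_allreduce += MPI_Wtime() - t0;

    maxdiff = diff;
} while (maxdiff > stopdiff);

printf("time in MPI_Allreduce: %f s\n", t_allreduce);   /* needs <stdio.h> */

If the missing 65 seconds show up inside the call, that points at what happens during the call (memory traffic, cache misses) rather than at taking the addresses.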

[1] such as:

Zulan

I have a suspicion that it's not taking the address that's the problem, but rather what that address ends up being. Here's what I mean.

I'm assuming your code doesn't touch the diff variable until the MPI_Allreduce call, and diff just happens to live on a separate cache line from the other variables. Because of the large data size for the problem, the cache line containing diff is evicted from the cache by the time of the call. Now MPI_Allreduce performs a write to diff. Intel's CPUs use a write-allocate policy, meaning that prior to a write they will perform a read to bring the line into the cache.

The temporary variables, on the other hand, are probably sharing a cache line with something else that is used locally, so the write does not result in a cache miss.
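One rough way to check this part of the theory (just a sketch; it assumes 64-byte cache lines, which is typical for Intel CPUs, and needs <stdio.h> and <stdint.h>):

/* Does diff share a cache line with the hot maxdiff variable? */
printf("&maxdiff = %p, &diff = %p, same 64-byte line: %d\n",
       (void *) &maxdiff, (void *) &diff,
       (int) ((uintptr_t) &maxdiff / 64 == (uintptr_t) &diff / 64));

If they are on different lines, the write-allocate miss on diff becomes a plausible explanation.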

Try the following: replace

MPI_Allreduce(&maxdiff, &diff, 1, MPI_DOUBLE, MPI_MAX, my_comm);
maxdiff = diff;

with

/* maxdiff serves as both send and receive buffer, so diff is no longer needed */
MPI_Allreduce(MPI_IN_PLACE, &maxdiff, 1, MPI_DOUBLE, MPI_MAX, my_comm);

It's just a theory though.

Greg Inozemtsev