I'm implementing an MPI program in C which does SOR (successive over-relaxation) on a grid. When benchmarking it, I came across something quite unexpected, namely that the address-of operator & appears to be very slow. I can't show the entire code here, as it's too long, but the relevant parts are as follows.
double maxdiff, diff;
do {
    maxdiff = 0.0;
    /* inner loops updating maxdiff a lot */
    /* diff is used as a receive buffer here */
    MPI_Allreduce(&maxdiff, &diff, 1, MPI_DOUBLE, MPI_MAX, my_comm);
    maxdiff = diff;
} while (maxdiff > stopdiff);
Here, stopdiff is some magic value. The slow behaviour appears in the MPI_Allreduce() operation. The strange thing is that this operation is very slow even when running on just a single node, where no communication is needed at all. When I comment the operation out, the runtime for a particular problem on one node decreases from 290 seconds to just 225 seconds. Also, when I replace the operation with an MPI_Allreduce() call using other, bogus variables, I get 225 seconds as well. So it looks like it is specifically taking the addresses of maxdiff and diff which is causing the slowdown.
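For reference, the bogus-variable test looked roughly like this (dummy_send and dummy_recv are just names I'm using here for two otherwise unused doubles):

double dummy_send = 0.0, dummy_recv;
/* same call, but on throwaway variables instead of maxdiff/diff */
MPI_Allreduce(&dummy_send, &dummy_recv, 1, MPI_DOUBLE, MPI_MAX, my_comm);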
I updated the program to use two extra double variables as temporary send/receive buffers, as follows.
double send_buf, recv_buf;

send_buf = maxdiff;
MPI_Allreduce(&send_buf, &recv_buf, 1, MPI_DOUBLE, MPI_MAX, my_comm);
maxdiff = recv_buf;
This also made the program run in 225 seconds instead of 290. My question is, obviously, how can this be?
I do have a suspicion: the program is compiled using gcc with optimization level O3, so I suspect that the compiler is doing some optimization which makes the address-of operation very slow. For instance, perhaps the variables are kept in CPU registers because they are used so often in the loop, and have to be flushed back to memory whenever their address is requested. However, I can't seem to find out via googling what kind of optimization might cause this, and I'd like to be sure about the cause. Does anybody have an idea what might be causing this?
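In case it helps, here is a minimal standalone sketch of the two patterns I'm comparing, so the generated assembly can be inspected with gcc -O3 -S. Note that reduce_remote() is just a hypothetical stand-in for MPI_Allreduce(), used only so the sketch compiles on its own; it is not part of my real code.

/* Hypothetical stand-in for MPI_Allreduce(): takes the send and
   receive addresses, like the real call does. */
extern void reduce_remote(double *send, double *recv);

/* Pattern 1: the address of the loop accumulator itself escapes. */
double version_a(const double *a, int n)
{
    double maxdiff = 0.0, diff;
    for (int i = 0; i < n; i++)
        if (a[i] > maxdiff)
            maxdiff = a[i];
    reduce_remote(&maxdiff, &diff);
    return diff;
}

/* Pattern 2: the accumulator is copied into separate buffers first. */
double version_b(const double *a, int n)
{
    double maxdiff = 0.0, send_buf, recv_buf;
    for (int i = 0; i < n; i++)
        if (a[i] > maxdiff)
            maxdiff = a[i];
    send_buf = maxdiff;
    reduce_remote(&send_buf, &recv_buf);
    return recv_buf;
}

The idea is just to see whether gcc keeps maxdiff in a register inside the loop in one variant and in a stack slot in the other; whether such a small example even reproduces the effect is exactly what I'm unsure about.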
Thanks in advance!
I should add some other important information here. The specific problem being run fills up the memory pretty badly: it uses 3 GB of memory, and the nodes have a total of 4 GB of RAM. I also observe that the slowdown gets worse for larger problem sizes, as RAM fills up, so the amount of load on the RAM seems to be a factor. Also, strangely enough, when I add the MPI_Allreduce() just once after the loop instead of inside the loop, the slowdown is still there in the non-optimized version of the program, and it is still just as bad. The program does not run any faster that way.
As requested below, here is part of the gcc assembly output. Unfortunately, I don't have enough experience with assembly to spot the problem from it. This is the version with the added send and receive buffers, so the version which runs in 225 seconds rather than 290.
incl %r13d
cmpl $1, %r13d
jle .L394
movl 136(%rsp), %r9d
fldl 88(%rsp)
leaq 112(%rsp), %rsi
leaq 104(%rsp), %rdi
movl $100, %r8d
movl $11, %ecx
movl $1, %edx
fstpl 104(%rsp)
call MPI_Allreduce
fldl 112(%rsp)
incl 84(%rsp)
fstpl 40(%rsp)
movlpd 40(%rsp), %xmm3
ucomisd 96(%rsp), %xmm3
jbe .L415
movl 140(%rsp), %ebx
xorl %ebp, %ebp
jmp .L327
Here is what I believe is the corresponding part in the program without the extra send and receive buffers, so the version which runs in 290 seconds.
incl %r13d
cmpl $1, %r13d
jle .L314
movl 120(%rsp), %r9d
leaq 96(%rsp), %rsi
leaq 88(%rsp), %rdi
movl $100, %r8d
movl $11, %ecx
movl $1, %edx
call MPI_Allreduce
movlpd 96(%rsp), %xmm3
incl 76(%rsp)
ucomisd 80(%rsp), %xmm3
movsd %xmm3, 88(%rsp)
jbe .L381
movl 124(%rsp), %ebx
jmp .L204