
I am using the Armadillo library. The part of my program that is too slow and that I need to speed up is the following

for(int q(0); q < Nk*Nk; q++){
    for(int k(0); k < Nk*Nk; k++){
        int kq = (k+q) % (Nk*Nk);
        cx_mat Gx = ((Eigveck.slice(k)).t())*(Vxk.slice(k)-Vxk.slice(kq))*Eigveck.slice(kq);
        cx_mat Gy = ((Eigveck.slice(k)).t())*(Vyk.slice(k)-Vyk.slice(kq))*Eigveck.slice(kq);
        vec ek = Eigvalk.col(k);
        vec ekq = Eigvalk.col(kq);
        for(int i(0); i < Ltot; i++){
            for(int j(0); j < Ltot; j++){
                chi = chi + (abs(Gx(i,j))*abs(Gx(i,j))+abs(Gy(i,j))*abs(Gy(i,j)))*(1.0/(1.0+exp(ekq(j)/T))-1.0/(1.0+exp(ek(i)/T)))*((ekq(j)-ek(i))/((ekq(j)-ek(i))*(ekq(j)-ek(i))+eta*eta))/(Nk*Nk);
            }
        }

    }
    double qx = (G1(0)*floor(q/Nk)/Nk+G2(0)*(q % Nk)/Nk);
    double qy = (G1(1)*floor(q/Nk)/Nk+G2(1)*(q % Nk)/Nk);

    lindhard << qx << "     " << qy << "     " << -chi << "    " << endl;
}

Before this part, I define a huge matrix, Eigvalk, and huge cubes, Eigveck, Vxk, and Vyk.

Now, accessing their values inside the for loop is extremely slow; it takes ages. The cubes contain eigenvectors and other quantities of a given problem. The thing is that for Nk=10 (a very small Nk, to test the code), it takes 0.1 seconds to compute the 47 eigenvectors at each of the Nk*Nk=100 points, yet 4.5 seconds to perform the loop shown above that uses them. I have checked that the part that costs the time is the call

cx_mat Gx = .....

I have also tried defining vectors or a huge cx_mat (by vectorising the matrices) instead of cx_cube, but nothing changed.

Is there a better way to solve this?

1 Answer


I don't see major errors. The order in which you traverse the matrices is fine.

I think your code could be computed efficiently in parallel using an OpenMP reduction, like this:

for(int q(0); q < Nk*Nk; q++){
    #pragma omp parallel for default(shared) reduction(+:chi)
    for(int k(0); k < Nk*Nk; k++){
        int kq = (k+q) % (Nk*Nk);
        cx_mat Gx = ((Eigveck.slice(k)).t())*(Vxk.slice(k)-Vxk.slice(kq))*Eigveck.slice(kq);
        cx_mat Gy = ((Eigveck.slice(k)).t())*(Vyk.slice(k)-Vyk.slice(kq))*Eigveck.slice(kq);
        vec ek = Eigvalk.col(k);
        vec ekq = Eigvalk.col(kq);
        for(int i(0); i < Ltot; i++){
            for(int j(0); j < Ltot; j++){
                chi = chi + (abs(Gx(i,j))*abs(Gx(i,j))+abs(Gy(i,j))*abs(Gy(i,j)))*(1.0/(1.0+exp(ekq(j)/T))-1.0/(1.0+exp(ek(i)/T)))*((ekq(j)-ek(i))/((ekq(j)-ek(i))*(ekq(j)-ek(i))+eta*eta))/(Nk*Nk);
            }
        }

    }
    double qx = (G1(0)*floor(q/Nk)/Nk+G2(0)*(q % Nk)/Nk);
    double qy = (G1(1)*floor(q/Nk)/Nk+G2(1)*(q % Nk)/Nk);

    lindhard << qx << "     " << qy << "     " << -chi << "    " << endl;
}
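
For this to take effect, you need to compile with OpenMP enabled (e.g. -fopenmp for GCC/Clang). The reduction clause gives each thread its own private copy of chi and sums the copies into the shared chi when the loop ends, so the result matches the serial version.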

P.S.

Perhaps you could also define some const local variables, such as

const auto delta = ekq(j)-ek(i);
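
For instance, here is a minimal sketch of the inner double loop with the repeated subexpressions hoisted into const locals (the names invNk2, fk, fkq, delta and w are illustrative, not from your code):

const double invNk2 = 1.0/(Nk*Nk);                   // hoisted out of all loops
for(int i(0); i < Ltot; i++){
    const double fk = 1.0/(1.0+exp(ek(i)/T));        // Fermi factor for ek(i), computed once per i
    for(int j(0); j < Ltot; j++){
        const double fkq   = 1.0/(1.0+exp(ekq(j)/T));  // Fermi factor for ekq(j)
        const double delta = ekq(j)-ek(i);             // energy difference
        const double w     = std::norm(Gx(i,j)) + std::norm(Gy(i,j)); // |Gx|^2 + |Gy|^2, needs <complex>
        chi += w*(fkq-fk)*delta/(delta*delta+eta*eta)*invNk2;
    }
}

This computes the same sum as your original one-line statement; with -O2 the compiler may already do much of this, but it also makes the expression far easier to read.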

How did you measure your hot spot?

What compiler options do you use? I assume you have a proper optimization level turned on, right?
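
In case a less brute-force measurement is useful: Armadillo ships a simple timer, arma::wall_clock, so a minimal sketch for timing just the suspected hot spot could look like this:

#include <armadillo>
#include <iostream>

arma::wall_clock timer;

timer.tic();
// ... the q/k loops from above go here ...
const double elapsed = timer.toc();   // elapsed wall-clock time in seconds
std::cout << "loop took " << elapsed << " s" << std::endl;

Remember to build with optimization (and OpenMP) enabled, e.g. g++ -O2 -fopenmp prog.cpp -larmadillo; the exact flags depend on your compiler and setup.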

schorsch312
  • From code inspection, that inner-most statement will run `Nk^4 * Ltot^2` times, but yes, a profiler could help a lot, or some type of measuring. – Chris O Feb 28 '18 at 13:31
  • Thank you very much for your feedback. I was also planning to parallelize the for loop, but first I wanted to be sure that the core was optimized. I measured the hot spot by selectively commenting out each part of the code (a bit brute force, but it works). Compiler option: -O2 – Luca Chirolli Feb 28 '18 at 13:48
  • 1
    I did a couple of more checks and what really consumes time is the matrix multiplication in Gx and Gy. Thanks again – Luca Chirolli Feb 28 '18 at 14:23