
I have implemented the Jacobi algorithm based on the routine described in the book Numerical Recipes, but since I plan to work with very large matrices I am trying to parallelize it using OpenMP.

#include <cmath>
#include <Eigen/Dense>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// Applies a single Jacobi rotation to the element pair a(i,j), a(k,l).
void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
    double g,h;
    g=a(i,j);
    h=a(k,l);
    a(i,j)=g-s*(h+g*tau);
    a(k,l)=h+s*(g-h*tau);
}

// Jacobi eigenvalue routine, following the jacobi routine from Numerical Recipes:
// computes the eigenvalues d and the eigenvectors v of the symmetric matrix a.
void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d)
{
    int j,iq,ip,i;
    double tresh,theta,tau,t,sm,s,h,g,c;

    VectorXd b(n);
    VectorXd z(n);

    v.setIdentity();
    z.setZero();

    // Initialise b and d with the diagonal of a.
    #pragma omp parallel for
    for (ip=0;ip<n;ip++)
    {
        d(ip)=a(ip,ip);
        b(ip)=d(ip);
    }

    // Up to 50 sweeps over the off-diagonal elements.
    for (i=0;i<50;i++)
    {
        // Sum the magnitudes of the off-diagonal elements.
        sm=0.0;
        for (ip=0;ip<n-1;ip++)
        {
            #pragma omp parallel for reduction (+:sm)
            for (iq=ip+1;iq<n;iq++)
                sm += fabs(a(ip,iq));
        }
        // Normal return: the off-diagonal part has converged to zero.
        if (sm == 0.0)
        {
            break;
        }

        if (i < 3)
            tresh=0.2*sm/(n*n);   // threshold for the first three sweeps...
        else
            tresh=0.0;            // ...and zero thereafter

        // iq and j must also be private: they are declared at function scope
        // and would otherwise be shared between the threads.
        #pragma omp parallel for private (ip,iq,j,g,h,t,theta,c,s,tau)
        for (ip=0;ip<n-1;ip++)
        {
        //#pragma omp parallel for private (g,h,t,theta,c,s,tau)
            for (iq=ip+1;iq<n;iq++)
            {
                g=100.0*fabs(a(ip,iq));
                // After four sweeps, skip the rotation if the off-diagonal element is negligible.
                if (i > 3 && (fabs(d(ip))+g) == fabs(d(ip)) && (fabs(d(iq))+g) == fabs(d(iq)))
                {
                    a(ip,iq)=0.0;
                }
                else if (fabs(a(ip,iq)) > tresh)
                {
                    h=d(iq)-d(ip);
                    if ((fabs(h)+g) == fabs(h))
                    {
                        t=(a(ip,iq))/h;
                    }
                    else
                    {
                        theta=0.5*h/(a(ip,iq));
                        t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
                        if (theta < 0.0)
                        {
                            t = -t;
                        }
                    }
                    // The rotation itself is applied for both branches above,
                    // as in the original Numerical Recipes routine.
                    c=1.0/sqrt(1+t*t);
                    s=t*c;
                    tau=s/(1.0+c);
                    h=t*a(ip,iq);

                    // These updates touch rows and columns that other threads
                    // may be working on at the same time, so they are serialised.
                    #pragma omp critical
                    {
                        z(ip)=z(ip)-h;
                        z(iq)=z(iq)+h;
                        d(ip)=d(ip)-h;
                        d(iq)=d(iq)+h;
                        a(ip,iq)=0.0;

                        for (j=0;j<ip;j++)
                            ROTATE(a,j,ip,j,iq,s,tau);
                        for (j=ip+1;j<iq;j++)
                            ROTATE(a,ip,j,j,iq,s,tau);
                        for (j=iq+1;j<n;j++)
                            ROTATE(a,ip,j,iq,j,s,tau);
                        for (j=0;j<n;j++)
                            ROTATE(v,j,ip,j,iq,s,tau);
                    }
                }
            }
        }

        // End of sweep: fold the accumulated corrections into b and d, and reset z
        // (this step is part of the Numerical Recipes routine).
        for (ip=0;ip<n;ip++)
        {
            b(ip)=b(ip)+z(ip);
            d(ip)=b(ip);
            z(ip)=0.0;
        }
    }
}

I wanted to parallelize the loop that does most of the calculations, and both of the pragmas shown in the code:

 //#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
 //#pragma omp parallel for private (g,h,t,theta,c,s,tau)

are my attempts at it. Unfortunately, both of them produce incorrect results. I suspect the problem may be in this block:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

because usually this sort of accumulation would need a reduction, but since each thread should be accessing a different part of the array, I am not certain whether that applies here.
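For illustration, this is the kind of protection I had in mind in case the accesses do overlap after all (just a sketch of the idea, not what is in the code above):

#include <Eigen/Dense>

// Sketch only: the four accumulations done with atomic updates on the raw
// buffers of z and d, so that two threads hitting the same index cannot race.
// ip, iq and h are assumed to be private to the calling thread.
void accumulate_atomic(Eigen::VectorXd &z, Eigen::VectorXd &d,
                       int ip, int iq, double h)
{
    double *zp = z.data();
    double *dp = d.data();

    #pragma omp atomic
    zp[ip] -= h;
    #pragma omp atomic
    zp[iq] += h;
    #pragma omp atomic
    dp[ip] -= h;
    #pragma omp atomic
    dp[iq] += h;
}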

I am not really sure if I am doing the parallelization in a correct manner, because I have only recently started working with OpenMP, so any suggestion or recommendation would also be welcome.

Sidenote: I know there are faster algorithms for eigenvalue and eigenvector determination, including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors, and this algorithm is.

My thanks in advance.

Edit: I consider the correct answer to be the one provided by The Quantum Physicist, because what I did does not reduce the computation time for systems of size up to 4096x4096. In any case I corrected the code above so that it works, and for big enough systems it could be of some use. I would advise using timers to test whether the

#pragma omp for

actually decreases the computation time.
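Something along these lines can be used for the timing (a minimal sketch, not my actual test harness):

#include <omp.h>
#include <cstdio>

// Minimal sketch: time one call to jacobi() with omp_get_wtime and print it,
// to compare the serial and the parallel builds. The matrices are assumed to
// be set up as in the code above.
double time_jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d)
{
    double t0 = omp_get_wtime();
    jacobi(n, a, v, d);
    double t1 = omp_get_wtime();
    std::printf("jacobi on a %dx%d matrix took %f s\n", n, n, t1 - t0);
    return t1 - t0;
}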

jcarvalho
  • I didn't confirm and run this code... but I would guess your `ROTATE` function has contention between threads. To prove it you can try putting a `#pragma omp critical { ... }` around it. Additionally, unless you need C89 compliance (you're using C++, so...), I personally find it much easier to reduce variables to the narrowest scope; it makes private/shared more semantically obvious. – cjhanks Oct 05 '16 at 18:05
  • Regarding your sidenote, it would be nice to figure out why `SelfAdjointEigenSolver` is not giving you the expected accuracy. To this end, could you either post/send me a self-contained example or one of your problematic matrices so that we can investigate on our side? (You'll find my email in almost all of Eigen's header files.) Thanks. – ggael Oct 06 '16 at 07:36
  • Meta-question: Why write your own? Why not use one from an existing, optimised, math-library. Intel MKL is gratis https://software.intel.com/en-us/articles/free-mkl and contains Jacobi functions. https://software.intel.com/en-us/node/522101 I don't know for sure that they're parallelized, but other parts of MKL certainly are. ("The best code is the code I don't have to write"). (FWIW I work for Intel, but not on MKL). – Jim Cownie Oct 06 '16 at 09:28
  • Thanks for your suggestion, I will check them out. I want to obtain the eigenvalues and eigenvectors, not solve a linear system, but maybe there is also something that does this in the Intel libraries. – jcarvalho Oct 06 '16 at 15:41

1 Answer

I'll try to help, but I'm not sure this is the answer to your question.

There are tons of problems with your code. My friendly advice for you is: Don't do parallel things if you don't understand the implications of what you're doing.

For some reason, it looks like you think that putting everything in a parallel #pragma for will make it faster. This is VERY wrong, because spawning threads is an expensive thing to do and costs (relatively) a lot of memory and time. So if you redo that #pragma for for every loop, you'll respawn threads for every loop, which will significantly reduce the speed of your program... UNLESS your matrices are REALLY huge and the computation time is much larger than the cost of spawning them.
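To illustrate the alternative (a rough sketch, not a drop-in fix for your code): open one parallel region around the whole computation and reuse the same team of threads with #pragma omp for inside it, instead of creating a new parallel region for every loop:

#include <vector>
#include <cmath>

// Rough sketch: one parallel region, several work-sharing loops inside it.
// The team of threads is set up once for the whole region instead of once per
// loop. The arrays and the number of sweeps are just placeholders.
void sweeps_example(std::vector<double> &x, std::vector<double> &y, int nsweeps)
{
    const int n = static_cast<int>(x.size());

    #pragma omp parallel
    {
        for (int sweep = 0; sweep < nsweeps; ++sweep)
        {
            // Each 'omp for' below reuses the threads of the enclosing region;
            // its implicit barrier keeps the sweeps in sync.
            #pragma omp for
            for (int k = 0; k < n; ++k)
                y[k] += std::fabs(x[k]);

            #pragma omp for
            for (int k = 0; k < n; ++k)
                x[k] = 0.5 * (x[k] + y[k]);
        }
    }
}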

I ran into a similar issue when I wanted to multiply huge matrices element-wise (and then I needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices into linear arrays, distribute the array chunks to the threads, and then run a for loop where every iteration uses elements that are guaranteed to be independent of the others, letting them all evolve independently. This was quite fast. Why? Because I never had to respawn threads twice.
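As an illustration of that idea (a simplified sketch, not my original code): the element-wise product and its sum over two flattened matrices can be done with a single parallel loop and a reduction, because every index is touched by exactly one thread:

#include <vector>

// Simplified illustration: element-wise product of two matrices stored as
// flat arrays, with the total accumulated through a reduction. No locking is
// needed because each index belongs to exactly one thread.
double elementwise_product_sum(const std::vector<double> &a,
                               const std::vector<double> &b,
                               std::vector<double> &c)
{
    const int n = static_cast<int>(a.size());
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; ++k)
    {
        c[k] = a[k] * b[k];
        sum += c[k];
    }
    return sum;
}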

Why are you getting wrong results? I believe the reason is that you're not respecting shared-memory rules. You have some variable(s) that are being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does the function z do? Does it take stuff by reference? What I see here:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

looks VERY thread-unsafe, and I don't understand what you're doing. Are you returning a reference that you then modify? This is a recipe for thread non-safety. Why don't you create clean arrays and deal with them instead of this?

How to debug: Start with a small example (a 2x2 matrix, maybe) and only 2 threads, and try to understand what's going on. Use a debugger, define breakpoints, and check what information is shared between the threads.

Also consider using a mutex to check what data gets ruined when it becomes shared.
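For instance, a rough sketch with std::mutex (the vector names here are placeholders, not your actual variables):

#include <mutex>
#include <vector>

std::mutex update_mutex;   // guards the shared vectors below

// Rough sketch: serialising the suspicious updates behind a std::mutex, so
// you can check whether protecting them changes the results (which would
// confirm a race). zvec and dvec stand in for your z and d.
void guarded_update(std::vector<double> &zvec, std::vector<double> &dvec,
                    int ip, int iq, double h)
{
    std::lock_guard<std::mutex> lock(update_mutex);
    zvec[ip] -= h;
    zvec[iq] += h;
    dvec[ip] -= h;
    dvec[iq] += h;
}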

My recommendation: Don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back when C++ didn't have any native multi-threading implementation. So learn how to use std::thread and use it, and if you need to run many things in threads, learn how to create a thread pool with std::thread. This is a good book for learning multithreading.
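As a starting point (a bare-bones sketch, not a thread pool yet): spawn a fixed set of std::threads once, give each one a contiguous chunk of the data, and join them at the end:

#include <cstddef>
#include <thread>
#include <vector>

// Bare-bones sketch: spawn the threads once and let each one process its own
// contiguous chunk of the data independently, then join them.
void process_in_threads(std::vector<double> &data, unsigned nthreads)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / nthreads;

    for (unsigned t = 0; t < nthreads; ++t)
    {
        const std::size_t begin = t * chunk;
        const std::size_t end   = (t + 1 == nthreads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, begin, end]() {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= 2.0;   // placeholder work on a private range
        });
    }
    for (auto &w : workers)
        w.join();
}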

The Quantum Physicist
  • OpenMP and C++ Thread don't accomplish the same objectives. Parallel extensions like OpenMP, Cilk++, OpenACC are designed to rewrite your C/C++ code targeted at a specific platform (co-processors, gpus, cpu/numa architectures). `std::thread` is designed for general purpose threading but would require you to be aware of your specific target for real optimization. – cjhanks Oct 05 '16 at 18:23
  • @cjhanks That's up for debate. I disagree with you. What I said at the end is just my opinion. – The Quantum Physicist Oct 05 '16 at 18:26
  • Matrices can be 4096x4096, no bigger than that. I know spawning different threads can be inefficient, but I planned on using timers to test the speed. As for your question, z is an array in Eigen, and what I am doing is accessing a coefficient. – jcarvalho Oct 05 '16 at 18:41
  • @jcarvalho This size is on the verge of being big, but it's not really that huge. See the math: every spawned thread takes on average about 1 MB of memory (learned that from the book I listed in my answer). That matrix will take 4096^2 doubles, which is about 120 MB of memory. So if you spawn 20 threads, your threads are about 17% as expensive as your data, hence the cost of spawning threads is not negligible. So I suggest that you structure your program in a way that avoids spawning threads multiple times. – The Quantum Physicist Oct 06 '16 at 08:47
  • Yeah, you were right. I got the code to work by putting both the z(ip)=z(ip)-h; part and the ROTATE() part in a critical block, but none of the parallelization increased the performance of the code, and some of it even drastically reduced it. The proper way of doing this would be, as you said, to reduce the number of times I spawn threads, but in this case I don't think it is worth the work. Thanks anyway. I edited the question to add the solution that fixed the problem, because for big enough systems this may be helpful. – jcarvalho Oct 06 '16 at 15:41
  • @jcarvalho When you make all your code critical, you're destroying any benefit of parallelization. You have to find the balance that will optimize your code. Parallelization is not easy in general, and you have to do it very patiently, or you'll get wrong results. My recommendation for your case is that you learn how to use std::thread, spawn the threads once, and use them repeatedly. Good luck! – The Quantum Physicist Oct 06 '16 at 16:22
  • Yes, I know, but that part of the code was not working very well; I suppose it was because of thread contention, like cjhanks said. Since the rest was parallelized, I hoped I would get a decrease in computation time, but as it turns out I didn't. – jcarvalho Oct 06 '16 at 17:37
  • The death of OpenMP won't happen for some years. It would require all embedded platforms, even the very, very small ones (a few MBits of memory), to support C++11 code, which is not yet the case. And that won't be soon, in my opinion. – baptiste Oct 07 '16 at 15:45