
I have recently written a parallel SVD decomposition routine based on a "one-sided Jacobi rotations" algorithm. The code works correctly but is tremendously slow. It should exploit the parallelism of the inner for loop for(int g=0;g<n;g++), but when I comment out the #pragma omp parallel for directive I see only a very slight decrease in performance. In other words, there is no appreciable speed-up from going parallel (the code does run in parallel with 4 threads).

Note 1: almost all the work is concentrated in the following three loops, which involve the relatively large matrices A and V.

for(h=0;h<N;h++)
{
    p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
    qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
    qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}

and

double Ahi,Vhi;
for(h=0;h<N;h++)//...rotate Ai & Aj (only columns i & j are changed)
{
    Ahi=A[h+N*i];
    A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
    A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
    Vhi=V[h+N*i];
    V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
    V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}

All the parallelism should be exploited there, but it is not, and I can't understand why.

Note 2: the same happens on both Windows (Cygwin GCC) and Linux (GCC).

Note 3: matrices are stored as column-major arrays.
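
For reference, a minimal illustration of that convention (the helper below is purely for exposition, not part of the routine):

// column-major storage: element (i,j) of an NxN matrix M is M[i+N*j],
// so each column is contiguous in memory
inline double elem(const double* M,int N,int i,int j){ return M[i+N*j]; }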

So I'm looking for some help in finding out why the parallelism is not exploited. Did I miss something? Is there some hidden overhead in the parallel for that I cannot see?

Thank you very much for any suggestions.

int sweep(double* A,double*V,int N,double tol)
{
static int*I=new int[(int)ceil(0.5*(N-1))];//allocated once (static): assumes N does not change between calls
static int*J=new int[(int)ceil(0.5*(N-1))];
int ntol=0;
for(int r=0;r<N;r++) //fill in i,j indexes of parallel rotations in vectors I & J
{
    int k=r+1;
    if (k==N)
    {
    for(int i=2;i<=(int)ceil(0.5*N);i++){
        I[i-2]=i-1;
        J[i-2]=N+2-i-1;
        }
    }
    else
    {
        for(int i=1;i<=(int)ceil(0.5*(N-k));i++)I[i-1]=i-1;
        for(int i=1;i<=(int)ceil(0.5*(N-k));i++)J[i-1]=N-k+2-i-1;
        if(k>2)
        {
            int j=(int)ceil(0.5*(N-k));
            for(int i=N-k+2;i<=N-(int)floor(0.5*k);i++){
                I[j]=i-1;
                J[j]=2*N-k+2-i-1;
                j++;
            }
        }
    }

    int n=(k%2==0)?(int)floor(0.5*(N-1)):(int)floor(0.5*N);

    #pragma omp parallel for schedule(dynamic,5) reduction(+:ntol)  default(none)   shared(std::cout,I,J,A,V,N,n,tol)
    for(int g=0;g<n;g++)
    {
        int i=I[g];
        int j=J[g];
        double p=0;
        double qi=0;
        double qj=0;
        double cs,sn,q,c;
        int h;
        for(h=0;h<N;h++)
        {
            p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
            qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
            qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
        }
        q=qi-qj;
        if(p*p/(qi*qj)<tol) ntol++; //if Ai & Aj are orthogonal enough...
        else                        //if Ai & Aj are not orthogonal enough then... rotate them
        {
            c=sqrt(4*p*p+q*q);
            if(q>=0){
                cs=sqrt((c+q)/(2*c));
                sn=p/(c*cs);
            }
            else{
                sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
                cs=p/(c*sn);
            }
            //...rotate Ai & Aj (only columns i & j are changed)
            double Ahi,Vhi;
            for(h=0;h<N;h++)
            {
                Ahi=A[h+N*i];
                A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
                A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
            }
            //store & update rotation matrix V (only columns i & j are updated)
            for(h=0;h<N;h++)
            {
                Vhi=V[h+N*i];
                V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
                V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
            }
        }
    }
}
if(2*ntol==(N*(N-1)))return(1);//if each column of A is orthogonal enough to every other, stop sweeping

return(0);
}
Teol11
  • How did you measure the performance? What do the results look like? – Vladimir F Героям слава Feb 17 '15 at 22:38
  • Before I read through your code in more detail, can you confirm that you don't have a race condition? Does the parallel version get the correct result? Are `h+N*i` and `h+N*j` unique for each thread? – Z boson Feb 18 '15 at 07:33
  • Can you also state what your hardware is and your compile options? – Z boson Feb 18 '15 at 07:33
  • As far as performance is concerned, I simply let the code run with 1000x1000 matrices (my target size). The serial code took 64 seconds, the parallel 49 seconds. I can exclude race conditions, as parallel operations take place on different columns of the A and V matrices at a time (as Z boson points out, i & j are unique for each thread), and I verified the results, if only for small problems (not 1000x1000). My hardware is an Intel quad-core i7 and I compiled with g++ -fopenmp -msse4 -O3. I also ran the test on Ubuntu (Intel i3, same compilation options): no substantial gain from parallelism. – Teol11 Feb 18 '15 at 08:12
  • For completeness: the sweep routine is called many times (in my 1000x1000 test, 28 times) until it returns 1, i.e. until the A matrix is SVD factorized. – Teol11 Feb 18 '15 at 08:47
  • Do you know the time complexity of this algorithm? If it's O(n^2) then it's probably memory bandwidth bound and multiple threads won't help much. What level is SVD in BLAS (level 1, 2, or 3)? – Z boson Feb 18 '15 at 11:03
  • I guess you have tuned this, but what is the performance when you change `schedule(dynamic,5)` to `schedule(static)`? – Z boson Feb 18 '15 at 12:21

2 Answers


Thanks to Z boson's remarks I managed to write a far better performing parallel SVD decomposition. It runs many times faster than the original one, and I guess it could still be improved by the use of SIMD instructions. I post the code in case anyone finds it of use. All the tests I performed gave correct results, but in any case there is no warranty for its use. I'm sorry I have not had the time to comment the code properly for better readability.

#include <math.h>

int sweep(double* A,double*V,int N,double tol,int M,int n)
{
/********************************************************************************************
This routine performs a parallel "sweep" of the SVD factorization algorithm for the matrix A.
It implements a single sided Jacobi rotation algorithm, described by Michael W. Berry,
Dani Mezher, Bernard Philippe, and Ahmed Sameh in "Parallel Algorithms for the Singular Value
Decomposition".
At each sweep the A matrix becomes a little more orthogonal, until each column of A is orthogonal
to every other within a given tolerance. At this point the sweep() routine returns 1 and convergence
is attained.

Arguments:

A   : on input the square matrix to be orthogonalized, on exit a more "orthogonal" matrix
V   : on input the accumulated rotation matrix, on exit it is updated with the current rotations
N   : dimension of the matrices.
tol : tolerance for convergence of the orthogonalisation. See the explanations for the SVD() routine
M   : number of blocks (it is calculated from the given block size n)
n   : block size (number of columns in each block). It should be set to an optimal value for the available hardware

returns : 1 if in the last sweep convergence is attained, 0 if not and one more sweep is necessary

Author : Renato Talucci 2015
*************************************************************************************************/

int ntol=0;
//STEP 1 : INTERNAL BLOCK ORTHOGONALISATION
#pragma omp parallel for reduction(+:ntol) shared(A,V,M,n,tol,N) default(none)
for(int a=0;a<M;a++)
{
    for(int i=a*n;i<a*n+imin(n,N-a*n)-1;i++)
    {
        for(int j=i+1;j<a*n+imin(n,N-a*n);j++)
        {
            double p=0;
            double qi=0;
            double qj=0;
            double cs,sn,q,c;
            for(int h=0;h<N;h++)
            {
                p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
                qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
                qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
            }
            q=qi-qj;
            if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
            else                        //if Ai & Aj are not orthogonal enough then... rotate them
            {
                c=sqrt(4*p*p+q*q);
                if(q>=0){
                    cs=sqrt((c+q)/(2*c));
                    sn=p/(c*cs);
                }
                else{
                    sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
                    cs=p/(c*sn);
                }
                //...rotate Ai & Aj
                double Ahi,Vhi;
                for(int h=0;h<N;h++)
                {
                    Ahi=A[h+N*i];
                    A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
                    A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
                }
                //store & update rotation matrix V (only columns i & j are updated)
                for(int h=0;h<N;h++)
                {
                    Vhi=V[h+N*i];
                    V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
                    V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
                }
            }

        }
    }
}

//STEP 2 : PARALLEL BLOCK MUTUAL ORTHOGONALISATION
static int*I=new int[(int)ceil(0.5*(M-1))];//allocated once (static): assumes M does not change between calls
static int*J=new int[(int)ceil(0.5*(M-1))];
for(int h=0;h<M;h++)
{
    //fill in i,j indexes of blocks to be mutually orthogonalized at each turn
    int k=h+1;
    if (k==M)
    {
        for(int i=2;i<=(int)ceil(0.5*M);i++){
            I[i-2]=i-1;
            J[i-2]=M+2-i-1;
        }
    }
    else
    {
        for(int i=1;i<=(int)ceil(0.5*(M-k));i++)I[i-1]=i-1;
        for(int i=1;i<=(int)ceil(0.5*(M-k));i++)J[i-1]=M-k+2-i-1;
        if(k>2)
        {
            int j=(int)ceil(0.5*(M-k));
            for(int i=M-k+2;i<=M-(int)floor(0.5*k);i++){
                I[j]=i-1;
                J[j]=2*M-k+2-i-1;
                j++;
            }
        }

    }
    int ng=(k%2==0)?(int)floor(0.5*(M-1)):(int)floor(0.5*M);

    #pragma omp parallel for schedule(static,5) shared(A,V,I,J,n,tol,N,ng) reduction(+:ntol) default(none)
    for(int g=0;g<ng;g++)
    {
        int block_i=I[g];
        int block_j=J[g];
        for(int i=block_i*n;i<block_i*n+imin(n,N-block_i*n);i++)
        {
            for(int j=block_j*n;j<block_j*n+imin(n,N-block_j*n);j++)
            {
                double p=0;
                double qi=0;
                double qj=0;
                double cs,sn,q,c;
                int h;
                for(h=0;h<N;h++)
                {
                    p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
                    qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
                    qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
                }
                q=qi-qj;
                if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
                else                        //if Ai & Aj are not orthogonal enough then... rotate them
                {
                    c=sqrt(4*p*p+q*q);
                    if(q>=0){
                        cs=sqrt((c+q)/(2*c));
                        sn=p/(c*cs);
                    }
                    else{
                        sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
                        cs=p/(c*sn);
                    }
                    //...rotate Ai & Aj
                    double Ahi,Vhi;
                    for(h=0;h<N;h++)
                    {
                        Ahi=A[h+N*i];
                        A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
                        A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
                    }
                    //store & update rotation matrix V (only columns i & j are updated)
                    for(h=0;h<N;h++)
                    {
                        Vhi=V[h+N*i];
                        V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
                        V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
                    }
                }

            }
        }


    }
}
if(2*ntol==(N*(N-1)))return(1);//if each column of A is orthogonal enough to every other, stop sweeping
return(0);
}

int SVD(double* A,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
This routine calculates the SVD decomposition of the square matrix A [NxN]

                                A = U * S * V'

Arguments :
A       :   on input NxN square matrix to be factorized, on exit contains the
            rotated matrix A*V=U*S.
V       :   on input an identity NxN matrix, on exit is the right orthogonal matrix
            of the decomposition A = U*S*V'
U       :   NxN matrix, on exit is the left orthogonal matrix of the decomposition A = U*S*V'.
sigma   :   array of dimension N. On exit contains the singular values, i.e. the diagonal
            elements of the matrix S.
N       :   Dimension of the A matrix.
tol     :   Tolerance for the convergence of the orthogonalisation of A. Given any two columns Ai and Aj
            of A, convergence is attained when (Ai*Aj)^2 / ( |Ai|^2 * |Aj|^2 ) < tol for each i,j=0,..,N-1 (i!=j),
            which matches the test p*p/(qi*qj) < tol used in sweep()

The software returns the number of sweeps needed for convergence.

NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]

Author: Renato Talucci 2015
*************************************************************************************************/

int n=24;//block size (number of columns per block): choose an optimal value for your hardware
int M=N/n+int(((N%n)!=0)?1:0);
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
    converged=sweep(A,V,N,tol,M,n);
    swp++;
}
#pragma omp parallel for default(none) shared(sigma,A,U,N)
for(int i=0;i<N;i++)
{
    double si=0;
    for(int j=0;j<N;j++) si+=A[j+N*i]*A[j+N*i];
    si=sqrt(si);
    for(int k=0;k<N;k++) U[k+N*i]=A[k+N*i]/si;
    sigma[i]=si;
}

return(swp);
}

Note : if you prefer to leave A unchanged on exit, it is sufficient to compute U=A*V before entering the while loop and to pass the obtained U matrix to the sweep() routine instead of A. U is then orthonormalised instead of A after sweep() converges, and the #pragma omp directive must list U among the shared variables instead of A.
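
A minimal sketch of that variant (the name SVD_keepA is hypothetical; it assumes V enters as the identity, so U=A*V reduces to a plain copy):

int SVD_keepA(double* A,double* U,double* V,int N,double tol,double* sigma)
{
    int n=24;//block size: choose an optimal value for your hardware
    int M=N/n+int(((N%n)!=0)?1:0);
    for(int e=0;e<N*N;e++) U[e]=A[e];//U = A*V with V = I on input, i.e. a copy: A stays untouched
    int swp=0;//sweeps counter
    int converged=0;
    while(converged==0) {
        converged=sweep(U,V,N,tol,M,n);//orthogonalise U instead of A
        swp++;
    }
    #pragma omp parallel for default(none) shared(sigma,U,N)
    for(int i=0;i<N;i++)
    {
        double si=0;
        for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];
        si=sqrt(si);
        for(int k=0;k<N;k++) U[k+N*i]/=si;
        sigma[i]=si;
    }
    return(swp);
}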

Note 2: if you have (as I have) to factorize a sequence of matrices A(k), each of which can be considered a perturbation of the previous A(k-1) (as a Jacobian matrix in a multistep Newton solver can), the driver can easily be modified to update from A(k-1) to A(k) instead of computing A(k)'s factorization from scratch. Here is the code:

int updateSVD(double* DA,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
Given a previous factorization

                        A(k-1) = U(k-1) * S(k-1) * V(k-1)'

and given a perturbation DA(k) of A(k-1), i.e.

                                    A(k) = A(k-1) + DA(k)

this routine calculates the SVD factorization of A(k), starting from the factorization of A(k-1)

Arguments:

DA      :   on input NxN perturbation matrix, unchanged on exit
U       :   on input the NxN orthonormal left matrix of the previous (k-1) factorization, on exit
            the orthonormal left matrix of the current factorization
V       :   on input the NxN orthonormal right matrix of the previous (k-1) factorization, on exit
            the orthonormal right matrix of the current factorization
N       :   dimension of the matrices
tol     :   Tolerance for the convergence of the orthogonalisation of A. Given any two columns Ai and Aj
            of A, convergence is attained when (Ai*Aj)^2 / ( |Ai|^2 * |Aj|^2 ) < tol for each i,j=0,..,N-1 (i!=j)
sigma   :   on input, array with the N singular values of the previous factorization, on exit
            array with the N singular values of the current factorization
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]   

Author: Renato Talucci 2015
*************************************************************************************************/


for(int i=0;i<N;i++) for(int j=0;j<N;j++) U[i+N*j]*=sigma[j];
int n=24; //block size (number of columns per block): choose an optimal value for your hardware
matmat_col_col(DA,V,U,N,n); //U =U(k-1)*S(k-1) + DA(k)*V(k-1) = A(k)*V(k-1) 
int M=N/n+int(((N%n)!=0)?1:0);  //number of blocks
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
    converged=sweep(U,V,N,tol,M,n);
    swp++;
}

#pragma omp parallel for default(none) shared(sigma,U,N)
for(int i=0;i<N;i++)
{
    double si=0;
    for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];
    si=sqrt(si);
    for(int k=0;k<N;k++) U[k+N*i]=U[k+N*i]/si;
    sigma[i]=si;
}

return(swp);
}

Finally, the routine matmat_col_col(DA,V,U,N,n) is a parallel block matrix product. Here is the code:

inline int imin(int a,int b) {return((a<=b)?a:b);}

void matmat_col_col(double* A,double* B,double*C,int  N,int n)
/********************************************************************************************
square matrix block product NxN :

                            C = C + A * B

n is the optimal block dimension
N is the dimension of the matrices

NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING M(i,j) = M[i+N*j]

Author: Renato Talucci 2015
*************************************************************************************************/
{
 int M=N/n+int(((N%n)!=0)?1:0);
 #pragma omp parallel for shared(M,A,B,C)
 for(int a=0;a<M;a++)
 {
     for(int b=0;b<M;b++)
     {
         for(int c=0;c<M;c++)
         {
             for(int i=a*n;i<imin((a+1)*n,N);i++)
             {
                 for(int j=b*n;j<imin((b+1)*n,N);j++)
                 {
                     for(int k=c*n;k<imin((c+1)*n,N);k++)
                     {
                          C[i+N*j]+=A[i+N*k]*B[k+N*j];
                     }

                 }

             }
         }
     }
 }
return;
}

I hope no typos have crept in from all the copying and pasting.

It would be nice if anyone could improve the code.

Teol11
  • Good work! Could you add a table of the times before and after your improvement? You should also compare your result to a BLAS library. Intel MKL is the best (for Intel processors) if you have it, otherwise try Eigen. You should also state exactly what your hardware is. – Z boson Feb 27 '15 at 10:42
  • Considering my 1000x1000 test matrix, the first code took 64 seconds. After parallelization it still needed 49 seconds. The new code takes 2 (two) seconds. I stress that these are just indicative performances. For lack of time, I could only compare my code with a prebuilt and probably very poorly optimized build of OpenBLAS, using the LAPACKE dgesvd() routine. For the same matrix it took 9 seconds to do the job. Hardware is an Intel i5-2400 quad 3.10GHz + Windows XP, 4 threads. – Teol11 Mar 02 '15 at 10:45
  • @Teol11 Looks like this thread has been dormant for some time, but if you are still reading, do you have an idea of how big a matrix has to be for the OpenMP to do some good in your code? I need something to speed up the decomp of a 7x6 matrix, which seems small to me. But don't blame me for grasping at straws. :-) Thanks. – bob.sacamento Apr 10 '18 at 20:10

The FLOP count for singular value decomposition (SVD) goes as O(N^3), whereas the reads go as O(N^2). This means it may be possible to parallelize the algorithm and have it scale well with the number of cores.

However, your implementation of SVD is memory bandwidth bound, which means it can't scale well with the number of cores on a single socket system. Your three nested loops currently each go over the entire range of N, which causes much of the data to be evicted from the cache before the next time it needs to be reused.

In order to be compute bound you're going to have to change your algorithm so that, instead of operating on long strips/dot products of size N, it uses loop tiling to maximize the O(n^3) computation (where n is the block size) done on the O(n^2) data held in cache.

Here is a paper on parallel SVD computation, found via a Google search, which says in the abstract:

the key point of our proposed block JRS algorithm is reusing the loaded data into cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in JRS iteration algorithms.

I used loop tiling/blocks to achieve good scaling with the number of cores for Cholesky decomposition.

Note that using tiles/blocks will improve the performance of your single threaded code as well.
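
As a rough sketch of the blocking idea (not the paper's exact algorithm; the block sizes and names are illustrative), the dot-product stage can accumulate the small Gram matrix of a block of nb columns tile by tile over the rows, so each row tile is loaded into cache once and reused for all nb*(nb+1)/2 column pairs:

//G is nb x nb, column major: on exit G(i,j) = Ai'*Aj for columns c0..c0+nb-1,
//supplying p (off-diagonal) and qi,qj (diagonal) for every pair in the block
void block_gram(const double* A,int N,int c0,int nb,int rowtile,double* G)
{
    for(int e=0;e<nb*nb;e++) G[e]=0.0;
    for(int h0=0;h0<N;h0+=rowtile)//each row tile stays hot in cache
    {
        int h1=(h0+rowtile<N)?(h0+rowtile):N;
        for(int j=0;j<nb;j++)
            for(int i=0;i<=j;i++)
            {
                double s=0.0;
                for(int h=h0;h<h1;h++) s+=A[h+N*(c0+i)]*A[h+N*(c0+j)];
                G[i+nb*j]+=s;
            }
    }
    for(int j=0;j<nb;j++)//mirror the upper triangle to the lower
        for(int i=0;i<j;i++) G[j+nb*i]=G[i+nb*j];
}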

Z boson