
I am a newbie, so I apologize in advance if I write something stupid.

I am using LAPACK on a 2.7 GHz CPU.

My expectations for the running times are based on Figure 4.1 of the publication Performance and Accuracy of the LAPACK..., as well as on Fanfan's answer to this Stack Overflow post.

Both computers mentioned there (and especially the 2.2 GHz Opteron) should be quite a bit slower than my CPU. Yet the Opteron seems to need less than a second for a typical random 1000x1000 symmetric matrix, while my CPU needs about 2.2 seconds on average, with very little variation.

Have I overlooked some feature my implementation should have in order to match the Opteron?

The way I compile is:

g++ -Ddsyev=dsyev_ -o combo.x combo.cc -L/usr/local/lib -lgsl -llapack -lblas && ./combo.x

I have tried the options -O2 and -O3, which did not change anything.
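For completeness, this is the exact command with -O3 added (the -O2 variant is analogous):

g++ -O3 -Ddsyev=dsyev_ -o combo.x combo.cc -L/usr/local/lib -lgsl -llapack -lblas && ./combo.x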

The C++ code I use to measure the times is below. It is my attempt to use the official DSYEV example here with a meaningfully large matrix, and to run it many times, just to be sure I am not getting any peculiar random matrices by chance.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>        /* time() for seeding the RNG */
#include <algorithm>     /* std::copy */
#include <fstream>
#include <vector>
#include <sys/time.h>    /* gettimeofday(), timersub() */
#include <gsl/gsl_randist.h>
#include <gsl/gsl_matrix.h>
/* Parameters */
#define N 1000
#define REPS 100
#define LDA N
#define TOWERSIZE N

struct TOWER
{
    gsl_matrix *matrix_random_covariance;
};


int COMPUTE_RANDOM_COVARIANCE(struct TOWER *tower);
int WRITE_RANDOM_COVARIANCE(struct TOWER *tower);

int read_covariance (std::vector<double> & data)
{
    double tmp;

    std::ifstream fin("random_covariance.data");

    while(fin >> tmp)
    {
        data.push_back(tmp);
    }

    return 0;
}

/* DSYEV prototype */
extern "C"{
void dsyev( char* jobz, char* uplo, int* n, double* a, int* lda,
                double* w, double* work, int* lwork, int* info );
}
/* Auxiliary routines prototypes */
extern "C"{ 
void print_matrix( char* desc, int m, int n, double* a, int lda );
}

/* Main program */
int main() {
    std::vector<double> time_series;
    double time_series_sum = 0.0;
    double time_series_average = 0.0;

    struct TOWER *tower;
    tower = (struct TOWER *)calloc(1, sizeof(struct TOWER));
    double* work = NULL;    /* allocated (and freed) inside the repetition loop */
    for (unsigned int repj = 0; repj < REPS; repj++)
    {
        COMPUTE_RANDOM_COVARIANCE(tower);
        WRITE_RANDOM_COVARIANCE(tower);
        printf( "-----------------------------------------------> Entry main.\n" );
        /* Locals */
        std::vector<double> data;
        int n = N, lda = LDA, info, lwork;
        double wkopt;
        /* Local arrays */
    double w[N];
    static double a[LDA*N];
        printf( "-----------------------------------------------> Point 0.\n" );
    read_covariance(data);
        printf( "-----------------------------------------------> Point 1.\n" );
        std::copy(data.begin(), data.end(), a);
        printf( "-----------------------------------------------> Point 2.\n" );

    struct timeval tval_before, tval_after, tval_result;
    gettimeofday(&tval_before, NULL);

    /* Executable statements */
        printf( " DSYEV Example Program Results\n" );
        /* Query and allocate the optimal workspace */
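        /* LAPACK convention: calling dsyev with lwork = -1 performs a workspace
           query only; the optimal workspace size is returned in wkopt. */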
        lwork = -1;
        dsyev( "Vectors", "Upper", &n, a, &lda, w, &wkopt, &lwork, &info );
        printf( "-----------------------------------------------> Point 4.\n" );
        lwork = (int)wkopt;
        work = (double*)malloc( lwork*sizeof(double) );
        /* Solve eigenproblem */
        dsyev( "Vectors", "Upper", &n, a, &lda, w, work, &lwork, &info );
        /* Check for convergence */
        if( info > 0 ) {
                printf( "The algorithm failed to compute eigenvalues.\n" );
                exit( 1 );
        }
        gettimeofday(&tval_after, NULL);
        timersub(&tval_after, &tval_before, &tval_result);
        printf("Time one diag / secs: %ld.%06ld\n", (long int)tval_result.tv_sec, (long int)tval_result.tv_usec);
        double elapsed = tval_result.tv_sec + 0.000001*tval_result.tv_usec;
        time_series.push_back(elapsed);    /* keep the microseconds, not just whole seconds */
        time_series_sum = time_series_sum + elapsed;
        free(work);                        /* work is reallocated in every repetition */
        work = NULL;
        gsl_matrix_free(tower->matrix_random_covariance);    /* the matrix is reallocated in every repetition */
    }

        time_series_average = time_series_sum/REPS;
        printf("Time Series Average = %e\n", time_series_average);
        printf("Time Series Length = %u\n", REPS);
        /* Print eigenvalues */
        //print_matrix( "Eigenvalues", 1, n, w, 1 ); //nocheat!!
        /* Print eigenvectors */
        //print_matrix( "Eigenvectors (stored columnwise)", n, n, a, lda ); //nocheat!!
        /* Free workspace (each repetition already freed its own buffer; work is NULL here) */
        free( (void*)work );
    free(tower);
        exit( 0 );
} /* End of DSYEV Example */

/* Auxiliary routine: printing a matrix */
void print_matrix( char* desc, int m, int n, double* a, int lda ) {
        int i, j;
        printf( "\n %.2f\n", desc );
        for( i = 0; i < m; i++ ) {
                for( j = 0; j < n; j++ ) printf( " %.2f", a[i+j*lda] );
                printf( "\n" );
        }
}
/* ---  --- */
int COMPUTE_RANDOM_COVARIANCE(struct TOWER *tower)    /* fill a random symmetric covariance matrix */
{
    int a,b;
    double correl;
    const gsl_rng_type * T;
    gsl_rng * r;

    T = gsl_rng_default;
    r = gsl_rng_alloc (T);

    srand((time(0)));                // srand & time are built-in
    unsigned long int s = random();  // gsl_rng_uniform will eventually
                                     // want a non-negative "long" integer
    gsl_rng_env_setup();
    gsl_rng_set(r,s); // seed the random number generator;



    printf("extreme_kappa: building random covariance matrix....\n");
//#pragma omp parallel
//{     
    tower->matrix_random_covariance = gsl_matrix_calloc(TOWERSIZE,TOWERSIZE);

    for(a=0;a<TOWERSIZE;a++)
    {
//#pragma omp for
        for(b=a;b<TOWERSIZE;b++)
        {
            correl = gsl_ran_flat(r,0.0,1.0);
            gsl_matrix_set(tower->matrix_random_covariance,a,b,correl);
            gsl_matrix_set(tower->matrix_random_covariance,b,a,correl);
        }
    }
//} //PARALLEL LOOP!!
    gsl_rng_free(r);
    return(0);
}
/* ---  --- */
/* --- --- */
int WRITE_RANDOM_COVARIANCE(struct TOWER *tower)
{
    int a,b;
    FILE *file;
    char fname[256];
    double aux;

    printf("extreme_kappa: writing random covariances....\n");

    sprintf(fname,"./random_covariance.data");
    file = fopen(fname,"w+");
    if(file == NULL)
    {
        printf("extreme_kappa: error opening file %s\n",fname);
        exit(2);
    }

    for(a=0;a<TOWERSIZE;a++)
    {
        for(b=0;b<TOWERSIZE;b++)
        {
            aux = gsl_matrix_get(tower->matrix_random_covariance,a,b);
            fprintf(file,"%e ",aux);
        }
        fprintf(file,"\n");
    }

    fclose(file);
    return(0);
}
/* --- --- */
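
In case it matters, this is what I use to check which BLAS/LAPACK shared libraries combo.x actually loads at run time (I am not sure whether the reference BLAS I link with -lblas is comparable to whatever the publication's machines used, so this is just my own sanity check):

ldd ./combo.x | grep -i -E 'blas|lapack'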

I shall be grateful for any hints!

  • If you have "why is .... slow" questions, just profile your code, the profiler will point you right to the slow spots – Cory Kramer Aug 05 '15 at 16:40
  • @CoryKramer Could you add something more? I don't even know what a profiler is. Is it something like valgrind? – Ludi Aug 05 '15 at 16:42
  • Similar to valgrind, but valgrind looks for memory leaks (among other things). A performance profiler looks at CPU usage and how long different operations take, for example see some screenshots for the [Very Sleepy](http://www.codersnotes.com/sleepy/) program – Cory Kramer Aug 05 '15 at 16:45
  • @CoryKramer Thank you! I will check immediately. But I was hoping a LAPACK user may point me to something obvious. There are some here :) – Ludi Aug 05 '15 at 16:46
  • Have you commented out the omp pragmas? That code runs on a single processor, whereas I'd expect the omp version to use all available cores. – Axel Aug 05 '15 at 16:48
  • Are you compiling the code with or without optimizations? What about debug information? – TAS Aug 05 '15 at 16:53
  • @Axel Thank you! I removed those parallelisations, as they don't have anything to do with the real problem: the LAPACK performance. They were just for speeding up the creation of the matrix tested. I am curious to know whether my references use any parallelisation I overlooked. But I guess something special is needed to use parallel LAPACK, not just an omp pragma... – Ludi Aug 05 '15 at 17:09
  • @TAS Sorry, I am not very skilled. What optimisations do you have in mind? – Ludi Aug 05 '15 at 17:10
  • Do you get the expected performance out of LAPACK's test programs? Also bear in mind that LAPACK is written in Fortran, so any calls from C/C++ will need to have their inputs transposed, which will cause a performance hit. – anonymous coward Aug 05 '15 at 18:25
  • @Ludi It depends on your compiler; for `g++` you can try using `-O2` or `-O3`. – TAS Aug 05 '15 at 19:07
  • @anonymouscoward Many thanks. The programme posted is basically my attempt to use the official example here: https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_lapack_examples/dsyev_ex.c.htm with a meaningfully large matrix, and to do this many times, just to be sure I am not studying peculiar matrices. – Ludi Aug 06 '15 at 13:56
  • Hm. We're kind of comparing apples and oranges here, you've just got different overhead. Run `lapack_testing.py`, it should come in your LAPACK distribution and it's fairly comprehensive. Also look [here](http://www.netlib.org/lapack/lawn41/node27.html). I'm having a hard time finding times for you to compare to but if you can get your hands on another rig (hell, you could spend 10 cents and rent an EC2 instance) you should be able to see if something is egregiously wrong with your setup vs your program's overhead. – anonymous coward Aug 06 '15 at 14:19
  • @anonymouscoward Sorry, I am very much a newbie. What do you mean by overhead? I think I am only measuring the time spent in the part of the programme that diagonalises, right? The creation of the matrix etc. is not measured, so I hoped to get the same times as the publication. – Ludi Aug 06 '15 at 14:43

0 Answers