I'm calling Intel MKL for CSR-format SpMV. To speed it up, I'm using multiple threads via mkl_set_num_threads. However, as the number of threads increases, performance gets worse. Any idea what's going on?

Although the documentation says that the thread count passed to mkl_set_num_threads is only an upper limit and fewer threads may be used, I would expect performance to at least stay the same with more threads. Here is my code:

for(int j = 1; j <= max_threads; ++j){

    mkl_set_num_threads(j);

    mkl_scsrmv(&transa, &m, &k, &alpha, matdescra, val, col_idx, row_idx, &row_idx[1], x, &beta, y); 
    start = clock();

    for(int i = 0; i < 100; ++i) { 
            mkl_scsrmv(&transa, &m, &k, &alpha, matdescra, val, col_idx, row_idx, &row_idx[1], x, &beta, y);
    }
    end = clock();
    elapsed = end - start;
    cout << "the float CSR spmv performance is " << (double)nnz  * ( (double)CLOCKS_PER_SEC / 10000000 ) / (double)elapsed  << " Gflops using " << j << " threads" << endl;
}

And here is the result:

  • the float CSR spmv performance is 0.634938 Gflops using 1 threads
  • the float CSR spmv performance is 0.535313 Gflops using 2 threads
  • the float CSR spmv performance is 0.494569 Gflops using 3 threads
  • the float CSR spmv performance is 0.483146 Gflops using 4 threads
  • the float CSR spmv performance is 0.421995 Gflops using 5 threads
  • the float CSR spmv performance is 0.417408 Gflops using 6 threads
  • the float CSR spmv performance is 0.386758 Gflops using 7 threads
  • the float CSR spmv performance is 0.393107 Gflops using 8 threads
  • the float CSR spmv performance is 0.378721 Gflops using 9 threads
  • the float CSR spmv performance is 0.354885 Gflops using 10 threads
  • the float CSR spmv performance is 0.328653 Gflops using 11 threads
  • the float CSR spmv performance is 0.31173 Gflops using 12 threads
  • the float CSR spmv performance is 0.302341 Gflops using 13 threads
  • the float CSR spmv performance is 0.281349 Gflops using 14 threads
  • the float CSR spmv performance is 0.273103 Gflops using 15 threads
  • the float CSR spmv performance is 0.26132 Gflops using 16 threads
  • the float CSR spmv performance is 0.238776 Gflops using 17 threads
  • the float CSR spmv performance is 0.218346 Gflops using 18 threads
  • the float CSR spmv performance is 0.210184 Gflops using 19 threads
  • the float CSR spmv performance is 0.19734 Gflops using 20 threads

By the way, I'm compiling with gcc and linking with -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

Any help would be appreciated. Thanks.

WalkerShaw
  • Do you know the size of the matrix? You achieved the best performance with 1 thread; adding more threads seems to add only overhead. – supercheval Aug 03 '18 at 13:46

1 Answer

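clock() returns the total CPU time used by the process. On Linux with glibc that total is summed over all of the process's threads, so the measured time can grow as MKL uses more threads even when each call finishes sooner. You should measure real-world (wall-clock) time instead.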

Li Lingjie