I'm calling Intel MKL for SpMV in CSR format. To speed it up, I'm using multiple threads via mkl_set_num_threads. However, as the number of threads increases, performance gets worse. Any idea what's going on?
Although the documentation says that the thread count passed to mkl_set_num_threads is only an upper limit, and that fewer threads may actually be used, I would expect performance to at least stay the same with more threads. Here is my code:
    for (int j = 1; j <= max_threads; ++j) {
        mkl_set_num_threads(j);
        // Warm-up call, not timed.
        mkl_scsrmv(&transa, &m, &k, &alpha, matdescra, val, col_idx, row_idx, &row_idx[1], x, &beta, y);
        // Time 100 repetitions with clock().
        start = clock();
        for (int i = 0; i < 100; ++i) {
            mkl_scsrmv(&transa, &m, &k, &alpha, matdescra, val, col_idx, row_idx, &row_idx[1], x, &beta, y);
        }
        end = clock();
        elapsed = end - start;
        cout << "the float CSR spmv performance is "
             << (double)nnz * ((double)CLOCKS_PER_SEC / 10000000) / (double)elapsed
             << " Gflops using " << j << " threads" << endl;
    }
And here are the results:
- the float CSR spmv performance is 0.634938 Gflops using 1 threads
- the float CSR spmv performance is 0.535313 Gflops using 2 threads
- the float CSR spmv performance is 0.494569 Gflops using 3 threads
- the float CSR spmv performance is 0.483146 Gflops using 4 threads
- the float CSR spmv performance is 0.421995 Gflops using 5 threads
- the float CSR spmv performance is 0.417408 Gflops using 6 threads
- the float CSR spmv performance is 0.386758 Gflops using 7 threads
- the float CSR spmv performance is 0.393107 Gflops using 8 threads
- the float CSR spmv performance is 0.378721 Gflops using 9 threads
- the float CSR spmv performance is 0.354885 Gflops using 10 threads
- the float CSR spmv performance is 0.328653 Gflops using 11 threads
- the float CSR spmv performance is 0.31173 Gflops using 12 threads
- the float CSR spmv performance is 0.302341 Gflops using 13 threads
- the float CSR spmv performance is 0.281349 Gflops using 14 threads
- the float CSR spmv performance is 0.273103 Gflops using 15 threads
- the float CSR spmv performance is 0.26132 Gflops using 16 threads
- the float CSR spmv performance is 0.238776 Gflops using 17 threads
- the float CSR spmv performance is 0.218346 Gflops using 18 threads
- the float CSR spmv performance is 0.210184 Gflops using 19 threads
- the float CSR spmv performance is 0.19734 Gflops using 20 threads
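One thing worth ruling out in the measurement above: clock() reports the processor time used by the program, and on POSIX systems that is CPU time for the whole process, summed over its threads. It may therefore not reflect elapsed time once MKL runs multithreaded. Below is a minimal sketch of a wall-clock timer for comparison, assuming a C++11 compiler. The names wall_seconds and gflops are hypothetical helpers introduced here for illustration and are not part of MKL. The default of one operation per nonzero mirrors the formula in the code above; passing 2.0 would count a multiply and an add per nonzero.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of MKL): run `work` `reps` times and return
// the elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double wall_seconds(F&& work, int reps) {
    assert(reps > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: convert a timing into Gflop/s.
// flops_per_nonzero = 1.0 matches the formula used in the original loop;
// 2.0 would count one multiply and one add per stored nonzero.
inline double gflops(double nnz, int reps, double seconds,
                     double flops_per_nonzero = 1.0) {
    return nnz * flops_per_nonzero * reps / seconds / 1e9;
}
```

In the loop above, the timed region would wrap the same mkl_scsrmv call in a lambda passed to wall_seconds with reps = 100, and the result would go to gflops together with nnz. If the wall-clock figures scale differently from the clock() figures, the slowdown is likely an artifact of the timer rather than of MKL.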
By the way, I'm compiling with gcc and linking with -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl.
Any help would be appreciated. Thanks.