I am implementing a sparse matrix-multivector multiplication in OpenMP, and I've noticed that adding a few printf calls to a pre-processing function, which is not parallelized, greatly increases the GFLOPS measured for the subsequent parallelized product.
I've tried changing my measurement method from clock_gettime() to omp_get_wtime(), but the results were almost the same.
To help you understand what I'm talking about:
int main() {
    // some stuff

    // serial product computation
    clock_gettime(CLOCK_MONOTONIC, &t1);
    serial_product(a, x, k, y_s);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    gflops_s = flop / ((t2.tv_sec - t1.tv_sec) * 1.e9 + (t2.tv_nsec - t1.tv_nsec));

    pre_processing(...); // in here there's a loop that does some load balancing

    clock_gettime(CLOCK_MONOTONIC, &t1);
    openmp_spmm(a, rows_idx, num_threads, x, k, y_p);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    gflops_p = flop / ((t2.tv_sec - t1.tv_sec) * 1.e9 + (t2.tv_nsec - t1.tv_nsec));

    // some other stuff
}
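For reference, the omp_get_wtime() version of the measurement I tried was roughly this (same structure, just a different clock; flop is the operation count I compute elsewhere):

double start = omp_get_wtime(); // omp_get_wtime() is declared in <omp.h>
openmp_spmm(a, rows_idx, num_threads, x, k, y_p);
double end = omp_get_wtime();
gflops_p = flop / ((end - start) * 1.e9); // elapsed seconds * 1e9 = nanoseconds, so this is GFLOPS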
As I was saying, if I don't put the printf calls (which I added only for some basic debugging) in the loop of the pre_processing() function, gflops_p is around 2 GFLOPS. With those printf calls, which have nothing to do with OpenMP, gflops_p jumps to around 10 GFLOPS.
In case it helps, this is the pre-processing function:
void pre_processing(int threads, int tot_nz, const int* irp, int tot_rows, int* rows_idx) {
    int j, nz, nz_prev = 0, nz_curr, start_row = 0, r_prev, r_curr = 0;

    for (int i = 0; i < threads; i++) {
        rows_idx[i] = start_row;
        printf("."); // why does this print enormously speed up the omp product???

        nz_curr = ((i + 1) * tot_nz) / threads;
        nz = nz_curr - nz_prev;
        nz_prev = nz_curr;

        for (j = start_row; j < tot_rows; j++) {
            r_curr += irp[j + 1] - irp[j];
            if (r_curr < nz) {
                r_prev = r_curr;
            } else {
                start_row = ((r_curr - nz) < (nz - r_prev)) ? j + 1 : j;
                break;
            }
        }

        r_curr = 0;
    }

    rows_idx[threads] = tot_rows;
    printf("\n");
}
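Expanded, the elided pre_processing(...) call above is roughly the following (just a sketch, since the real arguments are omitted in main; here I use irp[tot_rows] as the total nonzero count, which is where CSR stores it):

int *rows_idx = malloc((num_threads + 1) * sizeof(int)); // one start row per thread, plus the end boundary
pre_processing(num_threads, irp[tot_rows], irp, tot_rows, rows_idx);
// rows_idx[tid] .. rows_idx[tid + 1] is the block of rows handled by thread tid in the parallel product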
I have no clue why this happens. Maybe it's something about stdout flushing, clock cycles, or CPU utilization?
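One check I was thinking of, to separate a stdout/flushing effect from a pure timing effect, is to replace the printf with an artificial delay of similar length and see whether the speedup is still there. Just a sketch, with an arbitrary iteration count:

// Hypothetical helper: a small busy-wait used in place of printf(".") inside
// the pre_processing() loop, to check whether any short pause reproduces the
// higher GFLOPS of the subsequent parallel product.
static void busy_delay(long iters) {
    volatile double sink = 0.0;
    for (long w = 0; w < iters; w++)
        sink += (double)w * 1e-9; // volatile stops the compiler from removing the loop
}
// ...then call busy_delay(100000) where the printf(".") currently is.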
EDIT 1
The program runs on a server with CentOS Stream 8 (Linux kernel 4.18.0-448.el8.x86_64) and a CPU with these characteristics:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.000
CPU max MHz: 3200.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
EDIT 2
My other guess has to do with the AVX instructions used in the SpMM routine, since it was pointed out that the problem might be compiler optimization around vectorization and inlining. This is my first time using this kind of instruction, so I still don't fully understand how it impacts execution.
This is the SpMM code:
void spmm_csr_64(CSR *mat, const int* rows_load, int threads, const Type* x, int k, Type* y) {
    int *irp = mat->IRP;
    int *ja = mat->JA;
    Type *as = mat->AS;

    const __m256i scale = _mm256_set1_epi32(k);

    #pragma omp parallel for num_threads(threads) shared(threads, rows_load, irp, k, as, ja, x, y, scale) default(none)
    for (int tid = 0; tid < threads; tid++) { // parallelize on threads' ids
        // private params
        int j, z, iter, lim, r_y, r_x;
        Type val;
        __m256i cols;
        __m512d vals, x_r;
        __m512d t[k];

        #pragma omp unroll partial
        for (z = 0; z < k; z++) {
            t[z] = _mm512_setzero_pd(); // init t vector
        }

        for (int i = rows_load[tid]; i < rows_load[tid + 1]; i++) { // thread gets the rows to process
            j = irp[i];
            lim = (irp[i + 1] - j) / 8;

            for (iter = 0; iter < lim; iter++) {
                vals = _mm512_loadu_pd(&as[j]);                   // load 8 64-bit values
                cols = _mm256_loadu_si256((__m256i*)&ja[j]);      // load 8 32-bit column indices
                cols = _mm256_mullo_epi32(cols, scale);           // scale col index to be on the first column of x

                x_r = _mm512_i32gather_pd(cols, x, sizeof(Type)); // gather 8 64-bit elements of x via cols
                t[0] = _mm512_fmadd_pd(vals, x_r, t[0]);          // fused multiply-add

                for (z = 1; z < k; z++) {
                    cols = _mm256_add_epi32(cols, _MM8_1);            // step to the next column of x (_MM8_1 defined elsewhere)
                    x_r = _mm512_i32gather_pd(cols, x, sizeof(Type)); // gather 8 64-bit elements of x via cols
                    t[z] = _mm512_fmadd_pd(vals, x_r, t[z]);          // fused multiply-add
                }

                j += 8;
            }

            r_y = i * k;

            // remainder loop if the number of nonzeros in the row is not a multiple of 8
            for (; j < irp[i + 1]; j++) {
                val = as[j];
                r_x = ja[j] * k;

                #pragma omp unroll partial
                for (z = 0; z < k; z++) {
                    y[r_y + z] += val * x[r_x + z];
                }
            }

            // reduce all 64-bit elements in t by addition
            #pragma omp unroll partial
            for (z = 0; z < k; z++) {
                y[r_y + z] += _mm512_reduce_add_pd(t[z]);
                t[z] = _mm512_setzero_pd();
            }
        }
    }
}
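To check that I read my own intrinsics correctly: the scalar computation the vectorized block is meant to perform for one row i is the following (semantics only, not code I actually run):

// Scalar equivalent of the AVX-512 inner loops for one CSR row i:
// every nonzero as[j] (column ja[j]) of row i contributes to all k
// columns of row i of the output y, using row ja[j] of the dense block x.
for (int j = irp[i]; j < irp[i + 1]; j++) {
    Type val = as[j];
    const Type *x_row = &x[ja[j] * k]; // x is stored row-major with k columns
    for (int z = 0; z < k; z++)
        y[i * k + z] += val * x_row[z];
}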
I use 'Type' as a placeholder for double, since I wanted to be able to switch it to float later if needed, but right now I'm working with double precision.
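In other words, something along the lines of:

typedef double Type; // currently double; a float version would also need the corresponding float intrinsics and lane widths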
Any advice on how to improve the code is also greatly welcome.