
I am implementing a sparse matrix-multivector multiplication (SpMM) in OpenMP, and I've noticed that adding a few printf calls to a pre-processing function, which is not parallelized, greatly increases the GFLOPS obtained from the subsequent parallelized product.

I've tried changing my measurement method from clock_gettime() to omp_get_wtime(), but the results were almost the same.

To help you understand what I'm talking about:

int main() {

    // some stuff

    // serial product computation
    clock_gettime(CLOCK_MONOTONIC, &t1);
    serial_product(a, x, k, y_s);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    gflops_s = flop / ((t2.tv_sec - t1.tv_sec) * 1.e9 + (t2.tv_nsec - t1.tv_nsec));

    pre_processing(...); // in here there's a loop to do some load balancing

    clock_gettime(CLOCK_MONOTONIC, &t1);
    openmp_spmm(a, rows_idx, num_threads, x, k, y_p);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    gflops_p = flop / ((t2.tv_sec - t1.tv_sec) * 1.e9 + (t2.tv_nsec - t1.tv_nsec));

    // some other stuff
}
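
For completeness, the omp_get_wtime() variant I tried was essentially the same, just with a different timer (a rough sketch, not my exact code; t_start/t_end are placeholder names):

double t_start = omp_get_wtime();                 // wall-clock time in seconds
openmp_spmm(a, rows_idx, num_threads, x, k, y_p);
double t_end = omp_get_wtime();

// flop / nanoseconds gives GFLOPS, same accounting as above
gflops_p = flop / ((t_end - t_start) * 1.e9);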

As I was saying, if I don't put the printf calls (which I added just for some basic debugging) in the loop of the pre_processing() function, gflops_p is around 2 GFLOPS. With those printf calls, which have nothing to do with OpenMP, gflops_p jumps to 10 GFLOPS.

If it can help, this is the pre-processing function (it splits the rows among the threads so that each thread gets roughly the same number of nonzeros):

void pre_processing(int threads, int tot_nz, const int* irp, int tot_rows, int* rows_idx){
    int j, nz, nz_prev = 0, nz_curr, start_row = 0, r_prev = 0, r_curr = 0;

    for (int i = 0; i < threads; i++) {
        rows_idx[i] = start_row; 

        printf("."); // why this print enormously speeds up omp product???

        nz_curr = ((i + 1) * tot_nz) / threads;
        nz = nz_curr - nz_prev;
        nz_prev = nz_curr;

        for (j = start_row; j < tot_rows; j++) {
            r_curr += irp[j + 1] - irp[j]; 

            if (r_curr < nz) { 
                r_prev = r_curr; 
            } else {
                start_row = ((r_curr - nz) < (nz - r_prev)) ? j + 1 : j;
                break;
            }
        }

        r_curr = 0;
    }
    rows_idx[threads] = tot_rows;
    printf("\n");
}
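
Just to illustrate what it produces: with a made-up CSR row-pointer array of 5 rows and 12 nonzeros, and 2 threads, each boundary is placed so that every thread gets as close as possible to tot_nz / threads nonzeros:

int irp[] = {0, 2, 5, 6, 10, 12};   // 5 rows, 12 nonzeros (made-up values)
int rows_idx[3];                    // threads + 1 entries

pre_processing(2, 12, irp, 5, rows_idx);
// rows_idx is now {0, 3, 5}: thread 0 gets rows 0..2 (6 nonzeros),
// thread 1 gets rows 3..4 (6 nonzeros)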

I have no clue as to why this happens. Maybe something about stdout flushing or clock cycles or CPU utilization?
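
One check I thought of, to tell a compile-time effect apart from a run-time one, is to replace the printf with something that is opaque to the optimizer but does no I/O (GCC/Clang inline asm; just an idea, I haven't tried it on the server yet):

// in pre_processing(), instead of printf("."):
__asm__ __volatile__("" ::: "memory");  // emits no instructions, but acts as a
                                        // compiler barrier: the optimizer cannot
                                        // merge or reorder code across it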

EDIT 1

The program runs on a server with CentOS Stream 8 (Linux kernel 4.18.0-448.el8.x86_64) and a CPU with these characteristics:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping:            7
CPU MHz:             2200.000
CPU max MHz:         3200.0000
CPU min MHz:         1000.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39

EDIT 2

My other guess has to do with the AVX instructions used in the SpMM routine, since it was pointed out that the problem might be compiler optimization related to vectorization and inlining. This is my first time using this kind of instructions, so I still don't fully understand how they impact execution.

This is the SpMM code:

void spmm_csr_64(CSR *mat, const int* rows_load, int threads, const Type* x, int k, Type* y){

    int *irp = mat->IRP;
    int *ja = mat->JA;
    Type *as = mat->AS;

    const __m256i scale = _mm256_set1_epi32(k);

    #pragma omp parallel for num_threads(threads) shared(threads, rows_load, irp, k, as, ja, x, y, scale) default(none)
    for (int tid = 0; tid < threads; tid++) {   // parallelize on threads' id
        // private params
        int j, z, iter, lim, r_y, r_x;
        Type val;
        __m256i cols;
        __m512d vals, x_r;

        __m512d t[k];
        #pragma omp unroll partial
        for (z = 0; z < k; z++) {
            t[z] = _mm512_setzero_pd(); // init t vector
        }

        for (int i = rows_load[tid]; i < rows_load[tid + 1]; i++) { // thread gets the row to process
            j = irp[i];
            lim = (irp[i+1] - j) / 8;

            for (iter = 0; iter < lim; iter++) {

                vals = _mm512_loadu_pd(&as[j]);                     // load 8 64-bit elements
                cols = _mm256_loadu_si256((__m256i*)&ja[j]);        // load 8 32-bit elements columns
                cols = _mm256_mullo_epi32(cols, scale);             // scale col index to be on the first column of x
                x_r = _mm512_i32gather_pd(cols, x, sizeof(Type));   // build 8 64-bit elements from x and cols
                t[0] = _mm512_fmadd_pd(vals, x_r, t[0]);            // execute a fused multiply-add

                for (z = 1; z < k; z++) {
                    cols = _mm256_add_epi32(cols, _MM8_1);
                    x_r = _mm512_i32gather_pd(cols, x, sizeof(Type));      // build 8 64-bit elements from x and cols
                    t[z] = _mm512_fmadd_pd(vals, x_r, t[z]);    // execute a fused multiply-add
                }

                j += 8;
            }

            r_y = i * k;

            // remainder loop if elements are not multiple of size
            for (; j < irp[i+1]; j++) {
                val = as[j];
                r_x = ja[j] * k;

                #pragma omp unroll partial
                for (z = 0; z < k; z++) {
                    y[r_y + z] += val * x[r_x + z];
                }
            }

            // reduce all 64-bit elements in t by addition
            #pragma omp unroll partial
            for (z = 0; z < k; z++) {
                y[r_y + z] += _mm512_reduce_add_pd(t[z]);
                t[z] = _mm512_setzero_pd();
            }
        }
    }
}

I use 'Type' as a placeholder for double, since I wanted to be able to change it to float whenever possible, but right now I'm working with double precision.
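
For reference, Type and _MM8_1 come from a header of mine; roughly, they look like this (the exact definitions aren't shown above):

typedef double Type;                  // double precision for now; could become float

#define _MM8_1 _mm256_set1_epi32(1)   // eight 32-bit ones: steps the gather indices
                                      // from one column of x to the next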

Any advice on how to improve the code is also greatly welcome.

Comments
  • `printf` changes the synchronization between threads, so such an effect is unfortunately not so rare in practice. It also impacts compiler optimization (including vectorization and inlining). Thus, in some rare cases, adding `printf` can prevent compilers from doing some inefficient optimizations. Understanding why this happens is hard with the complete code, and often nearly impossible without it. Please make a minimal reproducible example if possible. This will help you better locate the issue and certainly help us reproduce the problem. – Jérôme Richard Apr 02 '23 at 12:53
  • Dump the code to assembler and have a look. – David C. Rankin Apr 02 '23 at 14:00
  • Please also specify the exact model of the CPU used for the benchmark and the operating system (as well as the target platform, so we know whether any other programs could be running on it). – Jérôme Richard Apr 02 '23 at 14:06
  • @JérômeRichard I'll try to upload more code, but the whole program is a bit complex. As for the platform, I'll edit my question right now. – lilith Apr 02 '23 at 14:32
  • The target processor is a NUMA machine. I have had similar issues in the past, but between different runs of the same program. To rule out NUMA effects, can you: 1. bind threads to cores so the scheduler cannot move threads to different cores (closer to the IO-related interconnect), 2. use `numactl` with a memory binding to node 0 so that pages are only allocated on NUMA node 0, and restrict the process to the first NUMA node (just to be sure there are no possible NUMA-related effects). This should be done with and without the target `printf`. – Jérôme Richard Apr 02 '23 at 17:36
  • You can find some information about binding OpenMP threads and numactl in the following posts: https://stackoverflow.com/a/64415109/12939557 and https://stackoverflow.com/a/62615032/12939557. Besides, it would be great to run `perf stat` in both cases to provide information about what the processor does (it might require multiple runs since we do not know exactly where the problem is). For example, we can compare the number of SIMD instructions with and without the `printf`, but also the frequency, cache misses, etc. – Jérôme Richard Apr 02 '23 at 17:45
  • @JérômeRichard Thanks a lot! It was very helpful. I will try to test your suggestions as I can, given that I don't own the server and I don't have permission to change its settings, unfortunately. – lilith Apr 02 '23 at 20:36
