I have written a parallel program using OpenMP. Since my machine has 8 cores, I spawn 8 threads. Using the command "sar -P ALL 1 20", I can see that the I/O wait percentage is very high on all the cores.
Based on another SO post, I found that callgrind is a good tool for profiling C++ applications, but it does not work for my code. I am using OpenBLAS, and valgrind complains that it is unable to recognize OpenBLAS functions.
Can someone please tell me how I can track down exactly where in my code the problem lies?