I need to run a multi-threaded matrix-vector multiplication every 500 microseconds. The matrix is the same, the vector changes every time.
I use Intel's sgemv() from the MKL on a 64-core AMD CPU. If I compute the multiplications back to back in a for-loop in a small test program, each sgemv() call takes 20 microseconds. If I add a spin loop (polling the TSC) of about 500 microseconds to each loop iteration, the time per sgemv() call increases to 30 microseconds with OMP_WAIT_POLICY=ACTIVE. With OMP_WAIT_POLICY=PASSIVE (the default), it goes up to as much as 60 microseconds.
Does anybody know what could be going on and why it is slower with the breaks? And what can be done to avoid this?
It doesn't seem to make a difference whether the spin loop is single-threaded or runs in a "#pragma omp parallel" context. It also makes no difference whether or not I keep the AVX units busy in the spin loop. This is on Linux: the CPU cores are isolated, and the test program runs at a high priority with SCHED_FIFO.
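For context, here is a minimal sketch of how a process can request SCHED_FIFO on Linux. The helper name request_sched_fifo and the priority value in the usage note are illustrative assumptions, not the exact code from the test program, and the call normally needs root or CAP_SYS_NICE to succeed.

```c
#include <sched.h>
#include <string.h>

/* Hypothetical helper: switch the calling process to SCHED_FIFO.
 * The requested priority is clamped to the range the kernel reports.
 * Returns 0 on success, -1 on failure (errno is set, e.g. EPERM). */
static int request_sched_fifo(int priority)
{
    int const lo = sched_get_priority_min(SCHED_FIFO);
    int const hi = sched_get_priority_max(SCHED_FIFO);
    if (priority < lo) priority = lo;
    if (priority > hi) priority = hi;

    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;

    /* pid 0 means "the calling process" */
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

Calling request_sched_fifo(50) once at startup, before any timing is taken, would be one way to use it. Checking the return value matters, because without the required privilege the call fails and the program silently stays on the default scheduler.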
Spin wait function:
static void spin_wait(uint64_t ticks)
{
    // busy-wait until the given number of TSC ticks has elapsed
    uint64_t const start = rdtsc();
    while( rdtsc() - start < ticks )
    {;}
}
For-loop:
uint64_t t0[num], t1[num];
for( int i=0; i<num; i++ )
{
    // modify input vector, just incrementing each element
    t0[i] = rdtsc();
    cblas_sgemv(...);
    t1[i] = rdtsc();
    spin_wait( 500us );  // pseudocode: 500 microseconds converted to TSC ticks
}
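The per-call times quoted above come from the differences t1[i] - t0[i]. Here is a minimal sketch of how those tick differences can be reduced to microsecond statistics. The struct timing_stats and the function summarize_timings are hypothetical names for this illustration, and ticks_per_us is assumed to be the known or calibrated TSC frequency in ticks per microsecond.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: per-call duration statistics in microseconds. */
struct timing_stats {
    double min_us;
    double mean_us;
    double max_us;
};

/* Convert n pairs of start/end tick stamps into min/mean/max durations.
 * Returns all zeros if n is 0 or ticks_per_us is not positive. */
static struct timing_stats summarize_timings(const uint64_t *t0,
                                             const uint64_t *t1,
                                             size_t n,
                                             double ticks_per_us)
{
    struct timing_stats s = {0.0, 0.0, 0.0};
    if (n == 0 || ticks_per_us <= 0.0)
        return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double const us = (double)(t1[i] - t0[i]) / ticks_per_us;
        if (i == 0 || us < s.min_us) s.min_us = us;
        if (i == 0 || us > s.max_us) s.max_us = us;
        sum += us;
    }
    s.mean_us = sum / (double)n;
    return s;
}
```

Looking at the minimum and maximum alongside the mean helps tell a uniform slowdown of every call apart from a few slow outliers, for example the first call after each pause.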