
I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:

double doModelFit(int model, ...) {
   ...
   while( !done ) {
     cblas_dgemm(...);
     cblas_dgemm(...);
     ...
     dgesv(...);
     ...
   }
   return result;
}

int main(int argc, char **argv) {
  ...
  c_start = 1;  c_stop = nmodel;
  for(int c=c_start; c<=c_stop; c++) {
    ...
    result = doModelFit(c, ...);
    ...
  }
}

Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):

int main(int argc, char **argv) {
  ...
  int numthreads = omp_get_max_threads();
  int c;
#pragma omp parallel for private(c)
  for(int t=0; t<numthreads; t++) {
     // assuming nmodel divisible by numthreads...
     int c_start = t*nmodel/numthreads + 1;  // local, so private per thread
     int c_stop  = (t+1)*nmodel/numthreads;
     for(c=c_start; c<=c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
     }
  }
}

When I run version 1 on the host machine, it takes ~11 seconds, and VTune reports poor parallelization, with most of the time spent idle. Version 2 on the host machine takes ~5 seconds, and VTune reports great parallelization (nearly 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run from the command line on mic0. When I profile them with VTune:

  • Version 1 takes roughly the same 30 seconds, and the hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield: of 7710 s of CPU time, 5804 s is attributed to Spin Time.
  • Version 2 takes fooooorrrreevvvver... I kill it after it has run a couple of minutes in VTune. The hotspot analysis shows that of 25254 s of CPU time, 21585 s is spent in [vmlinux].

Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default OMP_NUM_THREADS and have set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm sure I'm making rookie mistakes.

Thanks, Andrew

    MKL has some serious performance problems with small matrices on Phi. I recommend posting your question on the Intel forums: http://software.intel.com/en-us/forums/intel-many-integrated-core – pburka Nov 01 '13 at 19:52
  • 1
    @pburka I've posted over there too. Just trying to cast a wider net. :) Do you have a link for the small matrix problems? – Andrew Nov 01 '13 at 20:00
  • 1
    Here's one problem http://software.intel.com/en-us/forums/topic/475924 . I'm also following up with Intel through Premier Support. In this case I believe that MKL reduces the number of threads to 30, and then it takes 1ms to spin the threads back up after the GEMM call. But I also believe that this is not the only GEMM performance problem. – pburka Nov 01 '13 at 20:21
  • 1
    @pburka I just saw your post over at Intel, and your sizes are comparable to mine. I didn't pad to get divisible by 64 dimensions (which I will try) but that doesn't seem like the right answer. Are you saying that (hypothetically) it is automatically spawning ~8 threads to compute models, each of which spawns 30 threads for dgemm? Or am I giving OpenMP too much credit? Do you know of a way to figure out who spawns what threads on the Phi? – Andrew Nov 01 '13 at 20:39
  • I don't know a good way. I stepped through MKL using gdb to figure out what it was doing. The OpenMP source is public, so you can instrument that. https://www.openmprtl.org/ – pburka Nov 03 '13 at 00:47
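One quick way to see the multiplication effect described in these comments (a hypothetical probe, not code from this thread): print the OpenMP team size next to MKL's per-call thread count; if their product exceeds the hardware thread count, the run is over-subscribed.

/* Hypothetical probe: who would spawn how many threads?
   Build with the Intel compiler, OpenMP enabled, MKL linked. */
#include <stdio.h>
#include <omp.h>
#include <mkl.h>   /* mkl_get_max_threads() */

int main(void) {
  printf("OpenMP max threads: %d\n", omp_get_max_threads());
  printf("MKL threads per call: %d\n", mkl_get_max_threads());
#pragma omp parallel
  {
#pragma omp single
    printf("outer team: %d threads, MKL per call inside: %d\n",
           omp_get_num_threads(), mkl_get_max_threads());
  }
  return 0;
}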

2 Answers


The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel regions inside the MKL implementations of cblas_dgemm() and dgesv(). E.g. see this example.

This explanation is supported by Jim Dempsey at the Intel forum.
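A minimal sketch of that fix (my illustration, assuming the threaded MKL is linked; doModelFit(), nmodel, and result are the poster's names): restrict MKL to one thread per call so that only the outer model loop is parallel.

#include <omp.h>
#include <mkl.h>   /* mkl_set_dynamic(), mkl_set_num_threads() */

double doModelFit(int model /*, ... */);

void fitAllModels(int nmodel, double *result) {
  mkl_set_dynamic(0);      /* keep MKL from adjusting thread counts itself */
  mkl_set_num_threads(1);  /* each MKL call runs on its calling thread only */
  omp_set_nested(0);       /* no nested OpenMP teams under the outer loop */

#pragma omp parallel for schedule(dynamic)
  for(int c = 1; c <= nmodel; c++) {
    result[c-1] = doModelFit(c /*, ... */);
  }
}

Setting MKL_NUM_THREADS=1 and OMP_NESTED=false in the environment should have the same effect without code changes.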

Anton

What about using the sequential MKL library? If you link against MKL with the sequential option, it doesn't spawn OpenMP threads inside MKL itself. I suspect you'll get better results than you do now.
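For example, with the Intel compiler the sequential layer is a single link option (illustrative; the exact flags depend on your toolchain and MKL version):

# native Phi build: sequential MKL, OpenMP still enabled for the outer loop
icc -mmic -openmp -O2 modelfit.c -o modelfit -mkl=sequential

With the sequential layer linked, the outer #pragma omp parallel for in version 2 supplies all the parallelism, and the MKL calls no longer compete with it for the card's hardware threads.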

Sunwoo Lee
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](http://stackoverflow.com/help/privileges/comment). – durron597 Jul 29 '15 at 21:48
  • Generally, links or references to a tool or library should not merely include a specific explanation of how the recommended resource is applicable to the problem, but [also be accompanied by usage notes or some sample code](http://meta.stackoverflow.com/a/251605). (@durron597: Technically, this *is* an answer, just not all that amazing of one.) – Nathan Tuggy Jul 30 '15 at 00:21