
I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:

double doModelFit(int model, ...) {
   ...
   while( !done ) {
     cblas_dgemm(...);
     cblas_dgemm(...);
     ...
     dgesv(...);
     ...
   }
   return result;
}

int main(int argc, char **argv) {
  ...
  c_start = 1;  c_stop = nmodel;
  for(int c=c_start; c<=c_stop; c++) {
    ...
    result = doModelFit(c, ...);
    ...
  }
}

Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):

int main(int argc, char **argv) {
  ...
  int numthreads = omp_get_max_threads();
  int c;
#pragma omp parallel for private(c)
  for(int t=0; t<numthreads; t++) {
     // assuming nmodel divisible by numthreads...
     int c_start = t*nmodel/numthreads + 1;  // local, so private per thread
     int c_stop  = (t+1)*nmodel/numthreads;
     for(c=c_start; c<=c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
     }
  }
}

When I run version 1 on the host machine, it takes ~11 seconds, and VTune reports poor parallelization, with most of the time spent idle. Version 2 on the host machine takes ~5 seconds, and VTune reports great parallelization (nearly 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run from the command line on mic0. When I profile them with VTune:

  • Version 1 takes roughly the same 30 seconds, and the hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield: of 7710 s of CPU time, 5804 s is attributed to Spin Time.
  • Version 2 takes fooooorrrreevvvver... I kill it after it has run a couple of minutes in VTune. The hotspot analysis shows that of 25254 s of CPU time, 21585 s is spent in [vmlinux].

Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default OMP_NUM_THREADS and have set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm sure I'm making rookie mistakes.

Thanks, Andrew

    MKL has some serious performance problems with small matrices on Phi. I recommend posting your question on the Intel forums: http://software.intel.com/en-us/forums/intel-many-integrated-core – pburka Nov 01 '13 at 19:52
  • 1
    @pburka I've posted over there too. Just trying to cast a wider net. :) Do you have a link for the small matrix problems? – Andrew Nov 01 '13 at 20:00
  • 1
    Here's one problem http://software.intel.com/en-us/forums/topic/475924 . I'm also following up with Intel through Premier Support. In this case I believe that MKL reduces the number of threads to 30, and then it takes 1ms to spin the threads back up after the GEMM call. But I also believe that this is not the only GEMM performance problem. – pburka Nov 01 '13 at 20:21
  • 1
    @pburka I just saw your post over at Intel, and your sizes are comparable to mine. I didn't pad to get divisible by 64 dimensions (which I will try) but that doesn't seem like the right answer. Are you saying that (hypothetically) it is automatically spawning ~8 threads to compute models, each of which spawns 30 threads for dgemm? Or am I giving OpenMP too much credit? Do you know of a way to figure out who spawns what threads on the Phi? – Andrew Nov 01 '13 at 20:39
  • I don't know a good way. I stepped through MKL using gdb to figure out what it was doing. The OpenMP source is public, so you can instrument that. https://www.openmprtl.org/ – pburka Nov 03 '13 at 00:47
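One quick way to see the multiplication effect described in these comments (a hypothetical probe, not code from this thread): print the OpenMP team size next to MKL's per-call thread count; if their product exceeds the hardware thread count, the run is over-subscribed.

/* Hypothetical probe: who would spawn how many threads?
   Build with the Intel compiler, OpenMP enabled, MKL linked. */
#include <stdio.h>
#include <omp.h>
#include <mkl.h>   /* mkl_get_max_threads() */

int main(void) {
  printf("OpenMP max threads: %d\n", omp_get_max_threads());
  printf("MKL threads per call: %d\n", mkl_get_max_threads());
#pragma omp parallel
  {
#pragma omp single
    printf("outer team: %d threads, MKL per call inside: %d\n",
           omp_get_num_threads(), mkl_get_max_threads());
  }
  return 0;
}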

2 Answers


The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel regions inside the MKL implementations of cblas_dgemm() and dgesv(). E.g. see this example.

This explanation is supported by Jim Dempsey at the Intel forum.
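A minimal sketch of that fix (my illustration, assuming the threaded MKL is linked; doModelFit(), nmodel, and result are the poster's names): restrict MKL to one thread per call so that only the outer model loop is parallel.

#include <omp.h>
#include <mkl.h>   /* mkl_set_dynamic(), mkl_set_num_threads() */

double doModelFit(int model /*, ... */);

void fitAllModels(int nmodel, double *result) {
  mkl_set_dynamic(0);      /* keep MKL from adjusting thread counts itself */
  mkl_set_num_threads(1);  /* each MKL call runs on its calling thread only */
  omp_set_nested(0);       /* no nested OpenMP teams under the outer loop */

#pragma omp parallel for schedule(dynamic)
  for(int c = 1; c <= nmodel; c++) {
    result[c-1] = doModelFit(c /*, ... */);
  }
}

Setting MKL_NUM_THREADS=1 and OMP_NESTED=false in the environment should have the same effect without code changes.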

Anton

What about using the sequential MKL library? If you link against MKL with the sequential option, it doesn't spawn OpenMP threads inside MKL itself. I suspect you'll get better results than you do now.
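For example, with the Intel compiler the sequential layer is a single link option (illustrative; the exact flags depend on your toolchain and MKL version):

# native Phi build: sequential MKL, OpenMP still enabled for the outer loop
icc -mmic -openmp -O2 modelfit.c -o modelfit -mkl=sequential

With the sequential layer linked, the outer #pragma omp parallel for in version 2 supplies all the parallelism, and the MKL calls no longer compete with it for the card's hardware threads.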

Sunwoo Lee
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](http://stackoverflow.com/help/privileges/comment). – durron597 Jul 29 '15 at 21:48
  • Generally, links or references to a tool or library should not merely include a specific explanation of how the recommended resource is applicable to the problem, but [also be accompanied by usage notes or some sample code](http://meta.stackoverflow.com/a/251605). (@durron597: Technically, this *is* an answer, just not all that amazing of one.) – Nathan Tuggy Jul 30 '15 at 00:21