
I'm testing the performance of ?GEMM, ?TRMM, and ?TRSM using MKL's automatic offload on the new Intel Xeon Phi coprocessors, and I'm having issues with DTRMM and DTRSM. My code tests performance for matrix sizes in steps of 1024 up to 10240, and performance drops off significantly somewhere after N=M=K=8192. When I tried to pin down exactly where by using a step size of 2, my script hung. I then checked a step size of 512, which works fine, and 256 works as well, but anything under 256 just stalls. I cannot find any known issues regarding this problem. All single-precision versions work, as do both single and double precision for ?GEMM. Here is my code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "mkl.h"

#define DBG 0

int main(int argc, char **argv)
{
   char transa = 'N', side = 'L', uplo = 'L', diag = 'U';
   MKL_INT N, NP; // N is used for M, N, K, lda, ldb, and ldc
   double alpha = 1.0; // Scaling factor
   double *A, *B; // Matrices
   size_t matrix_bytes; // Matrix size in bytes (size_t avoids int overflow for large N)
   size_t matrix_elements; // Matrix size in elements
   size_t i, j; // Counters
   int msec;
   clock_t start, diff;

   if (argc < 2)
   {
      printf("Usage: %s N\n", argv[0]);
      return -1;
   }
   N = atoi(argv[1]);

   start = clock();

   matrix_elements = (size_t)N * N;
   matrix_bytes = sizeof(double) * matrix_elements;

   // Allocate the matrices
   A = malloc(matrix_bytes);
   if (A == NULL)
   {
      printf("Could not allocate matrix A\n");
      return -1;
   }

   B = malloc(matrix_bytes);
   if (B == NULL)
   {
      printf("Could not allocate matrix B\n");
      return -1;
   }

   for (i = 0; i < matrix_elements; i++)
   {
      A[i] = 0.0;
      B[i] = 0.0;
   }

   // Initialize the lower triangles (column-major layout, matching uplo = 'L')
   for (i = 0; i < N; i++)
      for (j = 0; j <= i; j++)
      {
         A[i+N*j] = 1.0;
         B[i+N*j] = 2.0;
      }

   // DTRMM: B := alpha * A * B, with A unit-diagonal lower-triangular
   dtrmm(&side, &uplo, &transa, &diag, &N, &N, &alpha, A, &N, B, &N);

   // Note: clock() measures host CPU time, not wall-clock time, so the
   // figure may be misleading when the work is offloaded to the coprocessor.
   diff = clock() - start;
   msec = diff * 1000 / CLOCKS_PER_SEC;
   printf("%f\n", msec / 1000.0); // elapsed time in seconds

   if (DBG == 1)
   {
      printf("\nMatrix dimension is set to %d \n\n", (int)N);

      // Display the result
      printf("\nResulting matrix B:\n");
      if (N > 10)
      {
         printf("NOTE: B is too large, print only upper-left 10x10 block...\n");
         NP = 10;
      }
      else
         NP = N;

      printf("\n");
      for (i = 0; i < NP; i++)
      {
         for (j = 0; j < NP; j++)
            printf("%7.3f ", B[i + j * N]);
         printf("\n");
      }
   }

   // Free the matrix memory
   free(A);
   free(B);

   return 0;
}
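
For completeness: I'm relying on MKL's Automatic Offload rather than explicit offload pragmas. A minimal sketch of how AO gets switched on, assuming MKL 11.x (MKL_MIC_ENABLE, OFFLOAD_REPORT, and mkl_mic_enable() are the documented knobs; the snippet below is illustrative, not my exact setup):

/* Minimal sketch (assuming MKL 11.x): Automatic Offload can be enabled
   either by exporting MKL_MIC_ENABLE=1 in the environment or by calling
   mkl_mic_enable() before the first BLAS call. Setting OFFLOAD_REPORT=2
   additionally makes the runtime report what each call offloaded. */
#include <stdio.h>
#include "mkl.h"

int main(void)
{
   if (mkl_mic_enable() != 0) // returns 0 when AO was enabled successfully
   {
      printf("Automatic Offload could not be enabled\n");
      return -1;
   }
   // BLAS calls made from this point on are candidates for offload.
   return 0;
}

With OFFLOAD_REPORT=2 set, the runtime should print per-call offload details, which is how I confirm that 100% of the work goes to the MIC.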

Any help or insight would be greatly appreciated.

mjswartz

2 Answers


This phenomenon has been extensively discussed in other questions, and also in Intel's Software Optimization Manual and Agner Fog's notes.

Typically, you are experiencing a perfect storm of evictions in the memory hierarchy, such that suddenly (nearly) every single access misses the cache and/or the TLB. One can determine exactly which resource is being missed by looking at the specific data access pattern or by using the PMCs; I can do the calculation later when I'm near a whiteboard, unless Mysticial gets to you first.
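
One quick experiment to test this hypothesis: pad the leading dimension so the row stride is no longer a power of two, and see whether the cliff disappears. A minimal sketch, assuming the same Fortran-style MKL interface as in the question (the +16 pad is purely illustrative):

// Sketch: re-run the benchmark with the leading dimension padded away from
// a power of two. calloc zero-fills, so the call is well-defined even
// without the triangular initialization from the question.
#include <stdlib.h>
#include "mkl.h"

void trmm_padded(MKL_INT n)
{
   char side = 'L', uplo = 'L', transa = 'N', diag = 'U';
   double alpha = 1.0;
   MKL_INT lda = n + 16; // break the power-of-two row stride
   double *A = calloc((size_t)lda * n, sizeof(double));
   double *B = calloc((size_t)lda * n, sizeof(double));
   if (A == NULL || B == NULL)
   {
      free(A);
      free(B);
      return;
   }
   dtrmm(&side, &uplo, &transa, &diag, &n, &n, &alpha, A, &lda, B, &lda);
   free(A);
   free(B);
}

If the padded run is fast at the same N where the unpadded one falls off a cliff, conflict misses are the likely culprit.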

You can also search through some of my or Mysticial's previous answers on the subject.

Stephen Canon
  • I'll start searching through your answers. Thanks for the response! – mjswartz Feb 20 '13 at 16:46
  • Actually, if you could point me to the question topic, that would be much appreciated. 26 pages of answers is a lot to browse! – mjswartz Feb 20 '13 at 16:57
  • Here's some discussion in the context of a naive matrix multiply: http://stackoverflow.com/questions/7905760/matrix-multiplication-small-difference-in-matrix-size-large-difference-in-timi. The mechanics of your case are a bit different because MKL does cache blocking, but you are experiencing essentially the same phenomenon. I'll add more detail later today. – Stephen Canon Feb 20 '13 at 17:09
  • Mysticial also talks a bit about the issue here: http://stackoverflow.com/questions/9515482/performance-advantages-of-powers-of-2-sized-data – Stephen Canon Feb 20 '13 at 17:10
  • We may be on different pages. I'm not too concerned about the performance drop-off (the linked discussions make perfect sense). What bothers me is that N=8192 takes about 10 seconds to run while offloading 100% of the work to the MIC, whereas N=8292 does not run at all. It simply hangs: no errors or anything, it just sits there. It's not a slowdown due to cache size; it's a dead stop. – mjswartz Feb 20 '13 at 17:18
  • I'm currently testing which values the program hangs on, and I'm getting some strange results. It seems to work for every value up to 7616, where it starts to hang, and it keeps hanging through 7679. It works for 7680 and 7681, but hangs from 7682 up to 8191. I haven't tested between 8192 and the 9000s because the system I'm on is under maintenance now, but I plan to continue when it is back up. – mjswartz Feb 20 '13 at 20:40

The issue was an older version of Intel's icc compiler (the beta 10 update, I believe, though I'm not certain). The gold update works like a charm.
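
If anyone else hits this, it's worth confirming at runtime which MKL build the binary actually linked against. A minimal sketch using mkl_get_version_string(), which MKL provides as a support function (assuming a reasonably recent MKL):

// Sketch: print the MKL version string to confirm which library build
// the executable actually linked against.
#include <stdio.h>
#include "mkl.h"

int main(void)
{
   char version[200];
   mkl_get_version_string(version, (int)sizeof(version));
   printf("%s\n", version);
   return 0;
}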

mjswartz