
I came across a strange performance issue in a matrix multiply benchmark (matrix_mult in Metis from the MOSBENCH suite). The benchmark was optimized to tile the data so that the active working set was 12 KB (3 tiles of 32x32 ints) and would fit into the L1 cache. To make a long story short, swapping the two innermost loops made a performance difference of almost 4x on certain array input sizes (4096, 8192) and about 30% on others. The problem essentially came down to accessing elements sequentially versus in a stride pattern. I think certain array sizes created a bad stride that generated a lot of cache line collisions. The performance difference is noticeably smaller when changing from a 2-way associative L1 to an 8-way associative L1.

My question is why doesn't gcc optimize loop ordering to maximize sequential memory accesses?

Below is a simplified version of the problem. (Note that performance times are highly dependent on the L1 configuration; the numbers below are from a 2.3 GHz AMD system with a 64 KB 2-way associative L1, compiled with -O3.)

const int N = 1024; // ARRAY_SIZE
int* mat_A = (int*)malloc(N*N*sizeof(int));
int* mat_B = (int*)malloc(N*N*sizeof(int));
int* mat_C = (int*)malloc(N*N*sizeof(int));

// Elements of mat_B are accessed in a stride pattern of length N
// This takes 800 msec  
for (int t = 0; t < 1000; t++) 
   for (int a = 0; a < 32; a++) 
      for (int b = 0; b < 32; b++)
         for (int c = 0; c < 32; c++) 
            mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];

// Inner two loops are swapped
// Elements are now accessed sequentially in inner loop
// This takes 172 msec  
for (int t = 0; t < 1000; t++) 
   for (int a = 0; a < 32; a++) 
      for (int c = 0; c < 32; c++) 
         for (int b = 0; b < 32; b++)
            mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
Mark
    Because it's very hard to prove for the compiler that such a change doesn't change the semantics of the operation. – Sven Feb 28 '12 at 06:01
  • maybe you find it interesting to google gcc+graphite, which was a now merged branch for loop transformations based on a polyhedral model. There's somewhere a list of possible transformations. – Sebastian Mach Feb 28 '12 at 06:12
  • I would think that since integer addition is commutative, it would be fairly simple for a compiler to prove that the operation is invariant to loop ordering. I'm wondering what it is then about the code example that makes it non-trivial to optimize. – Mark Feb 29 '12 at 02:32

2 Answers

  1. gcc might not be able to prove that the pointers don't overlap. If you are fine with using a non-standard extension, you could try __restrict.
  2. gcc doesn't take full advantage of your architecture, to avoid the need to recompile for every processor. Using the -march option with the appropriate value for your system might help.
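A sketch of point 1 applied to the kernel from the question (the function name is mine; `restrict` is the C99 spelling, `__restrict` the extension that also works in C++ mode). The qualifier is a promise to the compiler that the three matrices never alias, which frees it to reorder and vectorize the loads and stores:

```c
/* Sketch: restrict-qualified 32x32 tile multiply. The restrict
   qualifiers assert that A, B, and C never overlap; n is the row
   stride of the full matrices (n >= 32). */
static void tile_mult(int n, const int *restrict A,
                      const int *restrict B, int *restrict C) {
    for (int a = 0; a < 32; a++)
        for (int c = 0; c < 32; c++)
            for (int b = 0; b < 32; b++)
                C[n*a + b] += A[n*a + c] * B[n*c + b];
}
```

Note that lying to the compiler here (passing overlapping buffers) is undefined behavior, so this is only valid if the three mallocs really are distinct, as in the question.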
taurdir
  • It's worth noting that restrict _is_ standard for C99, just not for C++. See also http://stackoverflow.com/questions/776283/what-does-the-restrict-keyword-mean-in-c – torek Mar 10 '12 at 21:43
  • @torek I know, just wanted to make clear that it's a non-standard extension but supported by many compilers. – taurdir Mar 10 '12 at 22:02

gcc has a bunch of optimizations that just do what you want.

Look up the -floop-strip-mine and -floop-block compiler options.

Quote from the manual:

Perform loop blocking transformations on loops. Blocking strip mines each loop in the loop nest such that the memory accesses of the element loops fit inside caches. The strip length can be changed using the loop-block-tile-size parameter.
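Written out by hand, the transformation that -floop-block performs looks roughly like this sketch (function name and the fixed tile size are mine; in gcc the tile size comes from the loop-block-tile-size parameter):

```c
/* Sketch of loop blocking on a naive n x n multiply: each loop is
   strip-mined into TILE-sized chunks so one tile of each matrix
   stays resident in L1 while it is reused. TILE = 32 is an
   assumption; n is taken to be a multiple of TILE for clarity. */
#define TILE 32
static void blocked_mult(int n, const int *A, const int *B, int *C) {
    for (int aa = 0; aa < n; aa += TILE)
        for (int cc = 0; cc < n; cc += TILE)
            for (int bb = 0; bb < n; bb += TILE)
                /* element loops: the working set is 3 TILE x TILE tiles */
                for (int a = aa; a < aa + TILE; a++)
                    for (int c = cc; c < cc + TILE; c++)
                        for (int b = bb; b < bb + TILE; b++)
                            C[n*a + b] += A[n*a + c] * B[n*c + b];
}
```

This is essentially the tiling the question's benchmark already does by hand, which is worth keeping in mind when judging whether these flags will help here.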

Nils Pipenbrinck
  • Thanks, though the problem isn't that the inner loops don't fit in cache; that optimization was already done by hand in the code. The inner loops only have a 12 KB data footprint (3*1024 int values). – Mark Mar 10 '12 at 23:15