
I am working on the performance of parallel algorithms on a multicore machine. I ran an experiment on matrix multiplication with the loop-reordering (ikj) technique.

The serial execution results are in the images below. The L1 data cache hit rate for loop orders ikj and kij is near 100% for all sizes of the n×n matrix (Image 1, boxes 1 & 2), but as you can see, for loop order ikj at sizes 2048 and 4096 the L2 data cache hit rate suddenly drops by about 50% (Image 2, boxes 1 & 2); the same is true for the L2 instruction cache hit rate. Meanwhile the L1 data cache hit rate for these two sizes is about 100%, just like the other sizes (256, 512, 1024). I cannot find any reasonable explanation for this drop in both the instruction and data cache hit rates. Could anyone give me a clue on how to find the reason(s)?

Do you think the unified L2 cache has any effect in exacerbating the problem? And still, what causes this reduction? What characteristics of the algorithm and its performance should I profile to find the reason?

The experimental machine is an Intel E4500 with a 2 MB L2 cache and 64-byte cache lines; the OS is Fedora 17 x64, with gcc 4.7 and no compiler optimization (-O0).

Abridged & complete question: why does a sudden decrease of about 50% in both the L2 data and instruction cache hit rates happen only in the ikj & kij algorithms (boxed and numbered 1 & 2 in the images), but not in the other loop variations?

                                   *Image 1*

                                   *Image 2*

                                   *Image 3*

                                   *Image 4*

                                   *Image 5*

Despite the aforementioned problem, there is no increase in the timing of the ikj & kij algorithms; in fact they are faster than the others.

The ikj and kij algorithms are two variations of the loop-reordering technique.

kij Algorithm

    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++) {
            r = A[i][k];
            for (j = 0; j < n; j++)
                C[i][j] += r * B[k][j];
        }

ikj Algorithm

    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++) {
            r = A[i][k];
            for (j = 0; j < n; j++)
                C[i][j] += r * B[k][j];
        }

thanks

mjr
  • That's not possible; the loops easily fit in the L1 instruction cache, so you should never see a decrease in the L2 instruction hit percentage. Your profiler must be broken, or you are measuring a side effect of the thread getting pre-empted far too often. – Hans Passant Oct 12 '13 at 14:09
  • @hans How can you say the loop will fit in the instruction cache? I also ran the parallel version; it too shows variation in the L1 instruction hit rate and in the L2 instruction hit rate, and boxes 1 & 2 happened only in the ikj and kij algorithms!! – mjr Oct 12 '13 at 14:18
  • I did perform profiling with one level of blocking, but this variation was not seen there, so now I am confused!!! – mjr Oct 12 '13 at 14:23
  • The loops might fit in the L1 cache if they compiled. But they don't. – doctorlove Oct 12 '13 at 14:31
  • @doctorlove I am not sure we are on the same point. Although the algorithms differ in the way they access data (locality), they don't use any blocking factor to fit the data in the L1 cache. – mjr Oct 12 '13 at 15:16
  • I mean s/For/for and you have missing semicolons in this code so I wondered what your actual code was – doctorlove Oct 12 '13 at 15:55
  • Have you tried sizes that are not powers of two? – Mysticial Oct 12 '13 at 17:35
  • The E4500 has a 32kB L1 instruction cache and a 32kB L1 data cache per core. The L2 cache is unified, for instructions and data, for both cores. So what does "L2 instruction hit" mean? – Gabe Oct 12 '13 at 18:15
  • Which profiling tool are you using? – villekulla Oct 13 '13 at 18:17
  • PAPI; it uses hardware counters. – mjr Oct 14 '13 at 07:21
  • Using other tools, for example perf, do you get the same strange L2 cache hit? Did you try to vary the sampling rate to rule out any aliasing effects? As there is no visible performance degradation, the measured values look strange, but maybe the L1 prefetcher is just doing a great job... – villekulla Oct 14 '13 at 08:48

1 Answer


I bet this happens because of the super-alignment issue discussed in the answers to the following questions:

I hope it is understandable that I don't like to copy&paste from those answers.
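If super-alignment is indeed the cause, a quick way to test it is to pad the leading dimension so that the rows of a power-of-two matrix no longer map to the same cache sets. This is only a sketch under that assumption; the `PAD` value, `matmul_ikj` and `alloc_padded` names are mine, not from the question:

```c
#include <stdlib.h>

/* Sketch: break power-of-two "super-alignment" by padding each row.
 * With ld = n, the rows of a 2048x2048 float matrix are exactly 8 KB
 * apart and compete for the same cache sets; padding each row by one
 * cache line (64 bytes = 16 floats) spreads them out. PAD is an
 * assumed value, not taken from the question. */
#define PAD 16

/* ikj multiply over padded storage; ld is the padded leading dimension */
static void matmul_ikj(const float *A, const float *B, float *C,
                       int n, int ld)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float r = A[i * ld + k];
            for (int j = 0; j < n; j++)
                C[i * ld + j] += r * B[k * ld + j];
        }
}

/* zero-initialized n x ld matrix (only the first n columns are used) */
static float *alloc_padded(int n, int ld)
{
    return calloc((size_t)n * (size_t)ld, sizeof(float));
}
```

Profiling the same sizes once with `ld = n` and once with `ld = n + PAD` should show whether the L2 hit rate recovers at 2048 and 4096.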

villekulla
  • Thank you @villekulla, it was a nice try and those are nice articles, but my problem is why a sudden decrease of about 50% happens only in the ikj & kij algorithms (boxed and numbered 1 & 2) in both the L2 data and instruction caches, but not in the other loop variations. I should mention that, despite this amount of L2 cache misses, the aforementioned (ikj & kij) algorithms have no timing problem in either the serial or the parallel version; in fact they are faster because of the way they access data, and, as those articles say, they have fewer loads, stores and misses. I added a timing line chart as Image 5; you can see it there. – mjr Oct 13 '13 at 05:31
  • I should add that when I used the blocking technique with the ikj order, such a pattern was not seen for any blocking factor (BF) or size, although there definitely was variation among the BFs. – mjr Oct 13 '13 at 05:48
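For reference, the one-level blocking mentioned in the comment above can be sketched as below. This is my own illustrative version, not the asker's code; `BF` is an assumed tile size, typically chosen so a few float tiles fit in the 32 KB L1 data cache:

```c
/* Hypothetical sketch of one-level blocking (tiling) applied to the
 * ikj order over flat n x n storage. BF = 32 is an assumption: three
 * 32x32 float tiles occupy 12 KB, well within a 32 KB L1 data cache. */
#define BF 32

static void matmul_ikj_blocked(const float *A, const float *B,
                               float *C, int n)
{
    for (int ii = 0; ii < n; ii += BF)
        for (int kk = 0; kk < n; kk += BF)
            for (int jj = 0; jj < n; jj += BF)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile
                 * of B, accumulating into the (ii,jj) tile of C */
                for (int i = ii; i < ii + BF && i < n; i++)
                    for (int k = kk; k < kk + BF && k < n; k++) {
                        float r = A[i * n + k];
                        for (int j = jj; j < jj + BF && j < n; j++)
                            C[i * n + j] += r * B[k * n + j];
                    }
}
```

Because each tile is reused from L1 regardless of the matrix stride, the conflict misses that power-of-two sizes cause in the unblocked version largely disappear, which is consistent with the pattern vanishing for all BFs.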