
I came across this great YouTube tutorial, and in one of the slides I saw something I didn't understand. Why is this happening? Is it compiler related?

Version #1 looks like this:

const int N = 5000;
float a [N*N];

for (int x=0; x<N; ++x)
     for(int y=0; y<N; ++y)
          sum+=a[x+y*N];

and takes about 239.4ms to execute.

And version #2 looks like this:

const int N = 5000;
float a [N*N];

for (int y=0; y<N; ++y)
     for(int x=0; x<N; ++x)
          sum+=a[x+y*N];

and takes about 79.5ms to execute. Why is this happening?

chendoy
  • @user202729 well I did not link my question to another site, I just added a reference to the actual video for credit, meaning I'm not the owner of the content. But thanks for your note. – chendoy Oct 20 '18 at 13:09
  • Changed it, noted to myself. – chendoy Oct 20 '18 at 13:19

1 Answer


The second example demonstrates better data locality, since it accesses elements of the same row one after another. Basically it performs a sequential memory read, while the first example jumps over sizeof(float) * N bytes on each iteration, putting extra stress on the CPU cache / memory.

user7860670
  • Can it be that the compiler optimizes the value of the multiplication, so in the second case it does not have to reevaluate the value of y*N on every loop iteration? – kangaro0 Oct 20 '18 at 13:04
  • 1
    @kangaro0 Actually I think that multiplication can be optimized (by converting it into addition) in both cases. In the first case it will probably be just `+ N`, and `+ 1` in the second case. – user7860670 Oct 20 '18 at 13:08
  • Accessing contiguous memory is always faster. As @VTT said, data locality! – 7raiden7 Oct 20 '18 at 13:31