
I just wrote a program to test the impact of the CPU cache on performance.

void* foo(void* ptr)
{
    int* r = (int*) ptr;   /* r points into the per-thread result array s */

    /* interleaved scan: thread k reads num[k], num[k+NUM_THREADS], num[k+2*NUM_THREADS], ... */
    for (int i = (r - s); i < N; i += NUM_THREADS)
        *r += num[i];
    return NULL;
}

void* bar(void* ptr)
{
    int* r = (int*) ptr;
    int idx = r - s;                 /* thread index, recovered from the result slot */
    int block = N/NUM_THREADS;
    int start = idx * block, end = start + block;

    /* block scan: thread idx reads its own contiguous N/NUM_THREADS chunk of num */
    for (int i = start; i < end; ++i)
        *r += num[i];
    return NULL;
}

Basically, foo() does an interleaved (strided) scan of the array, while bar() scans it block by block.
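
Both functions rely on globals defined in the gist (num, s, N, NUM_THREADS). For context, here is a minimal sketch of what the surrounding scaffolding might look like; the array size, the initialization, and the run() helper are my assumptions rather than the exact gist code, and together with foo() and bar() above it compiles as a complete program:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS 4
#define N (1 << 26)            /* assumed array size; the gist may use a different value */

int num[N];                    /* shared data array scanned by both functions */
int s[NUM_THREADS];            /* one accumulator per thread; each thread receives &s[i] */

/* foo() and bar() as shown above */

double run(void* (*fn)(void*))
{
    pthread_t tid[NUM_THREADS];
    struct timespec t0, t1;

    for (int i = 0; i < NUM_THREADS; ++i)
        s[i] = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&tid[i], NULL, fn, &s[i]);   /* pass &s[i]; foo/bar recover the index as r - s */
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        num[i] = rand() % 2;                        /* randomized data, as mentioned in the comments */

    printf("%fs\n", run(foo));
    printf("%fs\n", run(bar));
    return 0;
}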

The test results indicate that bar() is much faster:

gcc ping-pong.c -std=gnu99 -lpthread -O2  ; ./a.out
1.077037s
0.395525s

So how should I interpret this result?

The full source code is at: https://gist.github.com/4617935

Update: all if-statements removed

Zifei Tong
  • That if-statement looks eerily similar to that of the branch-predictor question. Is it even necessary to have it? – Mysticial Jan 24 '13 at 04:36
  • You may want to look at http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – Rohan Jan 24 '13 at 04:36
  • @Rohan and @Mysticial, both `foo()` and `bar()` operate on the same randomized array. – Zifei Tong Jan 24 '13 at 04:38
  • Sequential memory access is almost always going to out-perform non-sequential access. I'm not sure what else to say. `foo()` is non-sequential, and `bar()` is sequential. I'm not sure what the purpose of the if-statement is. But if it randomly enters with a near 50% probability, then that could add "noise" to the results from branch mispredictions. – Mysticial Jan 24 '13 at 04:46
  • I just want to do something like the classic problem mentioned by @Rohan, to count the number of elements larger than a threshold in an array. – Zifei Tong Jan 24 '13 at 04:51
  • Oh ic. So it's an actual task and not a completely artificial benchmark. – Mysticial Jan 24 '13 at 04:53
  • Woah, that edit changes everything... It just made this question a lot more interesting. – Mysticial Jan 24 '13 at 04:56
  • @Mysticial, plz never mind the edit, I did something wrong. Really sorry for that. – Zifei Tong Jan 24 '13 at 05:01

2 Answers


It turns out not to be mysterious at all.

I used valgrind (cachegrind) to profile the cache misses; here is the result:

$ valgrind --tool=cachegrind --cachegrind-out-file=profile ./a.out
....
$ cg_annotate profile --auto=yes --show=D1mr --context=1
....
-- line 63 ----------------------------------------
    .  void* foo(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     .  
     0      for (int i = (r - s); i < N; i += NUM_THREADS)
16,388          *r += num[i];
     0      return NULL;
     0  }
     .  
-- line 71 ----------------------------------------

-- line 72 ----------------------------------------
     .  void* bar(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     0      int idx = r - s;
     0      int block = N/NUM_THREADS;
     0      int start = idx * block, end = start + block;
     .  
     0      for (int i = start; i < end; ++i)
 4,098          *r += num[i];
     0      return NULL;
     0  }

As you can see, there are 4 times more L1 data-cache read misses in foo() than in bar(), and 4 is exactly NUM_THREADS.

So, as @Mysticial commented:

Sequential memory access is almost always going to out-perform non-sequential access.

More non-sequential memory accesses mean more cache misses. The factor of NUM_THREADS also has a simple explanation (assuming 64-byte cache lines and 4-byte ints): in foo() every thread strides across the whole array and therefore touches every cache line of num, roughly N/16 misses per thread, while in bar() each thread touches only the lines of its own contiguous N/NUM_THREADS block, roughly N/64 misses per thread.
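
If you want to see the same ratio without threads in the picture, here is a minimal single-threaded sketch (my own illustration, not part of the gist) that can be fed to cachegrind the same way. Both loops read the same number of elements; the strided one should show roughly NUM_THREADS times as many D1 read misses on its sum += num[i] line, because it pulls in 16 ints per miss but uses only 4 of them:

#include <stdio.h>
#include <stdlib.h>

#define N           (1 << 24)
#define NUM_THREADS 4                     /* reused here only as the stride width */

int num[N];

/* strided pass: reads N/NUM_THREADS elements spread over every cache line of num,
   like a single foo() thread */
long strided(void)
{
    long sum = 0;
    for (int i = 0; i < N; i += NUM_THREADS)
        sum += num[i];
    return sum;
}

/* blocked pass: reads N/NUM_THREADS consecutive elements, like a single bar() thread */
long blocked(void)
{
    long sum = 0;
    for (int i = 0; i < N / NUM_THREADS; ++i)
        sum += num[i];
    return sum;
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        num[i] = rand() % 2;

    printf("%ld %ld\n", strided(), blocked());   /* print so the loops are not optimized away */
    return 0;
}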

Zifei Tong

The reason sequential accesses are so much faster is not the cache structure per se, but rather HW prefetching (which is related, but not the same thing). There are several "streaming" prefetchers that can recognize a stream or a stride-based access pattern, and run ahead, prefetching your data for you.

Some examples (Intel CPUs, but similar principles are commonly used on other CPUs as well): http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

The valgrind profiling suggested here demonstrates this; I would also suggest looking at the L2/L3 levels - on large data sets the useful prefetches are more likely to reside there (rule of thumb: the farther away from the core you are, the more time and storage you have available for aggressive prefetching).
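
One way to separate the prefetching effect from plain cache-line reuse is to touch exactly one int per 64-byte line, first in ascending line order and then in a shuffled line order. Both passes fetch roughly the same number of distinct lines, so most of the timing gap comes from the stream/stride prefetcher following the ascending pattern (a rough sketch of my own, 64-byte lines assumed; part of the gap will also come from TLB effects):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE_INTS 16                /* ints per 64-byte cache line (assumed) */
#define LINES     (1 << 22)         /* ~4M lines => 256 MB working set, well beyond any cache */

static int *data;
static int *order;                  /* the line to visit at each step */

double visit(void)
{
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < LINES; ++k)
        sum += data[(long)order[k] * LINE_INTS];   /* one access per cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("(sum=%ld) ", sum);                     /* keep the loop from being optimized away */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    data  = calloc((size_t)LINES * LINE_INTS, sizeof *data);
    order = malloc(LINES * sizeof *order);

    for (int i = 0; i < LINES; ++i)
        order[i] = i;                              /* ascending line order - prefetcher friendly */
    printf("sequential lines: %fs\n", visit());

    for (int i = LINES - 1; i > 0; --i) {          /* Fisher-Yates shuffle of the visit order */
        int j = rand() % (i + 1);
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
    printf("shuffled lines:   %fs\n", visit());

    free(data);
    free(order);
    return 0;
}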

Leeor