
I just wrote a program to test the impact of the CPU cache on performance.

void* foo(void* ptr)
{
    int* r = (int*) ptr;   /* r points into the per-thread result array s */

    /* interleaved scan: thread k reads num[k], num[k+NUM_THREADS], num[k+2*NUM_THREADS], ... */
    for (int i = (r - s); i < N; i += NUM_THREADS)
        *r += num[i];
    return NULL;
}

void* bar(void* ptr)
{
    int* r = (int*) ptr;
    int idx = r - s;                 /* thread index, recovered from the result slot */
    int block = N/NUM_THREADS;
    int start = idx * block, end = start + block;

    /* block scan: thread idx reads its own contiguous N/NUM_THREADS chunk of num */
    for (int i = start; i < end; ++i)
        *r += num[i];
    return NULL;
}

Basically, foo() does an interleaved (strided) scan of the array, while bar() scans it block by block.
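
Both functions rely on globals defined in the gist (num, s, N, NUM_THREADS). For context, here is a minimal sketch of what the surrounding scaffolding might look like; the array size, the initialization, and the run() helper are my assumptions rather than the exact gist code, and together with foo() and bar() above it compiles as a complete program:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS 4
#define N (1 << 26)            /* assumed array size; the gist may use a different value */

int num[N];                    /* shared data array scanned by both functions */
int s[NUM_THREADS];            /* one accumulator per thread; each thread receives &s[i] */

/* foo() and bar() as shown above */

double run(void* (*fn)(void*))
{
    pthread_t tid[NUM_THREADS];
    struct timespec t0, t1;

    for (int i = 0; i < NUM_THREADS; ++i)
        s[i] = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&tid[i], NULL, fn, &s[i]);   /* pass &s[i]; foo/bar recover the index as r - s */
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        num[i] = rand() % 2;                        /* randomized data, as mentioned in the comments */

    printf("%fs\n", run(foo));
    printf("%fs\n", run(bar));
    return 0;
}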

The test results indicate that bar() is much faster:

gcc ping-pong.c -std=gnu99 -lpthread -O2  ; ./a.out
1.077037s
0.395525s

So how should I interpret this result?

The full source code is at: https://gist.github.com/4617935

Update: all if-statements removed

Zifei Tong
  • That if-statement looks eerily similar to that of the branch-predictor question. Is it even necessary to have it? – Mysticial Jan 24 '13 at 04:36
  • You may want to look at http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – Rohan Jan 24 '13 at 04:36
  • @Rohan and @Mysticial, both `foo()` and `bar()` operate on the same randomized array. – Zifei Tong Jan 24 '13 at 04:38
  • Sequential memory access is almost always going to out-perform non-sequential access. I'm not sure what else to say. `foo()` is non-sequential, and `bar()` is sequential. I'm not sure what the purpose of the if-statement is. But if it randomly enters with a near 50% probability, then that could add "noise" to the results from branch mispredictions. – Mysticial Jan 24 '13 at 04:46
  • I just want to do something like the classic problem mentioned by @Rohan, to count the number of elements larger than a threshold in an array. – Zifei Tong Jan 24 '13 at 04:51
  • Oh ic. So it's an actual task and not a completely artificial benchmark. – Mysticial Jan 24 '13 at 04:53
  • Woah, that edit changes everything... It just made this question a lot more interesting. – Mysticial Jan 24 '13 at 04:56
  • @Mysticial, plz never mind the edit, I did something wrong. Really sorry for that. – Zifei Tong Jan 24 '13 at 05:01

2 Answers


It turns out not to be mysterious at all.

I used valgrind (cachegrind) to profile the cache misses; here is the result:

$ valgrind --tool=cachegrind --cachegrind-out-file=profile ./a.out
....
$ cg_annotate profile --auto=yes --show=D1mr --context=1
....
-- line 63 ----------------------------------------
    .  void* foo(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     .  
     0      for (int i = (r - s); i < N; i += NUM_THREADS)
16,388          *r += num[i];
     0      return NULL;
     0  }
     .  
-- line 71 ----------------------------------------

-- line 72 ----------------------------------------
     .  void* bar(void* ptr)
     0  {
     0      int* r = (int*) ptr;
     0      int idx = r - s;
     0      int block = N/NUM_THREADS;
     0      int start = idx * block, end = start + block;
     .  
     0      for (int i = start; i < end; ++i)
 4,098          *r += num[i];
     0      return NULL;
     0  }

As you can see, there are 4 times more L1 data-cache read misses in foo() than in bar(), and 4 is exactly NUM_THREADS.

So, as @Mysticial commented:

Sequential memory access is almost always going to out-perform non-sequential access.

More non-sequential memory accesses mean more cache misses. The factor of NUM_THREADS also has a simple explanation (assuming 64-byte cache lines and 4-byte ints): in foo() every thread strides across the whole array and therefore touches every cache line of num, roughly N/16 misses per thread, while in bar() each thread touches only the lines of its own contiguous N/NUM_THREADS block, roughly N/64 misses per thread.
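
If you want to see the same ratio without threads in the picture, here is a minimal single-threaded sketch (my own illustration, not part of the gist) that can be fed to cachegrind the same way. Both loops read the same number of elements; the strided one should show roughly NUM_THREADS times as many D1 read misses on its sum += num[i] line, because it pulls in 16 ints per miss but uses only 4 of them:

#include <stdio.h>
#include <stdlib.h>

#define N           (1 << 24)
#define NUM_THREADS 4                     /* reused here only as the stride width */

int num[N];

/* strided pass: reads N/NUM_THREADS elements spread over every cache line of num,
   like a single foo() thread */
long strided(void)
{
    long sum = 0;
    for (int i = 0; i < N; i += NUM_THREADS)
        sum += num[i];
    return sum;
}

/* blocked pass: reads N/NUM_THREADS consecutive elements, like a single bar() thread */
long blocked(void)
{
    long sum = 0;
    for (int i = 0; i < N / NUM_THREADS; ++i)
        sum += num[i];
    return sum;
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        num[i] = rand() % 2;

    printf("%ld %ld\n", strided(), blocked());   /* print so the loops are not optimized away */
    return 0;
}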

Zifei Tong

The reason sequential accesses are so much faster is not the cache structure per se, but rather HW prefetching (which is related, but not the same thing). There are several "streaming" prefetchers that can recognize a stream or a stride-based access pattern, and run ahead, prefetching your data for you.

Some examples (Intel CPUs, but similar principles are commonly used on other CPUs as well): http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

The valgrind profiling suggested here demonstrates this; I would also suggest looking at the L2/L3 levels - on large data sets the useful prefetches are more likely to reside there (rule of thumb: the farther away from the core you are, the more time and storage you have available for aggressive prefetching).
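
One way to separate the prefetching effect from plain cache-line reuse is to touch exactly one int per 64-byte line, first in ascending line order and then in a shuffled line order. Both passes fetch roughly the same number of distinct lines, so most of the timing gap comes from the stream/stride prefetcher following the ascending pattern (a rough sketch of my own, 64-byte lines assumed; part of the gap will also come from TLB effects):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE_INTS 16                /* ints per 64-byte cache line (assumed) */
#define LINES     (1 << 22)         /* ~4M lines => 256 MB working set, well beyond any cache */

static int *data;
static int *order;                  /* the line to visit at each step */

double visit(void)
{
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < LINES; ++k)
        sum += data[(long)order[k] * LINE_INTS];   /* one access per cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("(sum=%ld) ", sum);                     /* keep the loop from being optimized away */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    data  = calloc((size_t)LINES * LINE_INTS, sizeof *data);
    order = malloc(LINES * sizeof *order);

    for (int i = 0; i < LINES; ++i)
        order[i] = i;                              /* ascending line order - prefetcher friendly */
    printf("sequential lines: %fs\n", visit());

    for (int i = LINES - 1; i > 0; --i) {          /* Fisher-Yates shuffle of the visit order */
        int j = rand() % (i + 1);
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
    printf("shuffled lines:   %fs\n", visit());

    free(data);
    free(order);
    return 0;
}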

Leeor