7

I am writing a program to parse a file. It consists of a main loop that reads the input character by character and processes each one. Here is the main loop:

char c;
char * ptr;

for (size_t i = 0; i < size; ++i)
{
    ptr = static_cast<char*>(sentenceMap) + i;
    c = *ptr;

    __builtin_prefetch(ptr + i + 1);

    // some processing of ptr and c
}

As you can see, I added a __builtin_prefetch call, hoping to pull the next iteration's data into the cache. I tried different arguments: ptr + i + 1, ptr + i + 2, ptr + i + 10, but nothing seems to change.

To measure performance, I use Valgrind's cachegrind tool, which reports the number of cache misses. On the line c = *ptr, cachegrind records 632,378 DLmr (last-level data cache read misses) when __builtin_prefetch is not used. What's odd, though, is that this number does not change regardless of the argument I pass to __builtin_prefetch.

Any explanation to that?

Mysticial
qdii

3 Answers

12

That's because the hardware is years ahead of you. :)

There are hardware prefetchers that are designed to recognize simple patterns and do the prefetching for you. In this case, you have a simple sequential access pattern, which is more than trivial for the hardware prefetcher.

Manual prefetching only comes in handy when you have access patterns that the hardware cannot predict.

Here's one such example: Prefetching Examples?
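To illustrate the point about unpredictable patterns, here is a minimal sketch (not from the linked example; the `Node` type and `sum_list` function are illustrative): pointer chasing through a linked list, where the hardware cannot guess where `next` points, so a manual hint one node ahead can pay off.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    long value;
    Node* next;
};

// Walk a linked list, prefetching one node ahead of the one we are
// about to visit; unlike a sequential scan, the hardware prefetcher
// cannot predict where `next` points.
long sum_list(const Node* head)
{
    long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (n->next != nullptr)
            __builtin_prefetch(n->next->next);  // hint only; a null address is harmless
        total += n->value;
    }
    return total;
}
```

Whether this actually wins depends on node size and how much work each iteration does; it is a sketch of the technique, not a guaranteed speedup.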

Mysticial
  • Well that’s good news :) I was looking at the disassembled code generated by GCC, and it seems to just completely ignore the prefetch instruction. I guess you are right about that. – qdii Sep 19 '12 at 04:40
  • 4
    GCC actually removed the prefetch that you put in? I'm impressed! – Mysticial Sep 19 '12 at 04:42
  • Actually, no, let me take that back. I found the prefetch ASM instruction hidden after some scrolling down the code. Optimized code is messy :) – qdii Sep 19 '12 at 04:45
  • Side question, is there something I can do to prevent those cache misses? – qdii Sep 19 '12 at 04:54
  • That would depend on the calling code. If the data isn't in the cache, then those misses are unavoidable anyway. In any case, the definition of a cache-miss gets blurry at this point. I don't know whether hardware-counters will count a successful prefetch as a cache miss. – Mysticial Sep 19 '12 at 04:59
3

First of all, the smallest unit a cache deals with is called a cache line, and a cache line can be, for example, 64 bytes long, but never as small as 1 byte. So when you prefetch, you need to ask well ahead of your current location, and you need to know the cache line size so that you don't ask for addresses located in the same cache line. You also need to avoid calling prefetch too often, as that can evict soon-to-be-used cache lines and cost performance through the extra instructions executed.
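A minimal sketch of the cache-line point above, assuming a 64-byte line (the real size varies by CPU; `kCacheLine`, `count_byte`, and the look-ahead distance are all illustrative):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size; query the CPU for the real value

// Scan a buffer counting occurrences of `target`, issuing one prefetch
// per cache line (a few lines ahead) rather than one per byte.
std::size_t count_byte(const char* buf, std::size_t size, char target)
{
    std::size_t hits = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (i % kCacheLine == 0)
            __builtin_prefetch(buf + i + 4 * kCacheLine);  // hint; past-the-end is harmless
        if (buf[i] == target)
            ++hits;
    }
    return hits;
}
```

One prefetch per line is the most the hint can usefully do; issuing one per byte would just burn instructions on addresses already covered by the same line.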

Modern architectures also have hardware prefetchers, which can prefetch data in advance according to your access pattern. Most of the time these give data access times as good as your simple manual prefetches. Nowadays software prefetching only helps if you can find an obviously good place to put it, not by sprinkling it randomly through the code: for example, before starting to process a piece of data. It cannot help if you call prefetch and then access the data immediately; you need to issue it early enough and do other setup work before touching the data.
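The "issue it early, then do other work" idea can be sketched like this (a hypothetical chunked loop; `kChunk` and `sum_chunked` are illustrative, not from the answer): the hint for the next chunk is issued before the current chunk is processed, so the fetch overlaps real work instead of racing the very next access.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum an array chunk by chunk: hint the next chunk first, then do the
// real work on the current one, giving the fetch time to complete.
long sum_chunked(const long* data, std::size_t n)
{
    constexpr std::size_t kChunk = 64;  // elements per chunk; illustrative
    long total = 0;
    for (std::size_t base = 0; base < n; base += kChunk) {
        if (base + kChunk < n)
            __builtin_prefetch(data + base + kChunk);  // next chunk, issued early
        const std::size_t end = (base + kChunk < n) ? base + kChunk : n;
        for (std::size_t i = base; i < end; ++i)
            total += data[i];
    }
    return total;
}
```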

I suggest anyone interested in this topic read The Software Optimization Cookbook. I generally deal with the ARM architecture, but I found this book invaluable. There are also some excerpts related to this question available online; see #1 and #2.

auselen
1

The right answer is: prefetching cannot change the number of cache misses; it just forces them to occur earlier :)

Bulat