
In my application, at one point I need to perform calculations on a large contiguous block of data in memory (hundreds of MBs). My idea was to keep prefetching the part of the block my program will touch next, so that by the time I perform calculations on that portion, the data is already in the cache.

Can someone give me a simple example of how to achieve this with gcc? I have read about _mm_prefetch somewhere, but don't know how to use it properly. Also note that I have a multicore system, but each core will be working on a different region of memory in parallel.

pythonic
  • If the memory access is sequential, the hardware prefetcher will already do it for you, so you probably won't get much improvement from manual prefetching. – Mysticial Apr 25 '12 at 20:42
  • See this question for an example of where prefetching actually helps: http://stackoverflow.com/questions/7327994/prefetching-examples – Mysticial Apr 25 '12 at 20:43
  • You mean the hardware prefetcher somehow recognizes that I'm utilizing contiguous areas of memory and brings those portions into the cache? – pythonic Apr 25 '12 at 20:43
  • Correct, the hardware prefetcher is capable of recognizing basic access patterns. – Mysticial Apr 25 '12 at 20:44

2 Answers


gcc provides builtin functions as an interface to low-level instructions; the one for your case is __builtin_prefetch. But you should only see a measurable difference when using it in cases where the access pattern is not easy to predict automatically.
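
A minimal sketch of how this might look, assuming a simple linear pass over the block; the do_work function and the prefetch distance are illustrative guesses to be tuned, not a definitive implementation:

    #include <stddef.h>

    /* Hypothetical per-element computation standing in for the real work. */
    static double do_work(double x) { return x * x + 1.0; }

    double process(const double *data, size_t n)
    {
        /* Prefetch distance in elements (assumed): far enough ahead that
           the data arrives before it is needed, not so far that it gets
           evicted first. 64 doubles = 512 bytes = 8 cache lines on
           typical x86. */
        const size_t dist = 64;
        double sum = 0.0;

        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                /* Arguments: address, rw (0 = read), locality
                   (3 = keep in all cache levels). */
                __builtin_prefetch(&data[i + dist], 0, 3);
            sum += do_work(data[i]);
        }
        return sum;
    }

Issuing the prefetch once per cache line instead of once per element would avoid redundant requests, but redundant prefetches to a line already in flight are usually cheap.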

Jens Gustedt

Modern CPUs have pretty good automatic prefetching, and you may well find that you do more harm than good if you try to initiate software prefetching. There is most likely a lot more "low-hanging fruit" that you can focus on for optimisation if you find that you actually have a performance problem. Prefetching tends to be one of the last things you try, when you're desperate for a few more percent of throughput.

Paul R
  • +1 I've attempted prefetching on at least 10 different occasions. Only once did I even manage to get a noticeable speedup (the one I linked in the comments). – Mysticial Apr 25 '12 at 20:47
  • Agreed - even on older CPUs with less sophisticated automatic prefetching, it was always tough to get any benefit from software prefetch. The main problems are that you typically need to initiate the prefetch a few hundred clock cycles ahead of time, and that you need some spare memory bandwidth to exploit, which is often not the case in high-performance code. – Paul R Apr 25 '12 at 20:51
  • Prefetch is not necessary - until it is necessary. In my current application, the memory access patterns were not spotted by the hardware prefetcher, and unfortunately changing those access patterns to be more prefetcher-friendly was not an option. Hence _mm_prefetch. Throughput went down by ~10%, but we achieved the latency numbers we wanted. It was a very conscious trade-off, made after much profiling via perf and VTune. – quixver Nov 05 '15 at 02:00
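
For reference, the _mm_prefetch intrinsic mentioned in the question is used much the same way as __builtin_prefetch. A minimal sketch, where the cache-line size, prefetch distance, and hint level are assumptions to be tuned per workload:

    #include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
    #include <stddef.h>

    long sum_bytes(const char *buf, size_t n)
    {
        const size_t line = 64;   /* typical x86 cache-line size (assumed) */
        const size_t dist = 512;  /* prefetch ~8 lines ahead; tune empirically */
        long sum = 0;

        for (size_t i = 0; i < n; i++) {
            /* Issue one prefetch per cache line, several lines ahead.
               _MM_HINT_T0 requests the data in all cache levels. */
            if (i % line == 0 && i + dist < n)
                _mm_prefetch(buf + i + dist, _MM_HINT_T0);
            sum += buf[i];
        }
        return sum;
    }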