It appears the general logic for prefetch usage is that a prefetch can be added provided the code is busy with processing until the prefetch completes its operation. But it seems that if too many prefetch instructions are used, they would hurt the performance of the system. I find that we need to first have working code without prefetch instructions, then try various combinations of prefetch instructions in various locations of the code and analyse the results to determine which locations actually improve because of prefetch. Is there any better way to determine the exact locations in which the prefetch instruction should be used?
-
You might want to say which CPU you mean – John Saunders Jun 26 '10 at 06:07
-
I am looking for ARM and MIPS architecture-based CPUs. – Karthik Balaguru Jun 26 '10 at 06:11
-
To be specific - it is the ARM core of the OMAP 5912, and the PIC32 – Karthik Balaguru Jun 26 '10 at 06:59
3 Answers
In the majority of cases prefetch instructions are of little or no benefit, and can even be counter-productive in some cases. Most modern CPUs have an automatic prefetch mechanism which works well enough that adding software prefetch hints achieves little, or even interferes with automatic prefetch, and can actually reduce performance.
In some rare cases, such as when you are streaming large blocks of data on which you are doing very little actual processing, you may manage to hide some latency with software-initiated prefetching, but it's very hard to get it right - you need to start the prefetch several hundred cycles before you are going to be using the data - do it too late and you still get a cache miss, do it too early and your data may get evicted from cache before you are ready to use it. Often this will put the prefetch in some unrelated part of the code, which is bad for modularity and software maintenance. Worse still, if your architecture changes (new CPU, different clock speed, etc), such that DRAM access latency increases or decreases, you may need to move your prefetch instructions to another part of the code to keep them effective.
Anyway, if you feel you really must use prefetch, I recommend #ifdefs around any prefetch instructions so that you can compile your code with and without prefetch and see if it is actually helping (or hindering) performance, e.g.
#ifdef USE_PREFETCH
// prefetch instruction(s)
#endif
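For example, a minimal sketch of how the guard might be combined with the streaming case described above. This assumes GCC/Clang, whose __builtin_prefetch builtin maps to the target's prefetch instruction; the function name, array and 64-element look-ahead distance are invented for illustration and would need tuning against a profiler:

#include <stddef.h>

/* Sums a large array while prefetching a fixed distance ahead.
 * Compiling with and without -DUSE_PREFETCH lets you benchmark
 * whether the hint actually helps on your hardware. */
long sum_with_prefetch(const long *data, size_t n)
{
    const size_t lookahead = 64;   /* hypothetical tuning parameter */
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#ifdef USE_PREFETCH
        if (i + lookahead < n)
            __builtin_prefetch(&data[i + lookahead], 0, 0);  /* read, low temporal locality */
#endif
        total += data[i];
    }
    return total;
}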
In general though, I would recommend leaving software prefetch on the back burner as a last resort micro-optimisation after you've done all the more productive and obvious stuff.

-
True, there are also cases where a lot of prefetching might be bad. That is, suppose data w, x, y, z are prefetched in the order required by the software; it may happen that z evicts y from the cache due to the small cache size, so y might not be available even though it was prefetched :( Thanks for highlighting the problems with prefetch w.r.t. changes in CPU/clock speed, as they impact access latency and hence the prefetch location. Yeah, the problems related to (software) prefetch use are difficult to pin down. But how do you get the right locations for prefetch in an easy way? – Karthik Balaguru Jun 26 '10 at 07:21
-
There is no easy way, in my experience, and in most cases the effort is not justified. You can get much more optimisation "bang per buck" from improving your algorithm and its implementation, paying attention to cache usage and memory access patterns, using SIMD, etc. – Paul R Jun 26 '10 at 07:57
-
@Paul R: I disagree. After profiling cache misses I used prefetch to speed up one of my applications quite a bit: about 5-10%. – Zan Lynx Feb 25 '11 at 22:53
-
@Zan Lynx: did you test with different CPUs and different clock speeds? Older CPUs might benefit from manual prefetch, but with more modern CPUs it's hard to improve upon automatic prefetch. It's also somewhat dependent on CPU speed and memory bandwidth/latency/etc, so an apparent improvement on one configuration may not work well for others. – Paul R Feb 27 '11 at 20:47
-
@Paul R: Both Core2 and Itanium benefit from prefetch. I didn't eliminate fetch delay but it does reduce the number of wait cycles in a tree search. – Zan Lynx Feb 28 '11 at 06:33
-
@Zan Lynx: there may be some cases that benefit, but it's important to remember that you can only ever take advantage of prefetch *if you have memory bandwidth to spare* - in many cases (probably the majority) all that you achieve with a manual prefetch is that you take away bandwidth from some other part of your program. But if it works for you in your particular case then great. – Paul R Feb 28 '11 at 09:03
-
If you're wondering why you got several random upvotes on this even though it's an old question, it's because it got linked to from [this biggie](http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops). :) – Mysticial Dec 22 '11 at 18:13
To even consider prefetching, code performance must already be an issue.
1: Use a code profiler. Trying to use prefetch without a profiler is a waste of time.
2: Whenever you find an instruction in a critical place that is anomalously slow, you have a candidate for a prefetch. Often the actual problem is the memory access on the line before the slow one, rather than the instruction the profiler points at. Work out which memory access is causing the problem (not always easy) and prefetch it.
3: Run your profiler again and see if it made any difference. If it didn't, take it out. On occasion I have sped up loops by >300% this way. It's generally most effective if you have a loop accessing memory in a non-sequential way, as in the sketch below.
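For illustration only, here is the kind of loop where step 2 typically pays off: an indexed gather where the hardware prefetcher cannot predict the addresses. The names and the 16-element distance are assumptions, and only re-running the profiler (step 3) tells you whether to keep the hint:

#include <stddef.h>

/* Non-sequential access through an index table: prefetch the element
 * the loop will need a few iterations from now.  The distance of 16
 * is a guess that must be measured, not assumed. */
double gather_sum(const double *values, const unsigned *index, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&values[index[i + 16]]);
        total += values[index[i]];
    }
    return total;
}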
I disagree completely about it being less useful on modern CPUs; I have found completely the opposite. Though on older CPUs prefetching about 100 instructions ahead was optimal, these days I'd put that number more like 500.

Sure, you have to experiment a bit, but note that you need to fetch a few hundred cycles (100-300) before the data is needed. The L2 cache is big enough that the prefetched data can stay there a while.
This kind of prefetching is very effective in front of a loop (a few hundred cycles ahead, of course), especially if it is the inner loop and the loop is entered thousands of times per second.
Prefetching can also gain a measurable advantage for your own fast linked-list or tree implementation, because the CPU doesn't yet know that the data will be needed soon.
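A rough sketch of that linked-list case (the node layout and function are invented for illustration; with very little work per node, a one-node look-ahead only hides as much latency as one iteration of work, so it helps most when per-node processing is substantial):

#include <stddef.h>

struct node {
    struct node *next;
    int payload;
};

/* The hardware prefetcher cannot follow pointer chains, so hint the
 * next node while the current one is still being processed. */
int list_sum(const struct node *head)
{
    int total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            __builtin_prefetch(n->next);
        total += n->payload;   /* work that overlaps with the prefetch */
    }
    return total;
}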
But remember that prefetch instructions eat some decoder/queue bandwidth, so overusing them hurts performance for that reason alone.

-
Yes, I found prefetch very useful in a B+tree search. The binary search within the block takes enough time in compares and branch misses that the data is ready in L2 - not quite when needed, but earlier than without prefetch. – Zan Lynx Feb 25 '11 at 22:55