
According to https://software.intel.com/sites/landingpage/IntrinsicsGuide

prefetcht0, prefetcht1, prefetcht2 and prefetchnta

Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.

I'm sure a "line of data" is obvious to someone familiar with the context, but to me it's a mystery. If I provide a pointer to some data to prefetch, how much will actually be fetched? 4 B? 64 B? 1 KB?

If I intend to read 32B from that address later and it prefetches only 16B, should I prefetch multiple times with offsets?
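
For instance, would something like this hypothetical pair of calls be needed to cover a 32 B read (using the `_mm_prefetch` intrinsic; `p` and the 16 B offset are just my example)?

```c
#include <xmmintrin.h>

/* Hypothetical: if one prefetch only covered 16 bytes, would I need
   two of them, offset like this, to cover a 32-byte read at p? */
static void prefetch_32(const void *p)
{
    _mm_prefetch((const char *)p,      _MM_HINT_T0);
    _mm_prefetch((const char *)p + 16, _MM_HINT_T0);
}
```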

user81993
  • you have what I call a fetch line and then a cache line, and they are two different things. Some processors (pipelined ones in particular) will fetch some chunk like 32 or 64 bytes in one fetch, since the busses are more complicated, with multiple handshakes to move the data, so larger, aligned transfers are more efficient and let you stuff the pipe better. A cache line refers to a similar transfer size, between the cache and the main/slow memory, and is often larger. But there is no rule on the size; it is up to each implementation. – old_timer May 30 '17 at 21:57
  • Based on some web searches about this, prefetching offers little or no gain on more recent processors, as their hardware prefetchers do a good job, unless you can prefetch somewhat in advance of actually needing the data; even then, a loop can eventually catch up to memory bandwidth limits. Prefetching was more useful back in the days of the Pentium 4. There's a prior thread at SO about [prefetching](https://stackoverflow.com/questions/19472036/does-software-prefetching-allocate-a-line-fill-buffer-lfb). – rcgldr Jun 01 '17 at 02:03
  • @rcgldr I'd say it still has quite an impact. I was optimizing an algorithm that is heavily bottlenecked by random I/O; I tried solving multiple instances of the algorithm within a single thread in addition to multithreading, but that ended up performing basically the same. Adding prefetch to the mix gave me a nice ~30% performance bump, though. This was on a Sandy Bridge processor, in ml64. – user81993 Jun 02 '17 at 22:35
  • @user81993 - Your experience differs from what the others mentioned about this. In this [500+ line assembly code](https://github.com/01org/isa-l/blob/master/crc/crc16_t10dif_01.asm) for highly optimized generation of CRC16 (or CRC32, only the constants change), which reads 128 bytes at a time, no prefetch instructions are used. If your program was random-I/O bound, how did a memory prefetch help? Was the prefetch done on buffers before each read? – rcgldr Jun 02 '17 at 23:45
  • @rcgldr Since the separate instances within a thread had nothing to do with each other, I could calculate the required data address for each of them for the next loop iteration, issue the prefetch for all of them, and then continue solving them one at a time. My theory is that the further down the line it got, the more likely the required data was to already be in the cache, so it didn't have to wait for those loads. I got the best results with 8 instances per thread; beyond that I seemed to hit another wall of some sort, since additional instances didn't improve the speed. – user81993 Jun 03 '17 at 16:50
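
A minimal sketch of the interleaving pattern user81993 describes in the last comment (not their actual code: the `instance` struct, the `advance` callback, and the choice of 8 instances are stand-ins for illustration):

```c
#include <stddef.h>
#include <xmmintrin.h>

#define N_INSTANCES 8   /* 8 per thread was the reported sweet spot */

/* Stand-in per-instance state: table and next model whatever random
   addresses the real algorithm chases. */
struct instance {
    const char *table;  /* big table read at random offsets */
    size_t      next;   /* offset the next iteration will need */
};

/* One iteration of the interleaving: issue a prefetch for every
   instance's next address first, then advance each instance while
   those loads are in flight. advance() stands in for the real
   per-instance step and is assumed to update inst->next. */
static void run_round(struct instance inst[N_INSTANCES],
                      void (*advance)(struct instance *))
{
    for (int i = 0; i < N_INSTANCES; i++)
        _mm_prefetch(inst[i].table + inst[i].next, _MM_HINT_T0);

    for (int i = 0; i < N_INSTANCES; i++)
        advance(&inst[i]);  /* data is likely cached by now */
}
```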

1 Answer


This refers to a cache line, which is 64 bytes on contemporary CPU architectures.
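
If you'd rather verify the line size on your machine than assume 64, you can query it at runtime. A minimal sketch for Linux/glibc (`_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension, not standard POSIX):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc extension: L1 data cache line size in bytes (commonly 64) */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1d cache line: %ld bytes\n", line);
    return 0;
}
```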

Prefetching more often than necessary should not incur a penalty, as the CPU checks whether the datum is already in the cache before fetching it again. However, prefetching data you won't need in the near future can hurt performance: it evicts other data from the cache and occupies load ports that could be servicing useful loads.
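
So for the 32 B example in the question: a single prefetch covers the whole read as long as it doesn't straddle a 64-byte boundary; for larger ranges, issue one prefetch per cache line. A minimal sketch of prefetching ahead in a loop (the 64-byte line size and the 8-line lookahead are assumptions to tune, not rules):

```c
#include <stddef.h>
#include <xmmintrin.h>

#define LINE     64  /* assumed cache line size */
#define DISTANCE  8  /* lines of lookahead; a tuning knob */

/* Sum a buffer while prefetching ahead, one prefetch per cache line.
   x86 prefetch instructions never fault, so running a few lines past
   the end of buf is safe, just wasted work. */
long sum_with_prefetch(const char *buf, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE == 0)
            _mm_prefetch(buf + i + DISTANCE * LINE, _MM_HINT_T0);
        s += buf[i];
    }
    return s;
}
```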

When in doubt, benchmark.

fuz