
The motivation of this question is to understand how software memory prefetching affects my program.

I'm building a multi-threaded data partitioner. Each thread sequentially reads over a local source array and writes randomly into another local destination array. Since the contents of the source array won't be used in the near future, I'd like to use the `prefetchnta` instruction to keep them from accumulating in the caches. On the other side, each thread has a local write combiner that combines writes and commits them to the local destination array using `_mm_stream_si64`. The intuition and the goal is to give each thread a fixed amount of data cache to work with, never occupied by data it no longer needs.

Is this design reasonable? I'm not familiar with how the CPU works, so I can't be sure whether the hardware prefetchers would override this strategy and invalidate the whole approach. If this is just me being naive, what's the right way to achieve this goal?

Amos
  • It is architecture specific (perhaps even micro-processor specific, i.e. different on AMD and on Intel), compiler specific, and I would leave such very low-level micro-optimizations to the compiler. See also [this](https://stackoverflow.com/a/29203501/841108). So I don't think the design is reasonable. – Basile Starynkevitch Jan 04 '19 at 07:36
  • OK, but this question is the opposite of what prefetching usually tries to optimize: I'd like the array-reading streams to affect the cache as little as possible. At least there's no prefetching distance that needs to be empirically guessed :) – Amos Jan 04 '19 at 07:45
  • 1
    Semi-related: [Non-temporal loads and the hardware prefetcher, do they work together?](https://stackoverflow.com/q/32103968). But that's about `movntdqa`, which acts like `movdqa` when used on WB memory. – Peter Cordes Jan 04 '19 at 08:57
  • You could use model specific registers to disable prefetchers, e.g., with [`likwid-features`](https://github.com/RRZE-HPC/likwid/wiki/likwid-features). – como Feb 01 '19 at 12:19

0 Answers