The motivation behind this question is to understand how software memory prefetching affects my program.
I'm building a multi-threaded data partitioner. Each thread sequentially reads a local source array and writes, at random offsets, to another local destination array. Since the contents of the source array won't be used again in the near future, I'd like to use the prefetchnta instruction so that they don't end up polluting the caches. On the write side, each thread has a local write combiner that batches writes and commits them to the local destination array using _mm_stream_si64. The intuition and goal is to give each thread a fixed amount of data cache to work with, never occupied by data it no longer needs.
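To make this concrete, here is a simplified, single-threaded sketch of what each thread would do. The buffer size, the partition function, and helper names such as `wc_flush` are just placeholders for illustration, not my actual code:

```c
#include <immintrin.h>  /* _mm_prefetch, _mm_stream_si64, _mm_sfence */
#include <stdint.h>
#include <stddef.h>

#define WC_BUF_ELEMS 8  /* one 64-byte cache line of int64_t per partition */

/* Per-thread, per-partition write-combining buffer (hypothetical layout). */
typedef struct {
    int64_t  buf[WC_BUF_ELEMS];
    size_t   count;
    int64_t *dst;   /* next write position in the destination array */
} wc_buffer;

/* Flush the buffered elements to the destination with non-temporal stores. */
static void wc_flush(wc_buffer *wc)
{
    for (size_t i = 0; i < wc->count; i++)
        _mm_stream_si64((long long *)(wc->dst + i), wc->buf[i]);
    wc->dst += wc->count;
    wc->count = 0;
}

/* One thread's partitioning pass over its local chunk of the source. */
static void partition_chunk(const int64_t *src, size_t n,
                            wc_buffer *parts, size_t nparts)
{
    for (size_t i = 0; i < n; i++) {
        /* Hint that a line a few iterations ahead should be fetched
           with the non-temporal hint, so it is not kept in the cache. */
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);

        size_t p = (size_t)src[i] % nparts;   /* toy partition function */
        wc_buffer *wc = &parts[p];
        wc->buf[wc->count++] = src[i];
        if (wc->count == WC_BUF_ELEMS)
            wc_flush(wc);                     /* one full line streamed out */
    }
    for (size_t p = 0; p < nparts; p++)       /* drain partially filled buffers */
        wc_flush(&parts[p]);
    _mm_sfence();  /* make the non-temporal stores globally visible */
}
```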
Is this design reasonable? I'm not familiar with how the CPU works under the hood, so I can't tell whether the hardware prefetchers will keep pulling data into the caches anyway and invalidate this approach. If I'm just being naive here, what is the right way to achieve this goal?