The Intel compiler generates the following prefetch instruction inside a loop that accesses an array through the
pointer a_ptr:
400e93: 62 d1 78 08 18 4c 24 vprefetch0 [r12+0x80]
If I manually change this (by hex-editing the executable) to a non-temporal prefetch:
400e93: 62 d1 78 08 18 44 24 vprefetchnta [r12+0x80]
the loop runs almost 1.5 times faster (!!!). However, I would prefer the compiler to generate non-temporal prefetching for me. I thought that
#pragma prefetch a_ptr:_MM_HINT_NTA
before the loop would do the trick, but it does not: the compiler generates exactly the same instructions as without the pragma. Why does icpc
ignore this pragma? How can I force it to generate non-temporal prefetches?
The optimization report does not say anything useful, as far as I can see:
LOOP BEGIN at test-mic.cpp(56,5)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed ANTI dependence between b_ptr line 64 and b_ptr line 65
remark #15346: vector dependence: assumed FLOW dependence between b_ptr line 65 and b_ptr line 64
remark #25018: Total number of lines prefetched=2
remark #25019: Number of spatial prefetches=2, dist=29
remark #25021: Number of initial-value prefetches=2
remark #25139: Using second-level distance 2 for prefetching spatial memory reference [ test-mic.cpp(61,50) ]
remark #25015: Estimate of max trip count of loop=1048576
LOOP END
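For comparison, one possible fallback if the pragma cannot be made to work is to issue the prefetch by hand with the _mm_prefetch intrinsic and the _MM_HINT_NTA hint. The following is only a minimal sketch: the function name sum_with_nta_prefetch, the float element type, the summation body and the distance of 32 elements are all hypothetical and are not taken from test-mic.cpp. It is also an assumption, not verified here, that icpc lowers this call to vprefetchnta on the target in question.

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
#endif

// Hypothetical prefetch distance in elements; it would have to be tuned.
constexpr std::size_t kPrefetchDistance = 32;

// Requests a non-temporal prefetch of the cache line containing p.
inline void prefetch_nta(const void* p) {
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_NTA);
#else
    // Fallback so the sketch also compiles on non-x86 GCC/Clang targets;
    // locality 0 asks for data with no temporal locality.
    __builtin_prefetch(p, 0, 0);
#endif
}

// Illustrative loop body (a plain sum), standing in for the real loop.
float sum_with_nta_prefetch(const float* a_ptr, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // Only prefetch addresses that lie inside the array.
        if (i + kPrefetchDistance < n) {
            prefetch_nta(a_ptr + i + kPrefetchDistance);
        }
        sum += a_ptr[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result of the loop is the same with or without it. This sketch issues one prefetch per element rather than one per cache line, and the compiler may still emit its own prefetches for the same reference, so it does not replace a working pragma.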