The Intel compiler generates the following prefetch instruction inside a loop that accesses an array through the pointer a_ptr:

400e93:       62 d1 78 08 18 4c 24    vprefetch0 [r12+0x80]

If I manually change (by hex-editing the executable) this to non-temporal prefetching:

400e93:       62 d1 78 08 18 44 24    vprefetchnta [r12+0x80]

the loop runs almost 1.5 times faster (!!!). However, I would prefer the compiler to generate non-temporal prefetching for me. I thought that

#pragma prefetch a_ptr:_MM_HINT_NTA

before the loop should do the trick, but it does not; the compiler generates the very same instructions as without the pragma. Why does icpc ignore this pragma? How can I force it to generate non-temporal prefetching?

The optimization report does not say anything useful, as far as I can see:

LOOP BEGIN at test-mic.cpp(56,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed ANTI dependence between b_ptr line 64 and b_ptr line 65
   remark #15346: vector dependence: assumed FLOW dependence between b_ptr line 65 and b_ptr line 64
   remark #25018: Total number of lines prefetched=2
   remark #25019: Number of spatial prefetches=2, dist=29
   remark #25021: Number of initial-value prefetches=2
   remark #25139: Using second-level distance 2 for prefetching spatial memory reference   [ test-mic.cpp(61,50) ]
   remark #25015: Estimate of max trip count of loop=1048576
LOOP END
Daniel Langr

1 Answer

This is a known issue. The best known method (BKM) is to use the explicit values 0, 1, 2, 3 for the hints (t0, t1, t2, nta) in prefetch directives/pragmas, and NOT the _MM_HINT constants.

This is because the _MM_HINT constants in the header files map to different values:

/* constants to use with _mm_prefetch  (extracted from *mmintrin.h) */
#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA    0    /* <-- maps here */
#define _MM_HINT_ENTA   4
#define _MM_HINT_ET0    5
#define _MM_HINT_ET1    6
#define _MM_HINT_ET2    7

In addition, the Intel headers and the gcc headers use different values for these constants, which is also troublesome. So the _MM_HINT constants should be used only with the _mm_prefetch intrinsic, NOT in prefetch directives.
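To illustrate the intended use of the _MM_HINT constants, here is a minimal sketch that passes _MM_HINT_NTA to the _mm_prefetch intrinsic. The function name, the loop and the prefetch distance are hypothetical and are not taken from the code in the question; the preprocessor guard is only there so that the sketch still compiles on non-x86 targets.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_NTA
#define HAVE_MM_PREFETCH 1
#endif

// Hypothetical helper (an assumption for illustration): sums n doubles and
// issues an explicit non-temporal prefetch a fixed distance ahead.
double sum_with_nta_intrinsic(const double* a, std::size_t n) {
    const std::size_t dist = 16;  // prefetch distance in elements; illustrative value
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
#ifdef HAVE_MM_PREFETCH
        // The named constant is appropriate here because the intrinsic and the
        // constant come from the same header.
        if (i + dist < n)
            _mm_prefetch(reinterpret_cast<const char*>(a + i + dist), _MM_HINT_NTA);
#endif
        s += a[i];
    }
    return s;
}
```

The prefetch is only a hint, so the result of the function is the same with or without it; whether it helps performance has to be measured.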

For this example, you should be able to use: #pragma prefetch a_ptr:3

However, that syntax does not currently work because of a compiler defect: the compiler fails to connect the a_ptr load inside the loop with the expression in the prefetch directive. As a temporary workaround, use the following syntax:

#pragma prefetch *:3

Note: The asterisk means the directive applies to ALL memory references inside the loop. In this loop, b_ptr cannot be prefetched by the compiler anyway, since it is not a linear address expression. So here the "*" effectively applies only to a_ptr, and it leads to vprefetchnta (on both KNC and KNL).
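A minimal sketch of where the directive goes, assuming a simple loop with a single linear read stream. The function and loop body are hypothetical and are not the code from the question. The __INTEL_COMPILER guard is an addition so that other compilers do not warn about an unknown pragma.

```cpp
#include <cstddef>

// Hypothetical loop (an assumption for illustration): one linear read
// stream through a_ptr, which is the kind of access the compiler can prefetch.
double sum_stream(const double* a_ptr, std::size_t n) {
    double s = 0.0;
#if defined(__INTEL_COMPILER)
    // Numeric hint 3 requests nta; "*" covers all memory references in the loop.
#pragma prefetch *:3
#endif
    for (std::size_t i = 0; i < n; ++i)
        s += a_ptr[i];
    return s;
}
```

Whether vprefetchnta is actually emitted depends on the compiler version and target, so it is worth checking the disassembly or the optimization report after rebuilding.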

The defect will be fixed in a future release.

K. Davis
  • Unfortunately, still the same result. I even tried `#pragma noprefetch a_ptr` (according to this link: https://software.intel.com/en-us/node/524554) and the compiler still generates prefetch instructions for me. I also tried latest Intel 17.0.0 compiler, same outcome. You can find my entire code here: https://github.com/DanielLangr/ntload/blob/master/test-mic.cpp. – Daniel Langr Nov 04 '16 at 07:55
  • @BeeOnRobe - Yes, the hints are valid on IA-32 w/Intel compiler (see [link] https://software.intel.com/en-us/node/683920). Daniel Langr - Please permit me time to consult w/others on your code and I'll get back to you. – K. Davis Nov 04 '16 at 15:57
  • @DanielLangr - For your code the guidance is to use "*" instead of "a_ptr" in the prefetch directive as follows: #pragma prefetch *:3 Note that "*" means that the directive will apply for "ALL" memory refs inside the loop. In this loop, b_ptr cannot be prefetched by the compiler anyway - since it is not a linear address expression. So the "*" applies only to a_ptr anyway here - and leads to vprefetchnta (on both KNC and KNL). – K. Davis Nov 11 '16 at 19:44
  • Nice, it works, thanks a lot. Any idea why the pragma with `a_ptr` does not work and with `*` does? Anyway, could you please update your answer so I could mark it as accepted? – Daniel Langr Nov 14 '16 at 14:15
  • I edited the answer. Please pardon any mis-steps in editing/posting as I’m new to stackoverflow. Thank you. – K. Davis Nov 16 '16 at 16:47