Will _mm512_mask_prefetch_i32gather_ps() prefetch an entire cache line for each element?

Question

The gather prefetch intrinsic _mm512_mask_prefetch_i32gather_ps can be used to prefetch 32-bit floats on Knights Corner.

Since a corresponding intrinsic for doubles does not exist, how should this intrinsic be used to prefetch 64-bit or 128-bit elements?

Does each 4-byte chunk need to be explicitly prefetched, or can we assume that each prefetch of a 32-bit variable will actually prefetch the entire 64-byte cache line that it occupies?

Example:

I want to prefetch 4 doubles at element offsets {1,2,10,12} from base address 0xf0000000.

With 8-byte doubles, this corresponds to addresses {0xf0000008, 0xf0000010, 0xf0000050, 0xf0000060}.

These addresses fall in two 64-byte cache lines starting at {0xf0000000, 0xf0000040}.

Would it be sufficient to call _mm512_mask_prefetch_i32gather_ps with only the base addresses of these two cache lines?
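To make the address arithmetic concrete, here is a minimal sketch in portable C (no MIC intrinsics) that maps the element offsets to the distinct cache lines they touch. The helper name `distinct_lines` and the `CACHE_LINE` constant are my own illustrative choices, not part of any Intel API. It assumes 64-byte lines and a base address aligned so that no double straddles a line boundary.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed cache-line size in bytes */

/*
 * Hypothetical helper: for n doubles at the given element offsets from
 * base, write the distinct cache-line base addresses into out (which must
 * have room for n entries) and return how many distinct lines were found.
 */
static size_t distinct_lines(uint64_t base, const int *offsets, size_t n,
                             uint64_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Byte address of the element: offsets are in units of doubles. */
        uint64_t addr = base + (uint64_t)offsets[i] * sizeof(double);
        /* Round down to the start of the containing cache line. */
        uint64_t line = addr & ~(uint64_t)(CACHE_LINE - 1);

        /* Linear search is fine for the handful of lanes in one gather. */
        size_t j;
        for (j = 0; j < count; ++j) {
            if (out[j] == line)
                break;
        }
        if (j == count)
            out[count++] = line;
    }
    return count;
}
```

For the offsets {1,2,10,12} and base 0xf0000000, this yields the two lines 0xf0000000 and 0xf0000040. Whether a 32-bit gather prefetch aimed at just those two addresses really pulls in each full line is exactly the part I am unsure about.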

I originally posted this question on the Intel MIC forum without success.

You might want to take a look at http://arxiv.org/abs/1401.7494. The authors believe that the `vgatherpf0hintdps` instruction is actually a no-op, and that `vgatherpf0dps` blocks until the load is complete (see section 6.4). If these assertions are correct, the gather prefetch instructions on the current generation of the Intel Phi are essentially useless. — pburka, Jul 08 '14 at 16:06
Thanks for the link. There is a lot of very relevant information in there for me. If they are correct, then it looks like gather prefetching will not help, and scalar prefetching will improve the L1 hit ratio but add a massive overhead for moving the indexes via the stack to scalar registers. — amckinley, Jul 08 '14 at 19:35

