
Using gcc 4.4.5 (yeah... I know it's old) on x86_64. Limited to SSE2 (or earlier) instructions for compatibility reasons.

I have what I think should be a textbook case for gaining big benefits from prefetching. I have an array ("A") of 32-bit elements, which are not (and cannot be) in sequential order. These 32-bit elements are indices into a larger data array ("D") of __m128i data. For each element of "A", I need to fetch the __m128i data from the appropriate location in "D", perform an operation on it, and store it back into the same location in "D". Each "entry" in D is actually SOME_CONST __m128i's big, so if the value in A is 1, the entry starts at D[1 * SOME_CONST].

Since consecutive elements in "A" will almost never point to sequential locations in "D", I tend to think that the hardware prefetcher is going to struggle or fail to accomplish anything useful.

However, I can predict which locations I will be accessing next very easily, simply by looking ahead in "A". Enough verbiage... here's some code. The operation I'm performing on the data is to take the lower 64-bits of the __m128i and clone it to the upper 64-bits of the same. First the basic loop, no frills...

// SOME_CONST is either 3 or 4, but this "operation" only needs to happen for 3

for ( i=0; i<arraySize; ++i )
{
  register __m128i *dPtr = D + (A[i] * SOME_CONST);
  dPtr[0] = _mm_shuffle_epi32( dPtr[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[1] = _mm_shuffle_epi32( dPtr[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[2] = _mm_shuffle_epi32( dPtr[2], 0 | (1<<2) | (0<<4) | (1<<6) );

  // The immediate operand selects:
  // Bits 0-31   = bits 0-31
  // Bits 32-63  = bits 32-63
  // Bits 64-95  = bits 0-31
  // Bits 96-127 = bits 32-63

  // If anyone is more clever than me and knows of a better way to do this in SSE2,
  //  bonus points.  ;-)
}

I've tried a number of different ways to sprinkle prefetch intrinsics in there, and none have resulted in any kind of speedup yet. For example, I tried unrolling the loop to a stride of 2 or even 4 elements, but it didn't help...

// Assume the "A" array size is appropriately padded so that overruns don't
//  result in SIGSEGV accessing uninitialized memory in D.

register __m128i *dPtr0, *dPtr1, *dPtr2, *dPtr3, *dPtr4, *dPtr5, *dPtr6, *dPtr7;
dPtr4 = D + (A[0] * SOME_CONST);
dPtr5 = D + (A[1] * SOME_CONST);
dPtr6 = D + (A[2] * SOME_CONST);
dPtr7 = D + (A[3] * SOME_CONST);

for ( i=0; i<arraySize; i+=4 )
{
  dPtr0 = dPtr4;
  dPtr1 = dPtr5;
  dPtr2 = dPtr6;
  dPtr3 = dPtr7;

  dPtr4 = D + (A[i+4] * SOME_CONST);
  _mm_prefetch( dPtr4, _MM_HINT_NTA );
  _mm_prefetch( dPtr4+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr5 = D + (A[i+5] * SOME_CONST);
  _mm_prefetch( dPtr5, _MM_HINT_NTA );
  _mm_prefetch( dPtr5+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr6 = D + (A[i+6] * SOME_CONST);
  _mm_prefetch( dPtr6, _MM_HINT_NTA );
  _mm_prefetch( dPtr6+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr7 = D + (A[i+7] * SOME_CONST);
  _mm_prefetch( dPtr7, _MM_HINT_NTA );
  _mm_prefetch( dPtr7+1, _MM_HINT_NTA ); // Is it needed?  Tried both ways

  dPtr0[0] = _mm_shuffle_epi32( dPtr0[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr0[1] = _mm_shuffle_epi32( dPtr0[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr0[2] = _mm_shuffle_epi32( dPtr0[2], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr1[0] = _mm_shuffle_epi32( dPtr1[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr1[1] = _mm_shuffle_epi32( dPtr1[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr1[2] = _mm_shuffle_epi32( dPtr1[2], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr2[0] = _mm_shuffle_epi32( dPtr2[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr2[1] = _mm_shuffle_epi32( dPtr2[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr2[2] = _mm_shuffle_epi32( dPtr2[2], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr3[0] = _mm_shuffle_epi32( dPtr3[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr3[1] = _mm_shuffle_epi32( dPtr3[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr3[2] = _mm_shuffle_epi32( dPtr3[2], 0 | (1<<2) | (0<<4) | (1<<6) );
}

That is the 4 element version, but I also tried with only 2 in case that was too much data to be sloshing around. Also I tried using _MM_HINT_NTA and _MM_HINT_T0. No appreciable difference somehow.

I also tried a simpler variant, which just attempts to put as much space as seemed reasonable between the prefetch and when it would be used:

#define PREFETCH_DISTANCE 10
// trying 5 overnight, will see results tomorrow...

for ( i=0; i<arraySize; ++i )
{
  register __m128i *dPtrFuture, *dPtr;
  dPtrFuture = D + (A[i + PREFETCH_DISTANCE] * SOME_CONST);
  _mm_prefetch( dPtrFuture, _MM_HINT_NTA );      // tried _MM_HINT_T0 too
  _mm_prefetch( dPtrFuture + 1, _MM_HINT_NTA );  // tried _MM_HINT_T0 too

  dPtr = D + (A[i] * SOME_CONST);
  dPtr[0] = _mm_shuffle_epi32( dPtr[0], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[1] = _mm_shuffle_epi32( dPtr[1], 0 | (1<<2) | (0<<4) | (1<<6) );
  dPtr[2] = _mm_shuffle_epi32( dPtr[2], 0 | (1<<2) | (0<<4) | (1<<6) );
}

Initially I expect this code to stall, but once it gets PREFETCH_DISTANCE into the loop, I was hoping it would see a good enough speed boost. Most of these variants cause no more than 0.2 seconds of runtime difference over millions of iterations, which take a total CPU time of 4m30s on this particular machine (which is stone idle other than me). The differences seem to be indistinguishable from noise in the data.

Am I correct in my assumption that prefetching should be able to help me here? If so, what am I doing wrong?

All helpful and interesting thoughts are appreciated.

EDIT:

I created a contrived example that truly randomizes the data in A. I played with the buffer sizes from 64MB up to 6400MB. I found that I get enormous gain out of unrolling the loop and pre-calculating the addresses of the next 4 elements as I'm performing my operation on the current 4. But my runtime tanks by a factor of >10x if I try to prefetch any of the addresses that I pre-calculated. I'm really scratching my head on this one. My standalone contrived code is:

#include <emmintrin.h> // SSE2 intrinsics (_mm_shuffle_epi32); pulls in xmmintrin.h
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

#define QUEUE_ELEMENTS    1048576
#define DATA_ELEMENT_SIZE 4 * sizeof( __m128i )
#define DATA_ELEMENTS     QUEUE_ELEMENTS

#define QUEUE_ITERATIONS  100000
#define LOOP_UNROLL_4
#define LOOP_UNROLL_2

#ifdef LOOP_UNROLL_4
  #define UNROLL_CONST 4
#else
  #ifdef LOOP_UNROLL_2
    #define UNROLL_CONST 2
  #else
    #define UNROLL_CONST 1
  #endif
#endif

int main( void )
{
  unsigned long long randTemp;
  unsigned long i, outerLoop;
  unsigned long *workQueue;
  __m128i *data, *dataOrig;
  clock_t timeStamp;

  workQueue = malloc( QUEUE_ELEMENTS * sizeof( unsigned long ) );

  dataOrig = malloc( (DATA_ELEMENTS * DATA_ELEMENT_SIZE) + 15 );
  // 15 bytes of slack so the pointer below can be rounded up to 16-byte alignment
  if ( (unsigned long long) dataOrig & 0xf )
  {
    data = (__m128i *) (((unsigned long long) dataOrig & ~0xf) + 0x10);
    // force 16-byte (128-bit) alignment
  } else data = dataOrig;

  // Not initializing data, because its contents aren't important.

  for ( i=0; i<QUEUE_ELEMENTS; ++i )
  {
    randTemp = (unsigned long long)rand() *
     (unsigned long long) QUEUE_ELEMENTS / (unsigned long long) RAND_MAX;
    workQueue[i] = (unsigned long) randTemp;
  }

  printf( "Starting work...\n" );
  // Actual work happening below... start counting.
  timeStamp = clock();

  for ( outerLoop = 0; outerLoop < QUEUE_ITERATIONS; ++outerLoop )
  {
    register __m128i *dataPtr0, *dataPtr1, *dataPtr2, *dataPtr3;
    register __m128i *dataPtr4, *dataPtr5, *dataPtr6, *dataPtr7;

    #ifdef LOOP_UNROLL_2
      dataPtr4 = data + (workQueue[0] * DATA_ELEMENT_SIZE);
      dataPtr5 = data + (workQueue[1] * DATA_ELEMENT_SIZE);
    #endif
    #ifdef LOOP_UNROLL_4
      dataPtr6 = data + (workQueue[2] * DATA_ELEMENT_SIZE);
      dataPtr7 = data + (workQueue[3] * DATA_ELEMENT_SIZE);
    #endif

    for ( i=0; i<QUEUE_ELEMENTS; i+=UNROLL_CONST )
    {
      #ifdef LOOP_UNROLL_2
        dataPtr0 = dataPtr4;
        dataPtr4 = data + (workQueue[i+4] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr4, _MM_HINT_T0 );
        dataPtr1 = dataPtr5;
        dataPtr5 = data + (workQueue[i+5] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr5, _MM_HINT_T0 );
      #endif
      #ifdef LOOP_UNROLL_4
        dataPtr2 = dataPtr6;
        dataPtr6 = data + (workQueue[i+6] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr6, _MM_HINT_T0 );
        dataPtr3 = dataPtr7;
        dataPtr7 = data + (workQueue[i+7] * DATA_ELEMENT_SIZE);
        // _mm_prefetch( dataPtr7, _MM_HINT_T0 );
      #endif
      #if !defined( LOOP_UNROLL_2 ) && !defined( LOOP_UNROLL_4 )
        dataPtr0 = data + (workQueue[i] * DATA_ELEMENT_SIZE);
      #endif

      _mm_shuffle_epi32( dataPtr0[0], 0 | (1<<2) | (0<<4) | (1<<6) );
      _mm_shuffle_epi32( dataPtr0[1], 0 | (1<<2) | (0<<4) | (1<<6) );
      _mm_shuffle_epi32( dataPtr0[2], 0 | (1<<2) | (0<<4) | (1<<6) );
      // Per original code, no need to perform operation on dataPtrx[3]

      #ifdef LOOP_UNROLL_2
        _mm_shuffle_epi32( dataPtr1[0], 0 | (1<<2) | (0<<4) | (1<<6) );
        _mm_shuffle_epi32( dataPtr1[1], 0 | (1<<2) | (0<<4) | (1<<6) );
        _mm_shuffle_epi32( dataPtr1[2], 0 | (1<<2) | (0<<4) | (1<<6) );
      #endif
      #ifdef LOOP_UNROLL_4
        _mm_shuffle_epi32( dataPtr2[0], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr2[1], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr2[2], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr3[0], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr3[1], 0 | (1<<2) | (0<<4) | (1<<6) );  
        _mm_shuffle_epi32( dataPtr3[2], 0 | (1<<2) | (0<<4) | (1<<6) );  
      #endif
    }
    if ( (outerLoop % 1000) == 0 ) { putchar( '.' ); fflush( stdout ); }
  }

  timeStamp = clock() - timeStamp;
  printf( "\nRun was %lu seconds.\n", (unsigned long) (timeStamp / CLOCKS_PER_SEC) );

  free( dataOrig );
  free( workQueue );

  return 0;
}
  • My runtime with LOOP_UNROLL_2 is 40 seconds.
  • My runtime with LOOP_UNROLL_4 is 20 seconds.
  • My runtime without any unrolling is 80 seconds.
  • When I enable the prefetches, the runtime is so long that I just kill the process rather than wait for it.

I even wrote an 8x unrolled loop, and it still scaled perfectly to 10 seconds of runtime. I'm amazed it didn't saturate there because I would certainly run out of registers at that point, holding 16 unique pointers. So what can I learn from this? That my inner loop code is so tight that it is greatly overshadowed by the overhead in the loop construct itself? Are there any bugs in this code that I'm not seeing? All builds were with gcc -O2.

Marty
  • In general explicit prefetching won't help you on modern CPUs apart from in a few very specific cases, and even then it's very tricky to get it right. It can even do more harm than good if you get it wrong. Note also that there are a lot of very similar (possibly duplicate) questions and answers about this here on SO already, which you might want to read. See e.g. http://stackoverflow.com/questions/23741246/why-is-prefetch-speedup-not-greater-in-this-example – Paul R Sep 20 '14 at 08:59
  • Understood, Paul. I have searched those questions here, and mostly the answer was "don't do it because the hardware will do it better" (and rightly so in those cases). But I'd like to understand why this particular case can't benefit, given the fairly "random-looking" (yet entirely human-predictable) access that it needs to do into the larger array. I'm certain the hardware can't understand that access pattern on its own. – Marty Sep 20 '14 at 09:08
  • A data transpose is one of the only times I've ever gotten a noticeable speedup from manual prefetch: http://stackoverflow.com/questions/7327994/prefetching-examples – Mysticial Sep 20 '14 at 09:11
  • You assert that the hardware can't understand the access pattern, yet you have strong proof that it does. Do you *really* need confirmation that your assumption was wrong? – Hans Passant Sep 20 '14 at 10:40
  • How large is `arraySize`? I suspect to get any difference from prefetch with your code you need to use very large data sets (several gigabytes). Otherwise the effect of prefetch may be hidden by the cache. – Evgeny Kluev Sep 20 '14 at 15:24
  • Hans, I was seeking confirmation that my methods for prefetching were sound, since I had never done it before. I'm either going about it the wrong way and I'm still getting the same number of cache misses, or I'm doing it right and the hardware can somehow anticipate the double dereference to prefetch the right data. – Marty Sep 20 '14 at 17:11
  • Evgeny, the A array size can vary widely. It can be 1 element or over 990,000 elements. Each entry expands to 3 or 4 (value of SOME_CONST) times 16 bytes. So roughly 0-64MB. It could be that most cases are skewed to the low side, with only a few reaching the max size and you are right. I will try to create a test case and see. – Marty Sep 20 '14 at 17:20
  • Is your array SSE aligned? The second prefetch in most cases will ask the CPU to prefetch the same cache line (64B). This wastes CPU cycles – egur Oct 01 '14 at 21:07
  • @egur : It is aligned to 16 bytes, and each entry is either 48 or 64 bytes (same for all entries, determined during initialization). So in the case with 48 bytes it will be less efficient as you mention. – Marty Oct 01 '14 at 23:13
  • Which type is the CPU? Have you tried disabling the HW prefetchers (through BIOS, if possible), and see if the SW prefetching still loses as much? Some time the SW prefetches can confuse some of the HW ones. Also, it could be that your prefetches thrash too much useful data out of the cache – Leeor Oct 03 '14 at 09:27
  • @Leeor : Unfortunately I don't administer the system, so I can't make invasive changes like that. I'm sure you're right that I'm mishandling something and thrashing the cache, because just having extra do-nothing instructions in the loop wouldn't account for >10x degradation. I just don't understand where I went wrong. If anyone else with ideas could compile the self-contained code at the bottom it might help us get to the bottom of it. For now, in my production code, I'm just going to rely on the hardware. – Marty Oct 04 '14 at 17:42

1 Answer


If your data resides in main memory, don't expect much of a speedup; prefetching from main memory leaves very little room for improvement.

With a 150 ns memory access time, 64-byte cache lines and a 4 GB/s streaming transfer rate (my AMD system; Intel is faster), and using 48 bytes (3 x 128 bits) of each 64-byte cache line read, the system fetches about 320 MB of usable data per second. Prefetching could bring the rate closer to the 4000 MB/s streaming peak, so the total saving available to prefetching is roughly 0.9 seconds for every 320 MB read. At 320 MB/s, 270 seconds (4m30s) would correspond to about 86 GB of memory traffic; the program is almost certainly moving only a small fraction of that, so it spends well under 1% of its runtime actually reading memory. Completely eliminating memory I/O would therefore save about 1% of the runtime at best.

The drawback to too much prefetching is that prefetched data displaces useful things from the really fast but really small memory close to the cpu (level 1 and level 2 caches, kilobytes not megabytes). This may account for the pessimal performance of some of the test runs.

The fact that unrolling the loop sped up the program while prefetching didn't also suggests that the processing itself, not the data movement, is the primary bottleneck.

Andras