5

Is backward memory prefetch as fast as forward memory prefetch in a Xeon CPU (E5-2603)?

I want to implement an algorithm that requires both a forward loop and a backward loop over data.

Since each iteration requires the result of the previous iteration, I can't reverse the order of the loops.
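
For example, a loop-carried dependence in each direction looks roughly like this (just an illustration, not my actual algorithm):

/* Illustration only: a forward recurrence followed by a backward recurrence.
   Each iteration reads the result of the previous one, so neither loop can
   be run in the opposite direction. */
void forward_backward( double *x, long n, double a )
{
   for ( long i = 1; i < n; i++ )          /* forward: x[i] depends on x[i-1] */
      x[ i ] = a * x[ i ] + ( 1.0 - a ) * x[ i - 1 ];

   for ( long i = n - 2; i >= 0; i-- )     /* backward: x[i] depends on x[i+1] */
      x[ i ] = a * x[ i ] + ( 1.0 - a ) * x[ i + 1 ];
}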

Francisco
R zu
  • sandy bridge architecture: p.45 of https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf – R zu Aug 20 '18 at 18:13
  • data prefetch: p.59 (p.2-29) – R zu Aug 20 '18 at 18:16
  • p.60: L1 data cache can only prefetch forward. L2 and last level cache can prefetch forward/backward. So, don't expect L1 cache performance for the backward loop ... unless I use software prefetch. – R zu Aug 20 '18 at 18:20
  • On Haswell/Skylake, I got good results for having one loop go forwards, then the next loop over the same data go backwards, for sizes only a bit too large to fit in cache. Changing direction means you're coming back over data that's hot in cache from the previous loop. – Peter Cordes Aug 21 '18 at 01:15
  • Each piece of data in my case is larger than 1GB even after some compression. I seem to remember reading that gcc 4.8+ would use software prefetch in some cases. I hope the newest version of gcc would do this for me automatically in this case. The same algorithm would run many times over many pieces of data. – R zu Aug 21 '18 at 01:46
  • 1
    If you're repeatedly touching a 1GB working set, look into cache-blocking: run multiple steps / passes over 128kiB or so (half L2 size), so it stays hot in L2 for each pass. google cache blocking / loop tiling for more. (There's an example of a tiled matmul in [What Every Programmer Should Know About Memory?](https://stackoverflow.com/a/47714514)). – Peter Cordes Aug 21 '18 at 05:23
  • gcc supports data prefetching using `-fprefetch-loop-arrays`. Also, you can use prefetching manually in your code using the portable `__builtin_prefetch`. – Hadi Brais Aug 21 '18 at 05:35

1 Answer

3

You can run experiments to determine whether the data prefetchers are able to handle forward sequential accesses and backward sequential accesses. I have a Haswell CPU and so the prefetchers might be different from those implemented in your CPU (Sandy Bridge).

The following graph shows the observable per-element access latencies when traversing an array in the following ways:

  • The array is initialized sequentially in the forward direction and then traversed in the same way. I refer to this pattern as forfor.
  • The array is initialized sequentially in the forward direction and then traversed sequentially in the backward direction (from the last element to the first). I refer to this pattern as forback.
  • The array is initialized sequentially in the backward direction and then traversed in the same way. I refer to this pattern as backback.

The x-axis represents element indices and the y-axis represents latencies in TSC cycles. I have configured my system so that a TSC cycle approximately equals a core cycle. I have plotted measurements for two runs of forfor called forfor1 and forfor2. The average per-element latencies are as follows:

  • forfor1: 9.9 cycles.
  • forfor2: 15 cycles.
  • forback: 35.8 cycles.
  • backback: 40.3 cycles.

L1 access latencies are particularly sensitive to any measurement noise. The L2 access latency is supposed to be 12 cycles on average, but we might still get a measured latency of 12 cycles for L1 hits because of a few cycles of noise. In the first run of forfor, the majority of latencies are 4 cycles, which clearly indicates L1 hits. In the second run of forfor, the majority of latencies are 8 or 12 cycles; I think these are probably L1 hits as well. In both cases there are some L3 hits and a few main memory accesses. For both forback and backback, we can see that the majority of latencies are L3 hits. This means that the L3 prefetcher was able to handle both forward and backward traversals, but the L1 and L2 prefetchers were not.

However, the accesses are performed in rapid succession, one after the other, with essentially no computation in between. So even if the L2 prefetcher did try to prefetch backwards, it might deliver the data too late, and an L3-like latency would still be incurred.

Note that I'm not flushing the cache between the two traversals of the array, and so the first traversal may impact the latencies measured in the second traversal.

[Plot: per-element load latency in TSC cycles vs. element index for forfor1, forfor2, forback, and backback]

This is the code I used to take measurements.

/* compile with gcc at optimization level -O3 */
/* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */ 
/* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
/* make sure all cache prefetchers are enabled */
/* preferably disable HT */
/* this code is Intel-specific */
/* see the note at the end of the answer */

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>

// 2048 iterations.
#define LINES_SIZE 64
#define ITERATIONS 2048 * LINES_SIZE
// Forward
#define START 0
#define END ITERATIONS
// Backward
//#define START ITERATIONS - LINES_SIZE
//#define END 0
#if START < END
#define INCREMENT i = i + LINES_SIZE
#define COMP <
#else
#define INCREMENT i = i - LINES_SIZE
#define COMP >=
#endif

int main()
{
  int array[ ITERATIONS ];
  int latency[ ITERATIONS/LINES_SIZE ];
  uint64_t time1, time2, al, osl; /* initial values don't matter */

  // Initialize the array; reading uninitialized elements later would be UB.
  for ( int i = 0; i < ITERATIONS; i = i + LINES_SIZE )
  {
     array[ i ] = i; 
  }

  printf( "address = %p \n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */

  // Measure overhead.
  _mm_mfence();                      
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions */
  osl = time2 - time1;

  // Forward or backward traversal.
  for ( int i = START; i COMP END; INCREMENT )
  {

     _mm_mfence();                      /* drain all prior loads and stores before rdtsc */
     _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
     time1 = __rdtsc();                 /* set timer */
     _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
     int temp = array[ i ];             /* access array[i] */
     _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
     time2 = __rdtsc();
     _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions */
     al = time2 - time1;

     printf( "array[ %i ] = %i \n", i, temp );         /* prevent the compiler from optimizing the load */
     latency[ i/LINES_SIZE ] = al - osl;

  }

  // Output measured latencies.
  for ( int i = 0; i < ITERATIONS/LINES_SIZE; ++i )
  {
     printf( "%i \n", latency[i] );
  }

  return 0;
}

The purpose of these experiments is to measure individual access latencies to determine from which cache level each access is served. However, due to the presence of the LFENCE instructions, the measurements may include latency that the load instruction incurs in pipeline stages other than the data access itself. In addition, the compiler places some ALU instructions in the timed region, so the measurements may be affected by these instructions (this can be avoided by writing the code in assembly). This can make it difficult to distinguish between accesses that hit in the L1 and those that hit in the L2. For example, some L1 latency measurements are reported as 8 cycles. Nonetheless, the forback and backback measurements clearly show that most accesses hit in the L3.

If we were interested in measuring the average latency to access a particular level of the memory hierarchy, then using pointer chasing can provide more accurate results. In fact, this is the traditional way of measuring memory latency.
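
As a rough sketch of that idea (my addition, not part of the original measurements): each element of the chain stores the index of the next element to load, so every load depends on the previous one and the average time per load approximates the latency of whichever level the chain lives in. The stride value and array size below are arbitrary choices for illustration.

/* Minimal pointer-chasing sketch. Assumes 64-byte cache lines and 8-byte size_t. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define N 32768                        /* number of cache lines in the chain (2 MiB total) */

int main( void )
{
   static size_t next[ N * 8 ];        /* 8 size_t (64 bytes) per cache line */

   /* Build a single cycle with a large stride (1571 lines, ~100 KiB jumps) so the
      hardware prefetchers are unlikely to follow it; a random permutation would be
      even better. gcd(1571, N) == 1, so all N lines are visited exactly once. */
   for ( size_t i = 0; i < N; i++ )
      next[ i * 8 ] = ( ( i + 1571 ) % N ) * 8;

   size_t p = 0;
   uint64_t time1 = __rdtsc();
   for ( size_t i = 0; i < N; i++ )
      p = next[ p ];                   /* dependent load chain */
   uint64_t time2 = __rdtsc();

   /* Print p so the compiler cannot optimize the chain away. */
   printf( "average cycles per load: %.2f (p = %zu)\n",
           ( double )( time2 - time1 ) / N, p );
   return 0;
}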

If you are accessing a large amount of data in a pattern that is difficult for the hardware prefetchers (especially those at the L2 or L3) to predict, software prefetching can be very beneficial. However, getting software prefetching right is hard in general. In addition, the measurements I got show that the L3 prefetcher can prefetch both forwards and backwards. If you have a good amount of parallelism, both in memory accesses and in computation, then OoO execution can hide a significant fraction of the L3 access latency.
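
As a sketch only (the function, the recurrence, and the PREFETCH_DISTANCE value are hypothetical, not from the question or the measurements above), manual prefetching in a backward loop with the `__builtin_prefetch` builtin mentioned in the comments could look like this:

/* Sketch: software prefetch in a backward loop with a loop-carried dependence.
   PREFETCH_DISTANCE (in elements) is a made-up tuning knob; the best value
   depends on the per-iteration work and the memory latency. */
#define PREFETCH_DISTANCE 256

void backward_pass( double *x, long n, double a )
{
   for ( long i = n - 2; i >= 0; i-- )
   {
      if ( i >= PREFETCH_DISTANCE )
         __builtin_prefetch( &x[ i - PREFETCH_DISTANCE ], 1, 3 );  /* prefetch for write, high temporal locality */
      x[ i ] = a * x[ i ] + ( 1.0 - a ) * x[ i + 1 ];              /* depends on the next element */
   }
}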


Important note on correctly running the program: It turns out that if I do not use the output redirection operator > to redirect all the output to a file (i.e., all the output is printed to the terminal), all the measured latencies come out close to the L3 hit latency. The reason is that printf, which is called in every iteration, pollutes much of the L1 and L2 caches. So make sure to use the > operator. You can also use (void) *((volatile int*)array + i) instead of int temp = array[i], as proposed in this and this answer. That would be even more reliable.
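
For reference, the timed region would then look like this (a sketch reusing the variables from the code above; with the volatile read, no printf is needed inside the loop to keep the load alive):

     _mm_mfence();
     _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
     time1 = __rdtsc();                 /* set timer */
     _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
     (void) *((volatile int*)array + i);   /* the load cannot be optimized away */
     _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
     time2 = __rdtsc();
     _mm_lfence();                      /* serialize rdtsc with respect to trailing instructions */
     al = time2 - time1;
     latency[ i/LINES_SIZE ] = al - osl;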

Hadi Brais
  • 1
    For arrays not much bigger than L2, looping forwards then backwards is pretty good; you get lots of cache hits at the turn-around. I think your `lfence` around every access is stopping OoO exec from hiding L2 hit latency, which it normally can. – Peter Cordes Aug 21 '18 at 01:19
  • @PeterCordes Now I realize that I chose the size of array just to make it much larger (8x) than the L1D, but it's just mere coincidence that the size I chose is exactly equal to the L2 size in Haswell. That was not intentional. So ideally, all the lines should be in the L2 cache after the first traversal. However, the measurements indicate that there are lots of conflicts. The L2 is physically addressed so this depends on the physical addresses of the lines. The averages I've shown were calculated using single runs. It would certainly be better to repeat each experiment 10 times or so. – Hadi Brais Aug 21 '18 at 04:06
  • 1
    You can use 2MB THP huge pages to avoid the physical address conflicts and get cleaner data. I would definitely remove the `lfence` around every access! – BeeOnRope Aug 21 '18 at 05:11
  • @BeeOnRope and Peter: Remove the fences to measure what? My goal was to measure individual memory access latencies to determine at which cache level I'm getting a hit. If I remove the fences, I wouldn't be able to do that. But yes, in a real program, that would be the case. – Hadi Brais Aug 21 '18 at 05:29
  • 1
    If your code is mostly executing lfence, you just end up mostly measuring the lfence cost. Do several dependent loads and time the whole sequence to get a good result. In uarch-bench this tends to give the right answer to 3 or 4 significant figures (eg 4.00 cycles for an L1 hit). – BeeOnRope Aug 21 '18 at 05:52
  • 1
    @BeeOnRope Oh yes, that would be much better. But I still wouldn't be able to measure individual access latencies without using rdtsc instructions, which require the fences, right? So dependent loads seem to be a good idea if I wanted to measure the latency of a particular memory level. – Hadi Brais Aug 21 '18 at 05:56
  • not when different accesses may have different latencies. – Hadi Brais Aug 21 '18 at 06:02
  • @PeterCordes Is that what you meant by removing the lfences? – Hadi Brais Aug 21 '18 at 06:05
  • 1
    I meant time summing the whole array, giving out-of-order execution the chance to work its magic so you're doing a realistic measurement of *actually* looping over an array forwards vs. backwards. @Bee's suggestion to do dependent loads could work, too, e.g. make each element a pointer to the next or previous element or cache line. That would let you detect if there really is no backward prefetching into L1d, only into L2. (Use actual pointers instead of array indices to save a cycle in address-generation.) – Peter Cordes Aug 21 '18 at 06:17
  • 1
    Agreed. I still don't understand how @BeeOnRope's suggestion works. Let's say there are X dependent loads and let Y be the measured execution time for performing all the loads sequentially. If we divide Y by X and got about 4 cycles, then this is a good indication that most accesses hit in the L1. But if the result was 15 cycles, then I don't see how this can be interpreted beyond being the average access latency. – Hadi Brais Aug 21 '18 at 06:30
  • 1
    You'd be measuring the actual load->use latency, rather than the execute->retire delay of a load that hits in L2. Those are probably similar or at least closely correlated, but I'd much rather see a test that let the pipeline do its thing. We know that current x86 doesn't do value-prediction. But anyway, looping backward over an array isn't normally doing dependent loads, so measuring L2 hit *latency* isn't what you should be measuring in the first place. IDK why @Bee suggested that, except to point out a better / less-noisy way to measure what your current code is already probably measuring. – Peter Cordes Aug 21 '18 at 14:08
  • 1
    @Peter - yeah I mentioned how I would calculate _latency_ in the context of Hadi's answer which apparently measures that and because Hadi was mentioning access _latencies_. You are right though that the OP's question is all about throughput/parallel accesses - but my measurement advice would be the same except don't make the loads dependent. – BeeOnRope Aug 21 '18 at 16:59
  • 1
    @Hadi - yes measuring X accesses can only give you an average but that is often good enough because: (a) the average is often all you care about, as in the OP wants to know how long it will take to access the array and probably doesn't care about the fine-grained "histogram" of individual access times. (b) You can use your mental model of the hardware and other information to make likely guesses about the fine-grained behavior even when your test only pops out averages. For example for most instructions you expect the latency to be constant, so an average gives you the right answer. – BeeOnRope Aug 21 '18 at 17:03
  • For memory access, you have some published numbers so you can double check: eg if you make a test that should be entirely in L1 and you are getting 5.0 cycles average, it _could_ be some linear combination of 4 cycle L1 hits and X cycle L1 misses giving you that figure, or it could be all L1 hits that just have 5 cycle latency in your test due to the addressing mode or an extra instruction or whatever. Without other information I'd already pick the second option as more likely, but you could vary your test to try to tease out the effect, or check PMCs or whatever. – BeeOnRope Aug 21 '18 at 17:06
  • It's also important to understand why lfence bracketed regions aren't that great for measuring small latencies. I get the appeal: it's a way to cut out all the OoO complexity and measure the end-to-end latency of an isolated sequence! A big problem is that real programs don't use lfence, so it already starts out fake (it's actually a huge problem in the specific scenario of the OP compared to your test because it reduces MLP to 1 so the dominant factor is latency, but really the theoretical MLP is infinite and actual MLP will be "high" so the characteristics of the test are quite different). – BeeOnRope Aug 21 '18 at 17:16
  • lfence is implemented in some way that can vary by uarch and it may interact in unexpected ways with instructions in the measured region. For example, I have found that adding a single 1-cycle ALU op between lfences costs 2 or more cycles, compared to the expected 1 cycle. On some other arch it might cost zero cycles to add a few ALU ops if the way the lfence is implemented lets a few ops in for free. So the behavior for very small bits of code is already non-ideal: you aren't getting cycle accuracy. So you need to add more ops, which already points towards an averaging approach. – BeeOnRope Aug 21 '18 at 17:19
  • So stringing together N copies of the code you want to test, while still "fake" compared to real code, is still very realistic in that in exercises the common OoO machinery that real code uses so your answer is in some sense correct for the uarch. Also, how you string together the code under test lets you choose explicitly whether you want to measure latency, throughput or something in between (ie, limited parallelism by having N independent streams of dependent sequences). You can also choose through which inputs and outputs you want the dependencies to flow. – BeeOnRope Aug 21 '18 at 17:22