
I'm trying to figure out the best approach to benchmarking C++ programs, and I would like to simulate both the scenario where the data used by the benchmarked section is already present in the cache and the scenario where it is cold.

Is there a reliable way to enforce good or bad cache locality on x86-64 machines as a form of preparation for a test run, assuming the data the run will involve is known?

Big Temp
  • You will get good cache hit rates for successive accesses to memory. If you know the cache sizes, you can access as many elements as fit in the cache. For a bad cache hit rate, you can just use random access to memory; this will almost certainly trash the cache every time. There are many more things to take into consideration, but the question is too broad to get into every detail. – OutOfBound Mar 15 '20 at 23:03
  • Getting caches hot before a run is easy: do a warm-up run first. You should do that anyway to avoid lots of bad effects. e.g. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) describes many pitfalls. Looping over a *different* large array will typically evict most other data from caches. – Peter Cordes Mar 15 '20 at 23:06
  • Typically, what one does to test the 'cached' scenario is run the test function once and throw away the measurement data, then start the actual test. This is not perfect since, of course, the function itself may evict its own data from the cache. Second, to test the 'cold' scenario, make sure to poison the cache by loading a bunch of data. This is one of the rare cases where you might find `std::list` useful. – bitmask Mar 15 '20 at 23:14
  • Did you try Cachegrind? https://valgrind.org/docs/manual/cg-manual.html – Severin Pappadeux Mar 15 '20 at 23:51
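
A minimal sketch of the preparation these comments describe: a discarded warm-up call for the hot-cache case, and a pass over a buffer much larger than the last-level cache for the cold case. The buffer size, stride, and `benchmarked_section` stand-in are illustrative assumptions, not from the question:

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Stand-in for the real code under test (illustrative).
long long benchmarked_section(const std::vector<int>& data) {
    long long sum = 0;
    for (int x : data) sum += x;
    return sum;
}

// Touch a buffer much larger than the last-level cache so that most
// previously cached lines get evicted ("cold cache" preparation).
void evict_caches() {
    static std::vector<char> junk(512u * 1024 * 1024); // assumed >> LLC size
    for (std::size_t i = 0; i < junk.size(); i += 64)  // one touch per cache line
        ++junk[i];
}

int main() {
    std::vector<int> data(1 << 20, 1); // ~4 MiB of payload (illustrative)

    // Hot-cache run: do a warm-up call and discard its result, then time.
    {
        volatile long long warmup = benchmarked_section(data);
        (void)warmup;
        auto t0 = std::chrono::steady_clock::now();
        volatile long long sink = benchmarked_section(data);
        auto t1 = std::chrono::steady_clock::now();
        (void)sink;
        std::cout << "warm: "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms\n";
    }

    // Cold-cache run: push the data out of the caches first, then time.
    {
        evict_caches();
        auto t0 = std::chrono::steady_clock::now();
        volatile long long sink = benchmarked_section(data);
        auto t1 = std::chrono::steady_clock::now();
        (void)sink;
        std::cout << "cold: "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms\n";
    }
}
```

On real hardware you would size `junk` from the actual last-level cache size (e.g. from `lscpu` or CPUID) rather than hard-coding it.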

1 Answer


Presumably you are benchmarking an algorithm that performs an operation over a range of objects, and you care about the locality of those objects in memory (and thus cache).

To "simulate" locality: Create locality. You can create a linked list with high locality as well as linked list with low locality:

Allocate the nodes in an array. To create a list with high locality, make the first element of the array point to the second, and so on. To create a list with low locality, create a random permutation of the order so that each node points to another at a random position in the array.

Make sure the number of elements is large enough that the total size is at least an order of magnitude greater than the largest cache.
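
A sketch of that construction, under the assumption of a hand-rolled singly linked node (names and the RNG seed are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct Node {
    int   value;
    Node* next; // 16-byte node on a typical x86-64 implementation
};

// Build a singly linked list whose nodes all live in one array. With
// `sequential == true`, node i points to node i+1 (high locality); otherwise
// the traversal order is a random permutation of the array (low locality).
// The head is always &nodes[0].
std::vector<Node> make_list(std::size_t n, bool sequential) {
    std::vector<Node> nodes(n);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    if (!sequential) {
        std::mt19937 rng(42);
        std::shuffle(order.begin() + 1, order.end(), rng); // keep nodes[0] first
    }
    for (std::size_t i = 0; i < n; ++i) {
        nodes[order[i]].value = static_cast<int>(i);
        nodes[order[i]].next  = (i + 1 < n) ? &nodes[order[i + 1]] : nullptr;
    }
    return nodes;
}

// The traversal being benchmarked: identical code for both layouts.
long long traverse(const Node* head) {
    long long sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->value;
    return sum;
}
```

Because the traversal code is identical for both layouts, any timing difference comes from locality alone; for the low-locality case to actually miss, pick `n` so that `n * sizeof(Node)` exceeds the last-level cache by the margin the answer recommends.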

eerorika
  • Iterating through std::list vs. std::vector isn't just contiguous vs. scattered; a linked list creates a bottleneck on memory/cache load-use latency (typically 4 cycles for an L1d hit on mainstream x86 CPUs). But with elements *known at compile time to be contiguous* in an array / std::vector, the address can be computed with a simple pointer increment, creating memory-level parallelism (and allowing auto-vectorization). That's vastly better than a linked list that just happens to have its nodes allocated contiguously. A cache miss on one element in a pointer-chasing loop blocks later iterations. – Peter Cordes Mar 16 '20 at 01:13
  • @PeterCordes Good point. Disabling vectorisation optimisation may help with that, as well as making the size of the array unknown at compile time. – eerorika Mar 16 '20 at 01:55
  • That only helps a minor amount. Having a minimum 4-cycle load-use latency for pointer chasing between elements (`mov rdi, [rdi]`) is vastly worse than an `add rdi, 4` pointer increment for something like summing the elements of a container. It destroys memory-level parallelism. (Although hardware prefetch can still work if the list nodes happen to be contiguous). Also don't forget that the `next` pointer takes space, so `list` has 16-byte elements vs. `vector` having 4-byte elements (on a typical implementation like x86-64). – Peter Cordes Mar 16 '20 at 02:02
  • You simply can't drop in list vs. vector and get meaningful results for loops that iterate over a container, even if you compile with `gcc -O3 -fno-tree-vectorize`. It introduces so many confounding factors that you'd have to account for with detailed knowledge of compilers, asm, and CPUs that you might as well just do some perf analysis on the real loop you care about tuning! – Peter Cordes Mar 16 '20 at 02:03
  • @PeterCordes I rewrote the answer to suggest using the same data structure in both cases. – eerorika Mar 16 '20 at 02:48
  • Yup, that works, to model some kinds of problems. Some other problems have low locality but still have memory level parallelism (e.g. looking up hashes of sequential keys, or histogramming random data into a big array of counters). Caches can hit or start another off-core request while an earlier cache miss is still outstanding, but only if execution to compute the next address was independent of the previous address. (e.g. each Skylake core has 12 line-fill buffers for L1->L2 requests, and IIRC L2 has 16 superqueue entries for off-core requests for demand misses plus HW prefetch.) – Peter Cordes Mar 16 '20 at 02:59
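
To illustrate the distinction this last comment draws, here is a sketch contrasting a dependent pointer-style chain (no memory-level parallelism) with independent random accesses whose misses can overlap; both functions and their names are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Dependent chain: every address comes from the previous load, so each cache
// miss must complete before the next one can even start (no MLP).
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    for (std::size_t s = 0; s < steps; ++s)
        i = next[i]; // serialized on load-use latency
    return i;
}

// Independent accesses: the addresses don't depend on loaded data, so the
// out-of-order core can keep several cache misses in flight at once.
long long gather(const std::vector<int>& data,
                 const std::vector<std::size_t>& idx) {
    long long sum = 0;
    for (std::size_t i : idx)
        sum += data[i]; // misses to different lines can overlap
    return sum;
}
```

Both touch memory in an equally random order, but on a working set that misses the cache, the independent version typically runs several times faster because multiple line-fill buffers can be busy at once.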