
I have read that when accessing memory with a stride, as in

for (int i = 0; i < aSize; i++) a[i] *= 3;

for (int i = 0; i < aSize; i += 16) a[i] *= 3;

both loops should perform similarly, since the cost of the memory accesses dominates the cost of the multiplication.

I'm playing around with Google Benchmark, and while testing similar cache behaviour I'm getting results that I don't understand.

template <class IntegerType>
void BM_FillArray(benchmark::State& state) {
    for (auto _ : state)
    {
        IntegerType a[15360 * 1024 * 2]; // Reserve array that doesn't fit in L3
        for (size_t i = 0; i < sizeof(a) / sizeof(IntegerType); ++i)
            benchmark::DoNotOptimize(a[i] = 0); // I have compiler optimizations disabled anyway
    }
}
BENCHMARK_TEMPLATE(BM_FillArray, int32_t);
BENCHMARK_TEMPLATE(BM_FillArray, int8_t);
Run on (12 X 3592 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 15360 KiB (x1)
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_FillArray<int32_t>     196577075 ns    156250000 ns            4
BM_FillArray<int8_t>      205476725 ns    160156250 ns            4

I would expect accessing the array of bytes to be faster than the array of ints as more elements fit in a cache line, but this is not the case.

Here are the results with optimizations enabled:

BM_FillArray<int32_t>   47279657 ns     47991071 ns           14
BM_FillArray<int8_t>    49374830 ns     50000000 ns           10

Can anyone please clarify this? Thanks :)

UPDATE 1:

I have read the old article "What Every Programmer Should Know About Memory" and everything is clearer now. However, I've tried the following benchmark:

template <int32_t CacheLineSize>
void BM_ReadArraySeqCacheLine(benchmark::State& state) {

    struct CacheLine
    {
        int8_t a[CacheLineSize];
    };
    vector<CacheLine> cl;
    int32_t workingSetSize = state.range(0);
    int32_t arraySize = workingSetSize / sizeof(CacheLine);
    cl.resize(arraySize);

    const int32_t iterations = 1536 * 1024;

    for (auto _ : state)
    {
        srand(time(NULL));
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            //size_t idx = i% arraySize;
            int idx = (rand() / float(RAND_MAX)) * arraySize;
            benchmark::DoNotOptimize(res += cl[idx].a[0]);
        }
    }
}
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 1)
    ->Arg(32 * 1024)    // L1 Data 32 KiB(x6)
    ->Arg(256 * 1024)   // L2 Unified 256 KiB(x6)
    ->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 64)
    ->Arg(32 * 1024)    // L1 Data 32 KiB(x6)
    ->Arg(256 * 1024)   // L2 Unified 256 KiB(x6)
    ->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 128)
    ->Arg(32 * 1024)    // L1 Data 32 KiB(x6)
    ->Arg(256 * 1024)   // L2 Unified 256 KiB(x6)
    ->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)

I would expect random accesses to perform much worse when the working set doesn't fit in the caches. However, these are the results:

BM_ReadArraySeqCacheLine<1>/32768        39936129 ns     38690476 ns           21
BM_ReadArraySeqCacheLine<1>/262144       40822781 ns     39062500 ns           16
BM_ReadArraySeqCacheLine<1>/15728640     58144300 ns     57812500 ns           10
BM_ReadArraySeqCacheLine<64>/32768       32786576 ns     33088235 ns           17
BM_ReadArraySeqCacheLine<64>/262144      32066729 ns     31994048 ns           21
BM_ReadArraySeqCacheLine<64>/15728640    50734420 ns     50000000 ns           10
BM_ReadArraySeqCacheLine<128>/32768      29122832 ns     28782895 ns           19
BM_ReadArraySeqCacheLine<128>/262144     31991964 ns     31875000 ns           25
BM_ReadArraySeqCacheLine<128>/15728640   68437327 ns     68181818 ns           11

What am I missing?

UPDATE 2:

I'm now using what you suggested (linear_congruential_engine) to generate the random numbers, and I'm using only static arrays, but the results are even more confusing to me now.

Here is the updated code:

template <int32_t WorkingSetSize, int32_t ElementSize>
void BM_ReadArrayRndCacheLine(benchmark::State& state) {

    struct Element
    {
        int8_t data[ElementSize];
    };
    constexpr int32_t ArraySize = WorkingSetSize / sizeof(ElementSize);
    Element a[ArraySize];

    constexpr int32_t iterations = 1536 * 1024;
    linear_congruential_engine<size_t, ArraySize/10, ArraySize/10, ArraySize> lcg; // I've tried with many params...
    
    for (auto _ : state)
    {
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            size_t idx =  lcg();
            benchmark::DoNotOptimize(res += a[idx].data[0]);
        }
    }
}

// L1 Data 32 KiB(x6)
// L2 Unified 256 KiB(x6)
// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 128);

BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 128);

BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 128);

BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 128);

Here are the results (optimizations enabled):

// First template parameter is working set size.
// Second template parameter is array element size.
BM_ReadArrayRndCacheLine<32 * 1024, 1>             2833786 ns      2823795 ns          249
BM_ReadArrayRndCacheLine<32 * 1024, 64>            2960200 ns      2979343 ns          236
BM_ReadArrayRndCacheLine<32 * 1024, 128>           2896079 ns      2910539 ns          204

BM_ReadArrayRndCacheLine<256 * 1024, 1>            3114670 ns      3111758 ns          236
BM_ReadArrayRndCacheLine<256 * 1024, 64>           3629689 ns      3643135 ns          193
BM_ReadArrayRndCacheLine<256 * 1024, 128>          3213500 ns      3187189 ns          201

BM_ReadArrayRndCacheLine<15360 * 1024, 1>          5782703 ns      5729167 ns           90
BM_ReadArrayRndCacheLine<15360 * 1024, 64>         5958600 ns      6009615 ns          130
BM_ReadArrayRndCacheLine<15360 * 1024, 128>        5958221 ns      5998884 ns          112

BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 1>      6143701 ns      6076389 ns           90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 64>     5800649 ns      5902778 ns           90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 128>    5826414 ns      5729167 ns           90

How is it possible that the results for (L1d < workingSet < L2) barely differ from those for (workingSet < L1d)? L2's latency is still much higher (and its throughput lower) than L1d's, and with the random accesses I'm trying to prevent prefetching and force cache misses... so why am I not noticing even a minimal increase?

Even when fetching from main memory (workingSet > L3) I'm not getting a massive performance drop. You mention that the latest architectures can sustain bandwidths of up to ~8 bytes per clock, but I understand that they must copy a whole cache line, and that without a predictable linear pattern for prefetching, latency should be much more noticeable in my tests... why isn't that the case?

I suspect that page faults and the TLB may have something to do with it too.

(I've downloaded VTune to try to understand all this stuff better, but it hangs on my machine and I'm waiting for support.)

I REALLY appreciate your help Peter Cordes :)

I'm just a GAME programmer trying to show my teammates whether using certain integer types in our code might (or might not) have implications for our game's performance. For example, whether we should worry about using fast types (e.g. int_fast16_t) or about using the fewest possible bytes in our variables for better packing (e.g. int8_t).

  • *I have compiler optimizations disabled anyway* - then everything will be so slow that RAM can easily keep up with the CPU, even if you have a many-core Xeon (where single-core memory bandwidth is lower than quad-core desktops). But if that's a hex-core Intel "client" chip (I suspect not AMD from having 6 cores sharing an L3), then much more bandwidth available than you need to zero a byte or a dword every ~6 clock cycles. ([Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020)) – Peter Cordes Apr 28 '21 at 09:05
  • Does this answer your question? [Idiomatic way of performance evaluation?](https://stackoverflow.com/questions/60291987/idiomatic-way-of-performance-evaluation) – Peter Cordes Apr 28 '21 at 09:09
  • Also [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/a/32001196) / [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) – Peter Cordes Apr 28 '21 at 09:10
  • With optimization enabled, they should both optimize to `memset(buf, 0, size)`, or with non-zero values, a SIMD store loop that stores a constant amount of contiguous bytes per instruction. Or if `DoNotOptimize` forces scalar stores, then 1x 32-bit store per clock cycle is still slow enough for DRAM to keep up with. – Peter Cordes Apr 28 '21 at 09:47
  • But even if cpu is slow, main memory could for instance take 200 stall cycles. I have enabled optimizations and I'm still getting quite similar results. Changing to a non-zero value doesn't help. I will read all your suggested threads, thanks. – Albert Caldas Apr 28 '21 at 09:48
  • 200 stall cycles is *latency*, not throughput. HW prefetch and memory-level parallelism hide that. http://www.lighterra.com/papers/modernmicroprocessors/ has a section on memory. – Peter Cordes Apr 28 '21 at 09:49
  • Note in [How can cache be that fast?](https://electronics.stackexchange.com/q/329789), even a CPU almost a decade old (IvyBridge) with only a single channel of DRAM installed has a main-memory write bandwidth of ~9 GB/s, or 3 bytes per clock cycle for a single core. That would be about 6 B/c dual channel, more than enough to keep up with a single core doing one 32-bit store per clock cycle in a scalar loop that didn't get auto-vectorized or otherwise converted into something more like `memset`. – Peter Cordes Apr 28 '21 at 11:43
  • On a more modern CPU like a desktop Skylake, theoretical max is 34GB/s and a single core can use it almost all, about 8 bytes per core clock at 4GHz. On E5-1650 v4 ([6-core Broadwell-EP, 15MiB of L3](//en.wikichip.org/wiki/intel/xeon_e5/e5-1650_v4)), it's even higher (~71GiB/s across all 4 channels), but per-core bandwidth can be limited to a lot less than the max aggregate. (But still over 10GiB/s of read *and* write in a memcpy, see link in my first comment, should be enough for scalar stores.) *Normal code on a single core can go vastly faster than 4B / 200 cycles because of HW prefetch.* – Peter Cordes Apr 28 '21 at 11:47
  • re: update with benchmark: `rand()` is pretty slow; you're bottlenecking on `rand()` and division throughput for L1d and L2 sizes, with L3 cache-miss being slow enough that it's only partially hidden by OoO exec. (Hiding cache miss latency as much as possible by overlapping with other computation is one of the major goals / benefits of out-of-order exec. Your benchmark allows as much memory-level parallelism as the machine can manage, if it even needs that much to keep up with slower `rand()`.) Also, don't `srand()` inside the timed function; do that once or not at all. – Peter Cordes May 03 '21 at 16:58
  • A fast xorshift+ or an LCG that avoids division might work well; you don't need high quality randomness to defeat hardware prefetching. You're probably looping enough times that allocating the vector inside the timed function is ok even with page-fault overhead. Normal `std::vector` can't efficiently zero-init without dirtying pages, so you're probably not seeing any lazy-allocation effects like having multiple virtual pages mapped to the same physical page of zeros for TLB misses but L1d hits. – Peter Cordes May 03 '21 at 16:58
  • LCG with multiplier (`a`) = ArraySize/10 is probably just barely a large enough stride for the hardware prefetcher to not benefit much from locking on to it. But normally IIRC you want a large odd number or something (been a while since I looked at the math of LCG param choices), otherwise you risk only touching a limited number of array elements, not eventually covering them all. (You could test that by storing a `1` to every array element in a random loop, then count how many array elements got touched, i.e. by summing the array, if other elements are 0.) – Peter Cordes May 04 '21 at 12:52
  • Re: why L2 isn't slower than L1: out-of-order exec is sufficient to hide L2 latency. Your Skylake-derived CPU has an out-of-order scheduler (RS) of 97 uops, and a ROB size of 224 uops (like https://www.realworldtech.com/haswell-cpu/3/ but larger), and 12 LFBs to track cache lines it's waiting for. As long as the CPU can keep track of enough in-flight loads (latency * bandwidth), having to go to L2 is not a big deal. Ability to hide cache misses is one way to measure out-of-order window size in the first place: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ – Peter Cordes May 04 '21 at 12:57
  • ~12-cycle L2 latency is too much for the CPU to manage 2 cache-lines per clock cycle from L2 the way it can from L1, but even an LCG with `m` = a power of 2 (or even free like unsigned max) is likely still too slow to allow that. (mod by a compile-time constant is pretty fast, but its total latency is still *several* clock cycles, including two `imul` latencies). So actually, making your array size a power of 2 could significantly shorten LCG latency (by turning `mod m` into an `and` instruction), like from imul(3) + add(1) + maybe 8 or 9 cycles down to just 5 total cycles. – Peter Cordes May 04 '21 at 13:02
  • Still, one load per 5 cycles is very easy for the CPU to keep up with when they all hit in L2, and probably even L3. Running multiple independent LCGs in an unrolled loop could work. – Peter Cordes May 04 '21 at 13:04
  • I'll try what you suggest, thanks!!! By the way, storing the random indices in a secondary array would work? Thanks to prefetching we would have the index immediately on each iteration. We would have two streams of data but for comparison maybe it's enough? I have tried this, with the rand() algorithm, and now I can see big steps between working sizes. – Albert Caldas May 04 '21 at 14:01
  • Yeah, I had the same idea of reading random indices from an array instead of generating as you go, while editing my answer. :P See the update. You should see more of an effect with that. Also note the point about your LCG parameters not covering the whole array: `size/10 + x * (size/10) % size` isn't going to give you good coverage! – Peter Cordes May 04 '21 at 14:05

1 Answer


Re: the ultimate question: int_fast16_t is garbage for arrays because glibc on x86-64 unfortunately defines it as a 64-bit type (not 32-bit), so it wastes huge amounts of cache footprint. The question is "fast for what purpose", and glibc answered "fast for use as array indices / loop counters", apparently, even though it's slower to divide, or to multiply on some older CPUs (which were current when the choice was made). IMO this was a bad design decision.

Generally using small integer types for arrays is good; usually cache misses are a problem so reducing your footprint is nice even if it means using a movzx or movsx load instead of a memory source operand to use it with an int or unsigned 32-bit local. If SIMD is ever possible, having more elements per fixed-width vector means you get more work done per instruction.

Unfortunately int_fast16_t isn't going to help you achieve that with some libraries, but short or int_least16_t will.
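
If you want to check what your own standard library does, printing the sizes is a quick sanity test. This is only a minimal sketch; the exact values are implementation-defined, so the glibc-on-x86-64 sizes mentioned above are an expectation, not a guarantee:

#include <cstdint>
#include <cstdio>

int main() {
    // On x86-64 glibc this typically prints 8 / 2 / 2; other ABIs and libcs may differ.
    std::printf("int_fast16_t  : %zu bytes\n", sizeof(std::int_fast16_t));
    std::printf("int_least16_t : %zu bytes\n", sizeof(std::int_least16_t));
    std::printf("short         : %zu bytes\n", sizeof(short));
    // For a 1M-element array that's roughly 8 MiB vs 2 MiB of cache footprint.
}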


See my comments under the question for answers to the early part: 200 stall cycles is latency, not throughput. HW prefetch and memory-level parallelism hide that. Modern Microprocessors - A 90 Minute Guide! is excellent, and has a section on memory. See also What Every Programmer Should Know About Memory?, which is still highly relevant in 2021. (Except for some stuff about prefetch threads.)


Your Update 2 with a faster PRNG

Re: why L2 isn't slower than L1: out-of-order exec is sufficient to hide L2 latency, and even your LCG is too slow to stress L2 throughput. It's hard to generate random numbers fast enough to give the available memory-level parallelism much trouble.

Your Skylake-derived CPU has an out-of-order scheduler (RS) of 97 uops, and a ROB size of 224 uops (like https://realworldtech.com/haswell-cpu/3 but larger), and 12 LFBs to track cache lines it's waiting for. As long as the CPU can keep track of enough in-flight loads (latency * bandwidth), having to go to L2 is not a big deal. Ability to hide cache misses is one way to measure out-of-order window size in the first place: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity


Latency for an L2 hit is 12 cycles (https://www.7-cpu.com/cpu/Skylake.html). Skylake can do 2 loads per clock from L1d cache, but not from L2. (It can't sustain 1 cache line per clock IIRC, but 1 per 2 clocks or even somewhat better is doable).

Your LCG RNG bottlenecks your loop on its latency: 5 cycles for power-of-2 array sizes, or more like 13 cycles for non-power-of-2 sizes like your "L3" test attempts (see note 1). So that's about 1/10th the access rate that L1d can handle, and even if every access misses L1d but hits in L2, you're not even keeping more than one load in flight from L2. OoO exec + load buffers aren't even going to break a sweat. So L1d and L2 will be the same speed because they both use power-of-2 array sizes.

Note 1: imul(3c) + add(1c) for x = a * x + c, then remainder = x - (x/m * m) using a multiplicative inverse, probably mul (4 cycles for the high half of size_t?) + shr(1) + imul(3c) + sub(1c). Or with a power-of-2 size, modulo is just AND with a constant like (1UL<<n) - 1.

Clearly my estimates aren't quite right, because your non-power-of-2 arrays take less than twice the time of the L1d / L2 cases, not the 13/5 ratio my estimate would predict even if L3 latency/bandwidth weren't a factor.

Running multiple independent LCGs in an unrolled loop could make a difference. (With different seeds.) But a non-power-of-2 m for an LCG still means quite a few instructions so you would bottleneck on CPU front-end throughput (and back-end execution ports, specifically the multiplier).

An LCG with multiplier (a) = ArraySize/10 is probably just barely a large enough stride for the hardware prefetcher to not benefit much from locking on to it. But normally IIRC you want a large odd number or something (been a while since I looked at the math of LCG param choices), otherwise you risk only touching a limited number of array elements, not eventually covering them all. (You could test that by storing a 1 to every array element in a random loop, then count how many array elements got touched, i.e. by summing the array, if other elements are 0.)
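
A rough sketch of that coverage check, reusing the a / ArraySize / lcg / iterations names from the Update 2 benchmark (so it's an illustration to drop next to that code, not a standalone program; it needs <vector>, <numeric> and <cstdio>):

// Mark every element the generator ever lands on, then count how many were hit.
std::vector<uint8_t> touched(ArraySize, 0);
for (int32_t i = 0; i < iterations; ++i)
    touched[lcg() % ArraySize] = 1;   // % ArraySize is redundant for this engine, but keeps the sketch generic
size_t covered = std::accumulate(touched.begin(), touched.end(), size_t{0});
printf("covered %zu of %d elements\n", covered, ArraySize);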

a and c should definitely not both be factors of m, otherwise you're accessing the same 10 cache lines every time to the exclusion of everything else.

As I said earlier, it doesn't take much randomness to defeat HW prefetch. An LCG with c = 0, a = an odd number (maybe prime), and m = UINT_MAX might be good, literally just an imul. You can modulo by your array size on each LCG result separately, taking that operation off the critical path. At this point you might as well keep the standard library out of it and literally just use unsigned rng = 1; to start, and rng *= 1234567; as your update step. Then use arr[rng % arraysize].

That's cheaper than anything you could do with xorshift+ or xorshift*.
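
As a concrete sketch of that loop, plugged into the shape of your Update 2 benchmark (same a, res and iterations names as in your code; the multiplier is just the example constant from the previous paragraph):

unsigned rng = 1;                        // seed; any odd starting value works
for (int32_t i = 0; i < iterations; ++i)
{
    rng *= 1234567;                      // the whole RNG update step: a single imul, wrapping mod 2^32
    size_t idx = rng % ArraySize;        // reduction to the array size, off the rng -> rng dependency chain
    benchmark::DoNotOptimize(res += a[idx].data[0]);
}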


Benchmarking cache latency:

You could generate an array of random uint16_t or uint32_t indices once (e.g. in a static initializer or constructor) and loop over that repeatedly, accessing another array at those positions. That would interleave sequential and random access, and make code that could probably do 2 loads per clock with L1d hits, especially if you use gcc -O3 -funroll-loops. (With -march=native it might auto-vectorize with AVX2 gather instructions, but only for 32-bit or wider elements, so use -fno-tree-vectorize if you want to rule out that confounding factor that only comes from taking indices from an array.)
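
One possible shape for that, as a sketch in the style of your benchmarks (the function name and details are illustrative; it needs <vector>, <random> and <cstdint> in addition to the benchmark header):

template <int32_t WorkingSetSize, int32_t ElementSize>
void BM_ReadArrayPrecomputedIdx(benchmark::State& state) {
    struct Element { int8_t data[ElementSize]; };
    constexpr int32_t ArraySize = WorkingSetSize / sizeof(Element);
    static Element a[ArraySize] = {};

    constexpr int32_t Iterations = 1536 * 1024;
    // Generate the random indices once, outside the timed region.
    static const std::vector<uint32_t> indices = [] {
        std::vector<uint32_t> v(Iterations);
        std::mt19937 gen(42);
        std::uniform_int_distribution<uint32_t> dist(0, ArraySize - 1);
        for (auto& x : v) x = dist(gen);
        return v;
    }();

    for (auto _ : state) {
        int8_t res = 0;
        for (uint32_t idx : indices)   // sequential reads of the index array, random reads of a[]
            benchmark::DoNotOptimize(res += a[idx].data[0]);
    }
}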

To test cache / memory latency, the usual technique is to make a linked list with its nodes scattered randomly around an array. Walking the list, the next load can start as soon as (but not before) the previous load completes, because each load depends on the previous one's result. This is called "load-use latency". See also Is there a penalty when base+offset is in a different page than the base? for a trick Intel CPUs use to optimistically speed up workloads like that (the 4-cycle L1d latency case, instead of the usual 5 cycles). Semi-related: PyPy 17x faster than Python. Can Python be sped up? is another test that's dependent on pointer-chasing latency.
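
A sketch of the index-chasing flavour of that idea (the names here are made up): build one big random cycle in an array of indices, so every load's address comes from the previous load's result.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// next[i] holds the index to visit after i. Building a single random cycle
// guarantees every element gets visited before the walk repeats.
std::vector<uint32_t> make_chase_order(size_t n) {
    std::vector<uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    std::shuffle(order.begin(), order.end(), std::mt19937(123));
    std::vector<uint32_t> next(n);
    for (size_t i = 0; i + 1 < n; ++i)
        next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];       // close the cycle
    return next;
}

// The timed part is load-use latency bound: each load can't start until the
// previous one has produced the next index.
uint32_t chase(const std::vector<uint32_t>& next, size_t steps) {
    uint32_t i = 0;
    for (size_t s = 0; s < steps; ++s)
        i = next[i];
    return i;                            // return the result so the chain isn't optimized away
}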

Peter Cordes
  • re: "but 1 per 2 clocks or even somewhat better is doable" you can get 2 per 3 clock cycles from L2. See [Travis Downs has a writeup on it](https://github.com/travisdowns/uarch-bench/wiki/How-much-bandwidth-does-the-L2-have-to-give,-anyway%3F). Also regarding the `fast_int*` going to `int64` on x86_64. Think worth a patch? My intuition would be int8 -> int8, int16 -> int32, int32 -> int32, int64 -> int64. – Noah May 04 '21 at 18:46
  • re: "Clearly my estimates aren't quite right" possibly because `mul` on p0 is the bottleneck some of the ALU like `sub` even though they are on the critical path in latency end up as free because they don't take p0 resources so don't affect the actual bottleneck on throughput? – Noah May 04 '21 at 18:53