
I wrote a small program to benchmark CPU cache latencies. The idea is to fill a buffer with random values and let the CPU load from it with strides that depend on the values just loaded.
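
In rough C terms, the measured loop looks like this (just a sketch of the idea; the loop that actually runs is the assembly function further down):

#include <stdint.h>
#include <string.h>

/* Sketch of the timed loop: each load address depends on the value just
   loaded, so the loads form one long dependency chain and the average time
   per iteration approximates the load latency. */
static uint64_t chase_sketch(const uint64_t *buf, uint64_t mask, uint64_t loops) {
    uint64_t off = buf[0];                       // seed from the first qword
    for (uint64_t k = 0; k < loops; k++) {
        uint64_t v;
        off &= mask;                             // mask = working-set size - 1
        memcpy(&v, (const char *)buf + off, 8);  // load at an arbitrary byte offset
        off += v;                                // next offset depends on this load
    }
    return off;                                  // keep the chain live
}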

Depending on the size of the buffer I expect the latencies to fall into some well-defined clusters. I must say that I do see a clear pattern, and that's a success. The sequential-read programs from figures 3.10 and 3.11 in this classic do not work anywhere near as neatly for me, presumably because prefetching has improved a lot since 2007.

What's strange is that the pattern I get is consistent with a cache 8 times as big as mine. So my question.

My system (2-core MacBook Pro, mid-2014) has a 32 KiB L1d cache, a 256 KiB L2 cache, and a 3 MiB shared L3 cache; the architecture is Intel Haswell.

The pattern I see is the following:

Size 1024 Loops 10240000 Cycles 7.60162
Size 2048 Loops 10240000 Cycles 7.14387
Size 4096 Loops 10240000 Cycles 7.74612
Size 8192 Loops 10240000 Cycles 6.93018
Size 16384 Loops 10240000 Cycles 7.32189
Size 32768 Loops 10240000 Cycles 7.84709
Size 65536 Loops 10240000 Cycles 8.32192 # <- L1 cache is 32k so I expect a step here
Size 131072 Loops 10240000 Cycles 7.51579
Size 262144 Loops 10240000 Cycles 9.07455
Size 524288 Loops 10240000 Cycles 16.1824 # <- L1 step is here instead / here I would expect a step associated with L2
Size 1048576 Loops 10240000 Cycles 19.0783
Size 2097152 Loops 10240000 Cycles 11.633
Size 4194304 Loops 10240000 Cycles 23.773 # <- L2 step is here instead 
Size 8388608 Loops 10240000 Cycles 24.2754
Size 16777216 Loops 10240000 Cycles 61.0624 # <- L3 step is here, apparently (makes sense, since L3 is shared, that it comes a bit earlier than expected)
Size 33554432 Loops 10240000 Cycles 57.5953
Size 67108864 Loops 10240000 Cycles 44.3678

Now, admittedly it could be that I am simply unable to measure L1 and am taking the L2 step for the L1 one, but if I look at a previous answer (from the 32-bit era, I guess), I see a telling line (note the * 4):

test_cache(attempts, cache_sizes[i] * 4, latencies, sizeof(latencies) / sizeof(*latencies));

So, for some reason, the person who answered also tested latency in a buffer 4 times the nominal size of the cache.

I am puzzled - what is the reason for that?

Here is my code (it runs on macOS with clang but can easily be adapted to other systems).

mwe.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>
#include <fcntl.h>
#include <unistd.h>

#define N (134217728)
#define START_N (1024)

extern uint64_t access_random_place_to_place_memory_dep(uint64_t *, size_t, size_t);

int main (int argc, char** argv) {
    unsigned long long n = N, ta, tb;
    unsigned long long int loops = 10240000;

    // create buffer of random memory
    uint64_t *p = malloc(n);
    uint64_t res;
    int randomData = open("/dev/urandom", O_RDONLY);
    read(randomData, p, n);

    // result arrays
    double results[64];
    size_t memories[64];
    int i = 0;
    for (int working_memory=START_N; working_memory < n; working_memory <<= 1) {
        ta = _rdtsc();
        res = access_random_place_to_place_memory_dep(p, working_memory, loops);
        tb = _rdtsc();
        memories[i] = working_memory;
        results[i] = (double)(tb-ta)/loops;
        i++;
    }
    free(p);

    for (int j=0; j<i; j++) {
        printf("Size %zu Loops %llu Cycles %g\n", memories[j], loops, results[j]);
    }
    return res;
}

mwe.s

.intel_syntax

.global _access_random_place_to_place_memory_dep

.section __DATA,__data


.section __TEXT,__text

/*
Access memory randomly, by pointer chasing,
each stride depends on the last read
*/
// (rdi, rsi, rdx) -> 
// (rdi) pointer to buffer
// (rsi) size of buffer (power of 2)
// (rdx) iterations
// no need to pad stack since no function call
_access_random_place_to_place_memory_dep:
    mov rbx, [rdi]
    xor rcx, rcx
// will use as AND mask
    dec rsi 
beginz:
    cmp rdx, rcx
    jbe endz
    inc rcx
    and rbx, rsi
    lea rbx, [rdi + rbx]
    mov r8, [rbx]
    add rbx, r8
    jmp beginz
endz:
    mov rax, rbx
    ret

To compile: `clang mwe.c mwe.s -o mwe`, then run with `./mwe`.


EDIT (2022/3/23)

Thanks for the corrections! Here is the current version and its output. I hope it helps somebody, since I was not able to find code like this easily myself.

mwe.c

#include <stdlib.h>
#include <stdio.h>
#include "x86intrin.h"
#include <fcntl.h>
#include <unistd.h>

#define N (134217728)
#define START_N (128)

extern uint64_t access_random_place_to_place_memory_dep(uint64_t *, size_t, size_t);

void place_random_shuffle(uint64_t *p, uint64_t max_offset) {
    uint64_t max_offset_q = max_offset/8;

    // start by writing for each qword its own offset
    for (uint64_t i=0; i<max_offset_q; i++) {
        p[i] = 8*i;
    }

    // then shuffle (Fisher Yates shuffling)
    for (uint64_t i=0; i<max_offset_q-1; i++) {
        uint64_t t;
        uint64_t j = (rand() % (max_offset_q-i)) + i;
        t = p[i];
        p[i] = p[j];
        p[j] = t;
    }

}


int main (int argc, char** argv) {
    unsigned long long n = N, ta, tb;
    unsigned long long int loops = 10240000;

    // create buffer of random memory
    uint64_t *p = malloc(n);
    uint64_t res;

    // result arrays
    double results[64];
    size_t memories[64];
    int i = 0;
    for (int working_memory=START_N; working_memory < n; working_memory <<= 1) {
        place_random_shuffle(p, working_memory);
        ta = _rdtsc();
        res = access_random_place_to_place_memory_dep(p, working_memory, loops);
        tb = _rdtsc();
        memories[i] = working_memory;
        results[i] = (double)(tb-ta)/loops;
        i++;
    }
    free(p);

    for (int j=0; j<i; j++) {
        printf("Size %zu Loops %llu Cycles %g\n", memories[j], loops, results[j]);
    }
    return res;
}

mwe.s

.intel_syntax

.global _access_random_place_to_place_memory_dep

.section __DATA,__data

.section __TEXT,__text

/*
Access memory randomly, by pointer chasing,
each stride depends on the last read
*/
// (rdi, rsi, rdx) -> 
// (rdi) pointer to buffer
// (rsi) size of buffer
// (rdx) iterations (must be >1)
// no need to pad stack since no function call
_access_random_place_to_place_memory_dep:
    xor rax, rax
    xor r8, r8
beginz:
    mov rax, [rdi+rax]
    add r8, rax
    dec rdx
    jnz beginz
endz:
    mov rax, r8
    ret

Output is much better now!

Size 128 Loops 10240000 Cycles 4.51077
Size 256 Loops 10240000 Cycles 4.67502
Size 512 Loops 10240000 Cycles 4.46404
Size 1024 Loops 10240000 Cycles 4.47518
Size 2048 Loops 10240000 Cycles 4.67881
Size 4096 Loops 10240000 Cycles 4.5293
Size 8192 Loops 10240000 Cycles 4.36537
Size 16384 Loops 10240000 Cycles 4.56763
Size 32768 Loops 10240000 Cycles 4.59288
Size 65536 Loops 10240000 Cycles 8.66269 # <- L1 step
Size 131072 Loops 10240000 Cycles 9.48717
Size 262144 Loops 10240000 Cycles 15.2417
Size 524288 Loops 10240000 Cycles 27.2223 # <- L2 (?) step
Size 1048576 Loops 10240000 Cycles 47.3154
Size 2097152 Loops 10240000 Cycles 104.716 # <- we start to be out of L3
Size 4194304 Loops 10240000 Cycles 177.333
Size 8388608 Loops 10240000 Cycles 218.087
Size 16777216 Loops 10240000 Cycles 234.883
Size 33554432 Loops 10240000 Cycles 242.916
Size 67108864 Loops 10240000 Cycles 260.416

And the "cycles" are actually "psudocycles" because of turbo mode, but I don't know how to do take that into account.

Ralph
  • Scaling reference cycles (which you're calling pseudocycles) to core clock cycles can most easily be done by assuming the CPU runs at max turbo throughout, and hard-coding the ratio. For size 1, your loop should run at exactly 5 cycles per iteration. (Or use an `imul eax,eax` chain which is 3 cycles per imul). If you measure wall-clock time and TSC ticks, you can find that frequency and the known core clock cycles. (Or maybe google the TSC frequency on your exact CPU model. e.g. Linux calibrates it, like `tsc: Refined TSC clocksource calibration: 4008.057 MHz` kernel log on my i7-6700k) – Peter Cordes Mar 23 '22 at 18:41
  • If you had something like Linux `perf` or Intel VTune, you could also have it count hardware PMU events for core cycles and ref-cycles. (`cpu_clk_unhalted.ref_tsc` and `cpu_clk_unhalted.thread`). That would directly give you the core vs. TSC ratio for a given run. – Peter Cordes Mar 23 '22 at 18:45

1 Answer


With truly random data straight from /dev/urandom, you don't know that you're touching all of it. You could easily get into a cycle that touches only a fraction of the buffer: after touching some fraction of the cache lines, you will randomly hit a byte offset you've accessed before, creating a cycle whose length depends on how far back in the sequence that offset was. I could easily believe 1/4 or 1/8th of the total cache lines as plausible for a typical random run.

And/or there may be some locality within the cycle, like staying within some subset of cache lines and touching them multiple times before getting to other cache lines.
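
A quick way to see this without timing anything is to simulate the chase over random data and count how many distinct cache lines it actually visits. This is only an illustration (and it approximates the byte-offset walk with aligned qword loads), not part of the benchmark:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simulation only: fill a 64 KiB "buffer" with random qwords, walk it the way
   the benchmark does, and count how many distinct 64-byte lines get touched.
   The walk settles into a cycle long before the step budget runs out. */
int main(void) {
    const uint64_t size = 65536, nq = size / 8, mask = size - 1;
    uint64_t *buf = malloc(size);
    char *seen = calloc(size / 64, 1);           // one flag per cache line

    for (uint64_t i = 0; i < nq; i++)
        buf[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();

    uint64_t off = buf[0], touched = 0;
    for (uint64_t k = 0; k < 16 * size; k++) {   // far more steps than possible states
        off &= mask;
        if (!seen[off / 64]) { seen[off / 64] = 1; touched++; }
        off += buf[off / 8];                     // aligned approximation of the chase
    }
    printf("touched %llu of %llu lines\n",
           (unsigned long long)touched, (unsigned long long)(size / 64));
    free(buf); free(seen);
    return 0;
}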

What you want is a random permutation of qword offsets 0..n-1, I think. (Or 0, 8, 16, ... 8*(n-1) since you're using them as byte offsets, unscaled.) Or for smaller sizes, since you're masking away high bytes, you could treat it as a buffer of 32-bit or even 24-bit values; doing 8-byte loads seems likely to introduce occasional cache-line splits even though you'll discard the high bytes.

Something that guarantees that you touch each qword once before coming back to the same one. (Or even better, that on a cache-line basis, maybe only using one qword of each cache line.)

Using arbitrary byte offsets that can partially overlap each other makes it harder to avoid creating a cycle with a shorter period, one that doesn't touch every cache line.
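
One way to get that guarantee (a step beyond a plain shuffle of offsets) is to shuffle a visiting order over the cache lines and then link each line to the next one in that order, so the chase is a single cycle touching one qword per line. A rough sketch; the function name and layout are mine, not from the question's code:

#include <stdint.h>
#include <stdlib.h>

/* Sketch: write, into the first qword of each 64-byte line, the byte offset
   of the next line to visit. Linking the lines along one shuffled visiting
   order guarantees a single cycle that touches every line exactly once per
   lap, using only one qword per cache line. */
static void make_single_cycle(uint64_t *buf, uint64_t size_bytes) {
    uint64_t nlines = size_bytes / 64;
    uint64_t *order = malloc(nlines * sizeof *order);

    for (uint64_t i = 0; i < nlines; i++)        // identity visiting order
        order[i] = i;
    for (uint64_t i = nlines - 1; i > 0; i--) {  // Fisher-Yates shuffle of the order
        uint64_t j = rand() % (i + 1);
        uint64_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (uint64_t k = 0; k < nlines; k++) {      // link k-th visited line to (k+1)-th
        uint64_t cur = order[k], next = order[(k + 1) % nlines];
        buf[cur * 8] = next * 64;                // 8 qwords per 64-byte line
    }
    free(order);
}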


Your asm function clobbers RBX without saving/restoring it; RBX is call-preserved in all the normal x86 calling conventions, including x86-64 SysV. That's unlikely to be the cause of the symptom you're seeing, even if the caller depends on RBX surviving the call, but you should fix it anyway. You aren't using RAX until the end, so you might as well just use RAX instead of RBX.

Also, you could simplify the loop by putting dec rdx / jnz beginz at the bottom, like a do{}while(--rdx) on the assumption that the iteration count arg is non-zero. (Or if you don't want to assume that, test rdx,rdx / jz .end to skip the loop, still allowing an efficient loop with a single branch at the bottom.) Why are loops always compiled into "do...while" style (tail jump)?

And I don't see the point of the LEA; just use the indexed addressing mode with mov or even add rbx, [rdi+rbx] (which can micro-fuse into a load+add uop, and on Haswell stay micro-fused even in the ROB despite the indexed addressing mode).

Haswell's L1d load-use latency is 5 cycles except when truly pointer-chasing (with an addressing mode like [reg + 0..2047] where the reg value came directly from a load, no and involved). And where the +disp displacement doesn't take it into the next page. See Is there a penalty when base+offset is in a different page than the base?

Also note that the RDTSC reference frequency is usually not the CPU frequency unless you disable turbo; with turbo off, on a Haswell they are normally quite close, since older CPUs especially have a TSC frequency very close to their rated sustained non-turbo frequency. With turbo, the actual CPU frequency is a lot higher, especially for low-power laptop CPUs running low-throughput workloads like pointer-chasing that don't generate much heat.

That's fine, as long as you realize what the TSC is vs. the actual core clock, but don't assume the numbers are the actual length in core cycles of your loop-carried dep chain (which includes and/lea/add as well as a load, so it should be 8 core clock cycles with all L1d hits, or 7 if you remove the useless LEA and let that addition happen in the load-port AGU).
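
One rough way to measure the TSC frequency is to calibrate it against wall-clock time and then scale by your CPU's core clock. A minimal sketch, assuming clock_gettime is available (macOS 10.12+ or Linux); the 3.0 GHz core clock is just a placeholder for whatever your CPU actually sustains:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

/* Estimate the TSC frequency by counting TSC ticks over a fixed wall-clock
   interval. Dividing an assumed (or separately measured) core frequency by
   this gives the factor that converts "TSC ticks per iteration" into core
   cycles per iteration. */
static double tsc_hz(void) {
    struct timespec t0, t1;
    double elapsed;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = _rdtsc();
    do {                                         // busy-wait ~100 ms of wall-clock time
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    } while (elapsed < 0.1);
    uint64_t c1 = _rdtsc();
    return (c1 - c0) / elapsed;
}

int main(void) {
    double f_tsc  = tsc_hz();
    double f_core = 3.0e9;                       // placeholder: your CPU's actual core clock
    printf("TSC ~ %.1f MHz; multiply the measured numbers by %.3f for core cycles\n",
           f_tsc / 1e6, f_core / f_tsc);
    return 0;
}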

Peter Cordes
  • Thank you so much!! I corrected the code according to what you wrote. You seem to be much more of an expert than me; I am just a curious programmer. Could you look at the corrected version? I would be grateful for any insight! – Ralph Mar 23 '22 at 13:18
  • Another question, if I may. For very small sizes on a Linux Ivy Bridge E machine (the CPU going back and forth around the same 14-15 addresses, or the same two cache lines) I see a consistently higher latency than with slightly larger sizes: ( Size 112 Cycles 14.1617; Size 120 Cycles 14.1738; Size 128 Cycles 14.1199; Size 136 Cycles 8.23; Size 144 Cycles 5.00002) Do you happen to know why? – Ralph Mar 23 '22 at 15:22
  • @Ralph: This is with your new code that won't do misaligned accesses, right? That's why non-power-of-2 sizes are testable at all, since you're not using AND to modulo. IDK, seems strange. With everything dependent, there shouldn't be any bank conflicts, and the lines involved are contiguous so we shouldn't be getting conflict misses. – Peter Cordes Mar 23 '22 at 18:27
  • @Ralph: Your updated asm looks more sensible, and so does C that does a shuffle. You could save the `mov rax,r8` instruction at the end if you swap how you use those registers, like `mov r8, [rdi+r8]` / `add rax, r8`. IDK why you want to sum the offsets you've seen, but that's fortunately off the critical path, with just MOV feeding MOV. main doesn't currently use the return value as a checksum or anything. (If you were loading actual pointers with `mov rdi, [rdi]`, from a shuffled array that started as `p[i] = &p[i]`, you'd see the 4 cycle fast-path, with everything being 1 cycle shorter.) – Peter Cordes Mar 23 '22 at 18:35
  • @Ralph: If you still can't explain that result, you could post a new question with a [mcve] for your test code so people can run that on their own machines with profiling. The variation seems too big for varying turbo frequency, and you run plenty long enough that warm-up effects won't be significant. – Peter Cordes Mar 25 '22 at 00:56
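
For reference, the "actual pointers" variant mentioned in the comment above would look roughly like this. It is illustrative only; the buffer size, iteration count, and names are mine:

#include <stdint.h>
#include <stdlib.h>

/* Illustration of the pointer variant: chase real pointers so each step
   compiles to a single dependent load (like mov rdi, [rdi]), which is
   eligible for Haswell's 4-cycle L1d fast path. */
int main(void) {
    size_t n = 32768 / sizeof(void *);           // working set that fits in L1d
    void **buf = malloc(n * sizeof(void *));

    for (size_t i = 0; i < n; i++)               // start as p[i] = &p[i]
        buf[i] = &buf[i];
    for (size_t i = n - 1; i > 0; i--) {         // then shuffle the stored pointers
        size_t j = rand() % (i + 1);
        void *t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    void **p = (void **)buf[0];                  // chase: each load feeds the next address
    for (long k = 0; k < 10000000; k++)
        p = (void **)*p;

    free(buf);
    return (int)(uintptr_t)p;                    // keep the chain live
}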