What is responsible for a RAM bottleneck when performing random memory accesses?

Question

I have a program which accesses single bytes in a large array at random. Since this array exceeds the L2 cache, it requires many queries to RAM. I created a benchmark which emulates this program by generating random numbers and querying a large array. It seems like the benchmark is under-performing relative to my RAM's advertised speed.

I have DDR4-2933 RAM which is supposed to handle 2933 MT/s. The maximum transfer rate I have been able to achieve is 56.8 MT/s. What would be the bottleneck preventing faster execution? If I had to speculate, I might say the CPU could only reorder instructions within some fixed window and this would limit the parallelization of the fetches. Although, I have no evidence beyond the benchmarks.

Benchmark & Methodology

I created a short program which populates a large array with random numbers. Then, it loads values at random offsets from the array and XORs them together. Memory accesses should be reorderable.

#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>

/* XOROSHIRO PRNG */

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

static uint64_t s[4];

uint64_t next(void) {
    const uint64_t result = rotl(s[1] * 5, 7) * 9;

    const uint64_t t = s[1] << 17;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;

    s[3] = rotl(s[3], 45);

    return result;
}

/* BENCHMARK */

int main(int argc, char ** argv) {
    char const * prog = argc > 0 ? argv[0] : "[program]";

    if(argc != 3) {
        fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs]", prog);
        exit(1);
    }

    unsigned int sshift = strtol(argv[1], NULL, 10);
    uint64_t calls = strtol(argv[2], NULL, 10);

    size_t size = 1ull << sshift;
    uint64_t smask = (1ull << sshift) - 1;

    uint8_t * buf = malloc(size);
    if(!buf) {
        fprintf(stderr, "no mem");
        exit(1);
    }

    // Seed PRNG with magic values
    s[0] = 0x0BE38E2AC;
    s[1] = 0x23D933C53;
    s[2] = 0xE72482E32;
    s[3] = 0x35C339D23;

    for(size_t i = 0; i < size; i++) {
        buf[i] = next();
    }

    clock_t start = clock();
    double cpu_time_used;

    uint8_t val = 0;
    for(size_t i = 0; i < calls; i++) {
        uint64_t idx = next() & smask;
        val ^= buf[idx];
    }

    cpu_time_used = ((double) (clock() - start)) / CLOCKS_PER_SEC;

    printf("time: %.3f\n", cpu_time_used);
    printf("calls: %ld\n", calls);
    printf("time/call: %.3e\n", cpu_time_used / calls);
    printf("MT / sec: %.3e\n", calls / cpu_time_used / 1e6);
    printf("val: %d\n", val); // ensure val is not optimized out
}

All benchmarks were executed on a Intel(R) Core(TM) i9-10885H CPU running at 2.40 GHz with CPU scaling disabled. The program was compiled with gcc -O3 main.c -o main -g -O3 -flto and called with nice -n -2 ./main 31 1000000000 for an array of size 2^31 and 1e9 accesses. The number of transactions per second issued to RAM can be approximated by the number of bytes fetched since virtually all of the lookups result in cache missed. I tested this using perf and it seemed like around 95% of cache lookups resulted in misses.

The random number generator occupies about 7.1% of program overhead. This was measured by removing the line val ^= buf[idx]; which executes the fetch. perf reported around 99% of memory overhead was due to last-level cache misses.

Follow-up Benchmarks

Loop unrolling:

By default, GCC 12.2.0 did not unroll the loop. It took some cajoling. I replaced:

for(size_t i = 0; i < calls; i++) {
    uint64_t idx = next() & smask;
    val ^= buf[idx];
}

with:

for(size_t i = 0; i < batches; i++) {
#pragma GCC unroll 3
    for(size_t j = 0; j < 8; j++) {
        uint64_t idx = next() & smask;
        val ^= buf[idx];
    }
}

I look five timings at 2.0 GHz. The version without loop unrolling seemed to perform better but not significantly. I'm not very surprised since the branch would seem highly predictable.

   no unrolling  unrolling
0        23.249     24.287
1        24.044     24.520
2        23.455     24.358
3        23.233     24.167
4        22.969     23.833

Multi-processing

I implemented a threaded version of the benchmark. It evenly divides the accesses among the threads. I also switch to measuring wall clock time. Here are the measurements (taken at 2.4 GHz):

   threads    time        MT/s
0        1  17.345   57.653502
1        2   8.949  111.744329
2        4   5.022  199.123855
3        8   3.533  283.045570
4       16   3.203  312.207306
5       32   3.200  312.500000
6       64   3.173  315.159155
7      124   3.165  315.955766

Modified program:

#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
#include <threads.h>
#include <pthread.h>

/* XOROSHIRO PRNG */

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

static thread_local uint64_t s[4];

static uint64_t next(void) {
    const uint64_t result = rotl(s[1] * 5, 7) * 9;

    const uint64_t t = s[1] << 17;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;

    s[3] = rotl(s[3], 45);

    return result;
}

struct xor_args {
    size_t iters;
    uint8_t * buf;
    uint8_t res;
    uint64_t smask;
    pthread_t id;
};

#include <sys/random.h>

void * xor_worker(struct xor_args * a) {
    if(-1 == getrandom(&s, sizeof(s), GRND_RANDOM)) {
        fprintf(stderr, "getrandom() failed\n");
        exit(1);
    }

    uint8_t val = 0;
    for(size_t i = 0; i < a->iters; i++) {
        uint64_t idx = next() & a->smask;
        val ^= a->buf[idx];
    }

    a->res = val;

    return NULL;
}

/* BENCHMARK */

int main(int argc, char ** argv) {
    char const * prog = argc > 0 ? argv[0] : "[program]";

    if(argc != 4) {
        fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs] [threads]", prog);
        exit(1);
    }

    unsigned int sshift = strtol(argv[1], NULL, 10);
    uint64_t calls = strtol(argv[2], NULL, 10);
    int threads = strtol(argv[3], NULL, 10);

    size_t size = 1ull << sshift;
    uint64_t smask = (1ull << sshift) - 1;

    uint8_t * buf = malloc(size);
    if(!buf) {
        fprintf(stderr, "no mem");
        exit(1);
    }

    // Seed PRNG with magic values
    s[0] = 0x0BE38E2AC;
    s[1] = 0x23D933C53;
    s[2] = 0xE72482E32;
    s[3] = 0x35C339D23;

    for(size_t i = 0; i < size; i++) {
        buf[i] = next();
    }

    struct timespec start, finish;
    double elapsed;

    clock_gettime(CLOCK_MONOTONIC, &start);

    struct xor_args * args = malloc(sizeof(struct xor_args) * threads);

    int s;
    for(int i = 0; i < threads; i++) {
        struct xor_args * arg = &args[i];

        arg->iters = calls / (uint64_t) threads;

        if(i == 0) {
            args->iters += calls % threads;
        }

        arg->smask = smask;
        arg->buf = buf;

        s = pthread_create(&arg->id, NULL, (void * (*)(void *)) xor_worker, arg);
        if(s != 0) {
            fprintf(stderr, "pthread_create() failed");
            exit(1);
        }
    }

    uint8_t val = 0;
    for(int i = 0; i < threads; i++) {
        struct xor_args * arg = &args[i];
        s = pthread_join(arg->id, NULL);
        if(s != 0) {
            fprintf(stderr, "pthread_join() failed");
            exit(1);
        }

        val ^= arg->res;
    }

    clock_gettime(CLOCK_MONOTONIC, &finish);

    elapsed = (finish.tv_sec - start.tv_sec);
    elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;

    printf("time: %.3f\n", elapsed);
    printf("calls: %ld\n", calls);
    printf("time/call: %.3e\n", elapsed / calls);
    printf("M T / sec: %.3e\n", calls / elapsed / 1e6);
    printf("val: %d\n", val); // ensure val is not optimized out
}

I would assume the latency is more significant than the MT/s here since it's random access. https://www.eatyourbytes.com/ddr4-ram-data-transfers-speed-latency-and-voltage/ will give a bit of an explainer. — Joachim Isaksson, Dec 25 '22 at 10:12
I just took some data on how the clock speed affects execution. It seems like execution time decreases with higher clock speeds. At 1.6 GHz, I measured an execution time of 26.43 seconds. At 2.8 GHz, I measured 14.73 seconds. It seems like the program becomes more efficient as the clock speed increases. This is interesting because perf verified that 78% of the program overhead was in the memory fetch at 1.6 GHz. The memory accesses somehow seem to be parallelized better at higher clock frequencies. — Moss Richardson, Dec 25 '22 at 10:50
Branch condition at the end of each loop stalls the execution pipeline. Unroll a few loop iterations and see what effect is. — stark, Dec 25 '22 at 12:22
You could try running many processes at the same time to benefit from latency hiding. Why would you run a benchmarking program with nice? — Simon Goater, Dec 25 '22 at 13:05
@EricPostpischil Yes, that would definitely increase the throughput. I'm specifically interested in benchmarking the transaction throughput and not the byte throughput. Large integer hash tables and bloom filters are both data structures which would be limited by this access pattern. — Moss Richardson, Dec 25 '22 at 18:49
@stark I didn't find timings to back this up. Is there anything I missed? — Moss Richardson, Dec 25 '22 at 23:12
@SimonGoater I added multiprocessing timings. It did speed things up but the bottleneck is in the CPU. With regards to `nice`, I read [here](https://zsmith.co/bandwidth.php) that Linux may rate limit access to the memory bus. I don't think it was actually effective in my case since it's a CPU bottleneck. — Moss Richardson, Dec 25 '22 at 23:15

score 3 · Accepted Answer · answered Dec 25 '22 at 16:43

The program is bound by the memory hierarchy. Indeed, on modern processor, accessing 1 byte from the RAM causes the whole cache line to be fetched from the RAM. This means 64 bytes on your processor. One channel of DDR4-2933 RAM can reach a theoretical bandwidth of 8*2933e6/1024**3 = 21.85 GiB/s. However, regarding the fact that the cache is useless here and that 64 bytes are fetched, the maximum throughput is 21.85/64*1024 = 350 MiB/s. Based on this, the program should at least take 1000000000/(350*1024**2) = 2.72 s.

In practice, no modern processor can fully saturate a DDR4 memory (especially using only 1 core). Modern RAM are designed to be fast mainly for contiguous access, not random one, despite the name. This is because speeding up random accesses is very hard and reading contiguous data is frequent (and far simpler to optimize at the hardware level).

The main problem with DDR RAM is its latency which is huge since several decades (50-120 ns per cache line fetch). It has not changed much since the last decade while processors have become significantly faster. As a result, it is critical to be able to mitigate this latency by sending a lot of simultaneous requests, that is increasing the concurrency. However, the execution flow is sequential and the number of simultaneous in-flight requests that 1 core can achieve is limited. Modern processors can execution few instructions in parallel from a sequential program as long as the instructions to be executed are independent. The thing is the program is pretty sequential due to the random number generator though multiple instructions can still be executed in parallel.

Multiple read requests can be done concurrently, but the size of the buffers to receive data from the RAM and the caches is limited. Intel processor units dedicated for that is the line-fill buffer (LFB) between the L1 and L2 cache, the super-queue between the L2 and the L3, and the integrated memory controller (iMC) between the L3 and the RAM. You can use perf to check which one is the bottleneck. AFAIK, the LFB has about 12-14 slots on your target processor so it should not be a bottleneck here. The super-queue can read 32-bytes/cycle, that is, 1 cache-line every 2 cycle, or 1 byte of buf every 2 cycles. That being said the L3 latency is pretty big: at least about a dozen of nanoseconds. In general, prefetching units are used to mitigate the latency but random accesses are unpredictable so hardware prefetchers are completely useless in this case.

Virtual memory make random accesses slower. Indeed, when a cache-line is fetched, the processor needs to translate a virtual address to a physical one. This is done by the Translation Lookaside Buffer (TLB) unit of a processor core. It is basically a cache able to translate the address of a virtual memory page to a physical one. There are multiple TLB units for different caches and the size of the cache is dependent of the size of the pages. Pages are typically 4K ones on your processor unless your OS decides to use huge pages in this specific case. Assuming its does not, the biggest TLB (STLB) has 1536 entries for 4K pages. This means it cannot be efficient for random accesses done on a 1536*4K = 6 MiB buffer. Beyond this limit, the number of TLB miss will be frequent. When a TLB miss happen, the processor needs to call special kernel functions so to know how to translate the virtual addresses based on kernel data structures. This operation is very expensive (like any kernel calls in general) so it introduces a significant additional latency. If huge pages are used, then the threshold is 3 GiB (for pages of 2 MiB) or even 16 GiB (for 1 GiB pages). Using huge pages can thus significantly speed things up compared to basic 4K pages.

Assuming the processor has large enough buffers and TLB caches so not to limit the concurrency, the DDR4-SDRAM are typically the main bottleneck. Indeed, it does not only has a huge latency, but it is also optimized for contiguous accesses. Indeed, such RAM devices are split in banks so to reduce the device latency. Contiguous memory requests can be efficiently managed by multiple banks, but random requests are often significantly less efficient due to bank-conflict: when 2 requests are assigned to the same banks, they are processed serially rather than concurrently. While random requests tends to to be spread amongst banks, this load-balancing is not as good as contiguous request resulting in a bit lower throughout.

In the end, the speed of the program is certainly latency-bound and the time taken is calls * latency / concurrency. It looks like latency / concurrency is 10-15 ns in your case.

One solution is to speed up a bit this program is to use prefetching instructions. It often help to hide the latency by telling the processor to prefetch data in the caches (or temporary buffers) early so subsequent accesses are faster since data are supposed to be closer from computing units. This optimization is brittle : data should be prefetched early enough (otherwise the processor will stall waiting due to cache misses) and data should be kept in the target caches (not be evicted by newer prefetching instructions). In your case, this is not so easy since the index is random. Prefetching relatively large chunks so to then read the values faster should help a bit. It should be slower for small arrays and faster for big ones. Note that software prefetching is not a silver-bullet : it is generally not as good as hardware prefetching (due to the limited hardware concurrency).

An alternative solution is generally to use multiple cores (thanks to multiple threads). That being said, this is not easy too here due to the (inherently sequential) random number generator (RNG) unless it is Ok to use multiple RNG. In this case, fetching full cache lines is generally the main bottleneck due to the high concurrency, the small RAM bandwidth and mainly due to the poor efficiency (1 byte used over 64). Unfortunately, there is currently no way to efficiently extract single bytes from a DDR4-SDRAM using any mainstream x86-64 processor (including yours).

For more information, please read:

What is responsible for a RAM bottleneck when performing random memory accesses?

Benchmark & Methodology

Follow-up Benchmarks

Loop unrolling:

Multi-processing

1 Answers1