In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking, there is one gigantic array which is divided up into 16-word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}

// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}

void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}

int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);
    printf("Total size: %.1f MiB\n", total_MiB);
    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB); // (1)
    // Run it
    t_ms = clz_arrays(counts, buf, n); // (2)
    print_stats(t_ms, n, total_MiB);
    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;
    // And again...
    (void) clz_arrays(counts, buf, n); // (3)
    t_ms = clz_arrays(counts, buf, n); // (4)
    print_stats(t_ms, n, total_MiB);
    free(counts);
    free(buf);
    return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function, it is on uninitialised memory.
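(Since buf comes straight from malloc and nothing is written into it before calls (1) and (2), those calls may be reading pages that the kernel has not yet backed with distinct physical memory. A minimal sketch of a variant that would rule this in or out is to fault the whole buffer in right after the malloc; touch_pages is a throwaway helper of my own, not part of the program above.)

#include <string.h>

// Sketch only: fault in and dirty every page of buf before the warm-up,
// so no timed call runs over untouched pages. The value written is
// irrelevant to the lzcnt timing; the memset exists purely to force the
// pages to be committed.
static void touch_pages(uint64_t *buf, long total_bytes) {
    memset(buf, 0x5a, (size_t)total_bytes);
}

Calling touch_pages(buf, total_bytes) immediately after the malloc in main() would make even the warm-up call run over committed memory.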
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles, but it has a throughput of one operation per cycle (see Agner Fog's Skylake instruction tables), so, with 8-byte words (using uint64_t) at 3.5GHz, the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GB/s, i.e. roughly 26700 MiB/s, which is pretty close to the first number printed above. Even at 2.6GHz we should get close to 20.8 GB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
- According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case (an example invocation is sketched after this list). My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
- The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
- I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
- Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
- All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
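The kind of perf invocation I mean is, for example (illustrative only; the exact event names vary with perf and kernel versions):

$ perf stat -e LLC-loads,LLC-load-misses ./clz-timing 10000000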
Any help would be most appreciated!