In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking, there is one gigantic array which is divided up into 16-word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}

// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}

void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}

int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);
    printf("Total size: %.1f MiB\n", total_MiB);
    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB); // (1)
    // Run it
    t_ms = clz_arrays(counts, buf, n); // (2)
    print_stats(t_ms, n, total_MiB);
    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;
    // And again...
    (void) clz_arrays(counts, buf, n); // (3)
    t_ms = clz_arrays(counts, buf, n); // (4)
    print_stats(t_ms, n, total_MiB);
    free(counts);
    free(buf);
    return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function, it is on uninitialised memory.
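(Since buf comes straight from malloc and nothing is written into it before calls (1) and (2), those calls may be reading pages that the kernel has not yet backed with distinct physical memory. A minimal sketch of a variant that would rule this in or out is to fault the whole buffer in right after the malloc; touch_pages is a throwaway helper of my own, not part of the program above.)

#include <string.h>

// Sketch only: fault in and dirty every page of buf before the warm-up,
// so no timed call runs over untouched pages. The value written is
// irrelevant to the lzcnt timing; the memset exists purely to force the
// pages to be committed.
static void touch_pages(uint64_t *buf, long total_bytes) {
    memset(buf, 0x5a, (size_t)total_bytes);
}

Calling touch_pages(buf, total_bytes) immediately after the malloc in main() would make even the warm-up call run over committed memory.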
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles, but it has a throughput of one operation per cycle (see Agner Fog's Skylake instruction tables), so, with 8-byte words (using uint64_t) at 3.5GHz, the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GB/s, i.e. roughly 26700 MiB/s, which is pretty close to the first number printed above. Even at 2.6GHz we should get close to 20.8 GB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
- According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case (an example invocation is sketched after this list). My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
- The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
- I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
- Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
- All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
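The kind of perf invocation I mean is, for example (illustrative only; the exact event names vary with perf and kernel versions):

$ perf stat -e LLC-loads,LLC-load-misses ./clz-timing 10000000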
Any help would be most appreciated!