I have a problem with the consistency of performance counters in my benchmark application: the reported number of L1 data cache misses occasionally exceeds the number of L1 data cache loads, which is not what I expect. The benchmark fills an array with random numbers, and I (try to) measure the L1 data cache loads and misses while reading the array from multiple threads.
The Code (C++17):
extern "C" {
#include <papi.h>
#include <pthread.h>
#include <sched.h> // cpu_set_t, CPU_ZERO, CPU_SET used for pinning below
}
#include <algorithm>
#include <atomic>
#include <iostream>
#include <random>
#include <sstream>
#include <thread>
#include <vector>
#include <memory>
int main(int argc, char* argv[]) {
if (int ret = PAPI_library_init(PAPI_VER_CURRENT);
ret != PAPI_VER_CURRENT) {
std::cerr << "Error initializing PAPI\n";
return 1;
}
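// Register pthread_self as the thread-id function so PAPI can manage per-thread event sets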
if (int ret = PAPI_thread_init((unsigned long (*)(void))(pthread_self));
ret != PAPI_OK) {
std::cerr << "Error initializing threads for PAPI\n";
return 1;
}
// Read the number of threads from cmdline (unchecked)
int num_threads;
std::stringstream buf(argv[1]);
buf >> num_threads;
// 0-Initialize a vector with 2^28 elements and fill it with random numbers
std::clog << "Initializing vector\n";
std::vector<unsigned> v(1ul << 28);
std::default_random_engine rng{};
std::uniform_int_distribution<unsigned> dist(0, 100);
std::clog << "Generating data\n";
std::generate(v.begin(), v.end(), [&dist, &rng]() { return dist(rng); });
std::vector<std::thread> threads(num_threads);
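// One (accesses, misses) pair per thread, filled in by PAPI_stop in each worker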
auto counter = std::make_unique<long long[][2]>(num_threads);
// Dummy to enforce reading the array
std::atomic_uint total{0};
std::clog << "Starting threads\n";
for (int i = 0; i < num_threads; ++i) {
threads[i] = std::thread{
[&total, &v](int t, int from, int to,
long long cnt[2]) {
// Pin thread to physical core
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(t, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
&cpuset);
int event_set = PAPI_NULL;
if (int ret = PAPI_create_eventset(&event_set);
ret != PAPI_OK) {
std::abort();
}
if (int ret = PAPI_add_event(event_set, PAPI_L1_DCA);
ret != PAPI_OK) {
std::abort();
}
if (int ret = PAPI_add_named_event(
event_set, "perf::L1-DCACHE-LOAD-MISSES");
ret != PAPI_OK) {
std::abort();
}
if (int ret = PAPI_start(event_set); ret != PAPI_OK) {
std::abort();
}
// Do the "work"
unsigned sum = 0;
for (int i = from; i < to; ++i) {
sum += v[i];
}
if (int ret = PAPI_stop(event_set, cnt);
ret != PAPI_OK) {
std::abort();
}
// Write to dummy
total.fetch_add(sum);
},
i * 2 /* On my setup, logical CPUs (0,1), (2,3), etc. are SMT siblings on the same physical core */,
i * (v.size() / num_threads),
(i + 1) * (v.size() / num_threads), counter[i]};
}
for (auto& t : threads) {
t.join();
}
std::clog << "Elements: " << v.size() << '\n';
std::clog << "Sum: " << total.load() << '\n';
std::clog << "Different cache lines: " << v.size() * sizeof(int) / 64
<< '\n';
for (int i = 0; i < num_threads; ++i) {
std::clog << "\nAccesses " << i << ": " << counter[i][0] << '\n';
std::clog << "Elements per access " << i << ": "
<< static_cast<double>(v.size() / num_threads) /
static_cast<double>(counter[i][0])
<< '\n';
std::clog << "Misses " << i << ": " << counter[i][1] << '\n';
std::clog << "Elements per miss " << i << ": "
<< static_cast<double>(v.size() / num_threads) /
static_cast<double>(counter[i][1])
<< '\n';
}
}
Compiled with c++ -O3 -lpapi -pthread benchmark.cpp. Note that I'm using an AMD Ryzen CPU where the preset event PAPI_L1_DCM is not available. This is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 PRO 4750U with Radeon Graphics
CPU family: 23
Model: 96
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
[...]
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 4 MiB (8 instances)
L3: 8 MiB (2 instances)
[...]
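As a side note, here is a minimal sketch of how the availability of these events can be checked programmatically (papi_avail and papi_native_avail report the same information on the command line):

extern "C" {
#include <papi.h>
}
#include <iostream>
int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    // Prints 1 if the event can be counted on this machine, 0 otherwise
    std::cout << "PAPI_L1_DCM: " << (PAPI_query_event(PAPI_L1_DCM) == PAPI_OK) << '\n';
    std::cout << "PAPI_L1_DCA: " << (PAPI_query_event(PAPI_L1_DCA) == PAPI_OK) << '\n';
    std::cout << "perf::L1-DCACHE-LOAD-MISSES: "
              << (PAPI_query_named_event("perf::L1-DCACHE-LOAD-MISSES") == PAPI_OK) << '\n';
}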
Most of the time, the result is what I would expect, since each 128-bit vector load fetches four 4-byte integers and a 64-byte cache line holds 16 of them:
$ ./benchmark 4
[...]
Accesses 0: 16786686
Elements per access 0: 3.99774
Misses 0: 4196679
Elements per miss 0: 15.9909
Accesses 1: 16787632
Elements per access 1: 3.99752
Misses 1: 4199753
Elements per miss 1: 15.9792
Accesses 2: 16787441
Elements per access 2: 3.99756
Misses 2: 4199612
Elements per miss 2: 15.9798
Accesses 3: 16787914
Elements per access 3: 3.99745
Misses 3: 4199633
Elements per miss 3: 15.9797
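These numbers line up with a quick back-of-the-envelope check (assuming 4 threads, 128-bit vector loads, and 64-byte cache lines):

#include <cstddef>
#include <iostream>
int main() {
    constexpr std::size_t n = 1ul << 28;       // total elements
    constexpr std::size_t per_thread = n / 4;  // 67108864 elements per thread
    std::cout << "expected accesses per thread: " << per_thread / 4 << '\n';   // 16777216
    std::cout << "expected misses per thread:   " << per_thread / 16 << '\n';  //  4194304
}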
Sometimes, however, the result looks like this:
$ ./benchmark 4
[...]
Accesses 0: 16787790
Elements per access 0: 3.99748
Misses 0: 4197774
Elements per miss 0: 15.9868
Accesses 1: 16787350
Elements per access 1: 3.99759
Misses 1: 4200144
Elements per miss 1: 15.9778
Accesses 2: 16787652
Elements per access 2: 3.99751
Misses 2: 4200045
Elements per miss 2: 15.9781
Accesses 3: 16787064
Elements per access 3: 3.99765
Misses 3: 54986718
Elements per miss 3: 1.22046
Thread 3 reports more than three times as many misses as accesses (about 13 times the expected miss count), which I cannot explain. If I measure L1-dcache-loads and L1-dcache-load-misses with perf stat instead (command below), I get consistent (and expected) results. I also cannot reproduce this on Intel CPUs using the preset event PAPI_L1_DCM. Is my methodology wrong, or is there an explanation for this behavior?
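For reference, the perf stat cross-check was along these lines (event names as reported by perf list on my machine):

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./benchmark 4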