I have a problem with the consistency of performance counters in my benchmark application: the reported number of L1 data cache misses occasionally exceeds the number of L1 data cache loads, which is not what I expect. The benchmark fills an array with random numbers, and I (try to) measure the L1 data cache loads and misses while reading the array from multiple threads.
The Code (C++17):
extern "C" {
#include <papi.h>
#include <pthread.h>
#include <sched.h> // cpu_set_t, CPU_ZERO, CPU_SET used for pinning below
}
#include <algorithm>
#include <atomic>
#include <iostream>
#include <random>
#include <sstream>
#include <thread>
#include <vector>
#include <memory>
int main(int argc, char* argv[]) {
if (int ret = PAPI_library_init(PAPI_VER_CURRENT);
ret != PAPI_VER_CURRENT) {
std::cerr << "Error initializing PAPI\n";
return 1;
}
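// Register pthread_self as the thread-id function so PAPI can manage per-thread event sets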
if (int ret = PAPI_thread_init((unsigned long (*)(void))(pthread_self));
ret != PAPI_OK) {
std::cerr << "Error initializing threads for PAPI\n";
return 1;
}
// Read the number of threads from cmdline (unchecked)
int num_threads;
std::stringstream buf(argv[1]);
buf >> num_threads;
// 0-Initialize a vector with 2^28 elements and fill it with random numbers
std::clog << "Initializing vector\n";
std::vector<unsigned> v(1ul << 28);
std::default_random_engine rng{};
std::uniform_int_distribution<unsigned> dist(0, 100);
std::clog << "Generating data\n";
std::generate(v.begin(), v.end(), [&dist, &rng]() { return dist(rng); });
std::vector<std::thread> threads(num_threads);
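// One (accesses, misses) pair per thread, filled in by PAPI_stop in each worker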
auto counter = std::make_unique<long long[][2]>(num_threads);
// Dummy to enforce reading the array
std::atomic_uint total{0};
std::clog << "Starting threads\n";
for (int i = 0; i < num_threads; ++i) {
threads[i] = std::thread{
[&total, &v](int t, int from, int to,
long long cnt[2]) {
// Pin thread to physical core
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(t, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
&cpuset);
int event_set = PAPI_NULL;
if (int ret = PAPI_create_eventset(&event_set);
ret != PAPI_OK) {
std::abort();
}
if (int ret = PAPI_add_event(event_set, PAPI_L1_DCA);
ret != PAPI_OK) {
std::abort();
}
if (int ret = PAPI_add_named_event(
event_set, "perf::L1-DCACHE-LOAD-MISSES");
ret != PAPI_OK) {
std::abort();
}
if (int ret = PAPI_start(event_set); ret != PAPI_OK) {
std::abort();
}
// Do the "work"
unsigned sum = 0;
for (int i = from; i < to; ++i) {
sum += v[i];
}
if (int ret = PAPI_stop(event_set, cnt);
ret != PAPI_OK) {
std::abort();
}
// Write to dummy
total.fetch_add(sum);
},
i * 2 /* On my setup, logical CPUs (0,1), (2,3), etc. are SMT siblings on the same physical core */,
i * (v.size() / num_threads),
(i + 1) * (v.size() / num_threads), counter[i]};
}
for (auto& t : threads) {
t.join();
}
std::clog << "Elements: " << v.size() << '\n';
std::clog << "Sum: " << total.load() << '\n';
std::clog << "Different cache lines: " << v.size() * sizeof(int) / 64
<< '\n';
for (int i = 0; i < num_threads; ++i) {
std::clog << "\nAccesses " << i << ": " << counter[i][0] << '\n';
std::clog << "Elements per access " << i << ": "
<< static_cast<double>(v.size() / num_threads) /
static_cast<double>(counter[i][0])
<< '\n';
std::clog << "Misses " << i << ": " << counter[i][1] << '\n';
std::clog << "Elements per miss " << i << ": "
<< static_cast<double>(v.size() / num_threads) /
static_cast<double>(counter[i][1])
<< '\n';
}
}
Compiled with c++ -O3 -lpapi -pthread benchmark.cpp. Note that I'm using an AMD Ryzen CPU where the preset event PAPI_L1_DCM is not available. This is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 PRO 4750U with Radeon Graphics
CPU family: 23
Model: 96
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
[...]
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 4 MiB (8 instances)
L3: 8 MiB (2 instances)
[...]
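As a side note, here is a minimal sketch of how the availability of these events can be checked programmatically (papi_avail and papi_native_avail report the same information on the command line):

extern "C" {
#include <papi.h>
}
#include <iostream>
int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    // Prints 1 if the event can be counted on this machine, 0 otherwise
    std::cout << "PAPI_L1_DCM: " << (PAPI_query_event(PAPI_L1_DCM) == PAPI_OK) << '\n';
    std::cout << "PAPI_L1_DCA: " << (PAPI_query_event(PAPI_L1_DCA) == PAPI_OK) << '\n';
    std::cout << "perf::L1-DCACHE-LOAD-MISSES: "
              << (PAPI_query_named_event("perf::L1-DCACHE-LOAD-MISSES") == PAPI_OK) << '\n';
}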
Most of the time, the result is what I would expect, since each 128-bit vector load fetches four 4-byte integers and a 64-byte cache line holds 16 of them:
$ ./benchmark 4
[...]
Accesses 0: 16786686
Elements per access 0: 3.99774
Misses 0: 4196679
Elements per miss 0: 15.9909
Accesses 1: 16787632
Elements per access 1: 3.99752
Misses 1: 4199753
Elements per miss 1: 15.9792
Accesses 2: 16787441
Elements per access 2: 3.99756
Misses 2: 4199612
Elements per miss 2: 15.9798
Accesses 3: 16787914
Elements per access 3: 3.99745
Misses 3: 4199633
Elements per miss 3: 15.9797
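These numbers line up with a quick back-of-the-envelope check (assuming 4 threads, 128-bit vector loads, and 64-byte cache lines):

#include <cstddef>
#include <iostream>
int main() {
    constexpr std::size_t n = 1ul << 28;       // total elements
    constexpr std::size_t per_thread = n / 4;  // 67108864 elements per thread
    std::cout << "expected accesses per thread: " << per_thread / 4 << '\n';   // 16777216
    std::cout << "expected misses per thread:   " << per_thread / 16 << '\n';  //  4194304
}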
Sometimes, however, the result looks like this:
$ ./benchmark 4
[...]
Accesses 0: 16787790
Elements per access 0: 3.99748
Misses 0: 4197774
Elements per miss 0: 15.9868
Accesses 1: 16787350
Elements per access 1: 3.99759
Misses 1: 4200144
Elements per miss 1: 15.9778
Accesses 2: 16787652
Elements per access 2: 3.99751
Misses 2: 4200045
Elements per miss 2: 15.9781
Accesses 3: 16787064
Elements per access 3: 3.99765
Misses 3: 54986718
Elements per miss 3: 1.22046
Thread 3 reports more than three times as many misses as accesses (about 13 times the expected miss count), which I cannot explain. If I measure L1-dcache-loads and L1-dcache-load-misses with perf stat instead (command below), I get consistent (and expected) results. I also cannot reproduce this on Intel CPUs using the preset event PAPI_L1_DCM. Is my methodology wrong, or is there an explanation for this behavior?
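For reference, the perf stat cross-check was along these lines (event names as reported by perf list on my machine):

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./benchmark 4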