
I have a Raspberry Pi 3 with Linux. The RPi3 has a quad-core Cortex-A53 with a Performance Monitoring Unit (PMU) v3. I run the cyclictest program to do some real-time tests. Cyclictest is an application where you set the period and the number of iterations, and it measures the latency. In each iteration it does some work and then sleeps until the next period, when the system wakes it up.

I want to read the cache counter values for each cyclictest execution, to see how many cache misses it has while it is running (I do not want to count the misses that happen while the task is sleeping).

I have tried perf stat, running:

perf stat -o result1.txt -r 10 -i \
    -e armv8_pmuv3/l1d_cache/ \
    -e armv8_pmuv3/l1d_cache_refill/ \
    -e armv8_pmuv3/l1d_cache_wb/ \
    -e armv8_pmuv3/l1d_tlb_refill/ \
    -e armv8_pmuv3/l1i_cache/ \
    -e armv8_pmuv3/l1i_cache_refill/ \
    -e armv8_pmuv3/l1i_tlb_refill/ \
    -e armv8_pmuv3/l2d_cache/ \
    -e armv8_pmuv3/l2d_cache_refill/ \
    -e armv8_pmuv3/l2d_cache_wb/ \
    -e armv8_pmuv3/mem_access/ \
    cyclictest -l57600 -m -n -t1 -p80 -i50000 -h300 -q --histfile=666_data_50

However, it only provides the information for around 50% of the execution time (note the ~54% figures next to each event):

Performance counter stats for 'cyclictest -l57600 -m -n -t1 -p80 -i50000 -h300 -q --histfile=666_data_50' (10 runs):

     937729229      armv8_pmuv3/l1d_cache/                                        ( +-  2.41% )  (54.50%)
      44736600      armv8_pmuv3/l1d_cache_refill/                                     ( +-  2.33% )  (54.39%)
      44784430      armv8_pmuv3/l1d_cache_wb/                                     ( +-  2.11% )  (54.33%)
        294033      armv8_pmuv3/l1d_tlb_refill/                                     ( +- 13.82% )  (54.21%)
    1924752301      armv8_pmuv3/l1i_cache/                                        ( +-  2.37% )  (54.41%)
     120581610      armv8_pmuv3/l1i_cache_refill/                                     ( +-  2.41% )  (54.46%)
        761651      armv8_pmuv3/l1i_tlb_refill/                                     ( +-  4.87% )  (54.70%)
     215103404      armv8_pmuv3/l2d_cache/                                        ( +-  2.28% )  (54.69%)
      30884575      armv8_pmuv3/l2d_cache_refill/                                     ( +-  1.44% )  (54.83%)
      11424917      armv8_pmuv3/l2d_cache_wb/                                     ( +-  2.03% )  (54.76%)
     943041718      armv8_pmuv3/mem_access/                                       ( +-  2.41% )  (54.74%)

2904.940283006 seconds time elapsed                                          ( +-  0.07% )

I do not know whether these counters only count the cache activity of this task while it is running, or whether they also count while the task is sleeping. Does someone know? I also have other applications running; could they affect the values of the counters I specified in perf stat?

If that is not possible, is there a way to read the exact counter values for only the time the task spent running, for example with a kernel module or a custom user-space application?
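For reference, a minimal sketch of what a custom user-space reader could look like, using the perf_event_open(2) syscall with a task-scoped counter (a per-task counter only runs while the task is scheduled, so sleep time is not counted). The raw event number 0x03 for L1D_CACHE_REFILL is an assumption based on the ARMv8 PMU architectural event table; adjust it for the events you need:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x03;      /* assumed: ARMv8 L1D_CACHE_REFILL raw event */
    attr.disabled = 1;       /* start stopped, enable around the measured region */
    attr.inherit = 0;        /* do not count child tasks */

    /* pid = 0, cpu = -1: count this task on whatever CPU it runs on.
     * The kernel saves/restores the counter on context switch, so time
     * spent sleeping or running other tasks is not counted. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the real-time work you want to measure goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) {
        perror("read");
        return 1;
    }
    printf("l1d_cache_refill: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}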

Thank you!!

  • I think the PMU counters are Exception Level 1 (EL1) or above, so you need root privileges to do it. It seems reading anything interesting in ARM from any control-like register requires EL1 or above. You can't even read basic machine capabilities, like whether the machine has the optional CRC and Crypto instructions available. – jww Feb 22 '18 at 13:25
  • @jww: Linux `perf` already does all the perf-counter accesses in kernel mode. It virtualizes the HW counters so they're per-process when you use `perf stat` like that, without `-a` or some other system-wide option that would leave the counters counting while other tasks ran. `perf stat` used that way is *just* counting your process, not anything else the current CPU or other cores are doing. – Peter Cordes Feb 22 '18 at 20:25
  • @PeterCordes So, if I understood correctly, I am doing it right. However, if I execute it twice I get quite different values. Besides, if in one test I set a period of 50 ms and in another one 100 us, the number of l1d_cache, l1i_cache and l2d_cache events is much lower in the second. Since it is the same application, shouldn't the values be similar? – iall Feb 23 '18 at 06:39
  • I think you're doing it right, unless `perf` on ARM is very different from x86. Letting the test run for 50 ms should produce far more L1D accesses than in only 100 us, if the number of accesses per unit time is similar. It's the number of accesses, not the hit *rate*. You didn't say *what* this program is calculating the latency of. Memory accesses? Or something else? – Peter Cordes Feb 23 '18 at 06:55
  • @PeterCordes But I set the same number of iterations in both tests. Therefore, I understand that the time spent actually executing is the same in both tests; the only thing that changes is the sleeping time. Cyclictest is an rt-tests application [link](https://wiki.linuxfoundation.org/realtime/documentation/howto/tools/cyclictest) – iall Feb 23 '18 at 07:08
  • If caches get polluted by other work between test-cycles (or you get scheduled on a different core after wakeup), then shorter cycles would mean more misses / lower throughput / fewer total accesses in the same time. Is your system idle? Did you look at the cpu-migrations perf event? Did you try using `taskset -c 0 perf stat ...` to pin your microbenchmark to core 0? – Peter Cordes Feb 23 '18 at 07:19
  • It is not idle; I have hackbench running in both tests (a CPU stressor). I am quite sure that the 50 ms period test has more cache misses (I want to prove that). However, the cache references plus the misses should be more or less equal in both tests, shouldn't they? – iall Feb 23 '18 at 07:33

1 Answer


Every hardware performance monitoring unit is limited by its number of counters (channels): how many events can be counted simultaneously at any given moment. For example, many modern x86/x86_64 cores have 4 programmable counters plus 3 fixed counters per core. When you ask the profiler for more events than that, it multiplexes them (as VTune and PAPI do). When multiplexing is active and some event e1 was only measured for, say, 55% of the running time, perf stat (but not perf record?) extrapolates the count to the full running time ("C. Multiplexing"). This extrapolation introduces some error.
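For reference, a sketch (based on the perf_event_open(2) documentation, not part of this answer originally) of how that extrapolation works: when an event is read with PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING, the percentage perf stat prints next to each event, e.g. (54.50%), is time_running / time_enabled, and the count is scaled up by the inverse:

#include <stdint.h>

/* Layout returned by read() on a counter opened with
 * PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING. */
struct read_format {
    uint64_t value;         /* raw count while the event was on a hardware counter */
    uint64_t time_enabled;  /* ns the event was enabled */
    uint64_t time_running;  /* ns the event was actually counting */
};

/* Extrapolate a multiplexed count to the full enabled time, as perf stat
 * does (e.g. 54.5% running time => count scaled up by roughly 1/0.545). */
static uint64_t scaled_count(const struct read_format *rf)
{
    if (rf->time_running == 0)
        return 0;           /* the event never got scheduled onto a counter */
    return (uint64_t)((double)rf->value *
                      (double)rf->time_enabled / (double)rf->time_running);
}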

Your Cortex-A53 with PMUv3 has only six counters: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500d/BIIDADBH.html (PMEVCNTR0_EL0 - PMEVCNTR5_EL0 and PMEVTYPER0_EL0 - PMEVTYPER5_EL0). Try running perf stat with no more than 6 events per run of the test, to turn off event multiplexing:

perf stat -o result1.txt -r 10 -i \
    -e armv8_pmuv3/l1d_cache/  \
    -e armv8_pmuv3/l1d_cache_refill/  \
    -e armv8_pmuv3/l1d_cache_wb/  \
    -e armv8_pmuv3/l1d_tlb_refill/  \
    -e armv8_pmuv3/l1i_cache/  \
    -e armv8_pmuv3/l1i_cache_refill/  \
    cyclictest -l57600 -m -n -t1 -p80 -i50000 -h300 -q --histfile=666_data_50

perf stat -o result2.txt -r 10 -i \
    -e armv8_pmuv3/l1i_tlb_refill/  \
    -e armv8_pmuv3/l2d_cache/  \
    -e armv8_pmuv3/l2d_cache_refill/  \
    -e armv8_pmuv3/l2d_cache_wb/  \
    -e armv8_pmuv3/mem_access/  \
    cyclictest -l57600 -m -n -t1 -p80 -i50000 -h300 -q --histfile=666_data_50

You may also try grouping events into sets with -e '{event1,event2,...,event6}' (https://stackoverflow.com/a/48448876); each set is then multiplexed against the other sets.
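For example (a sketch only, reusing the event names from above; the braces are quoted so the shell does not try to expand them):

perf stat -o result3.txt -r 10 -i \
    -e '{armv8_pmuv3/l1d_cache/,armv8_pmuv3/l1d_cache_refill/,armv8_pmuv3/l1d_cache_wb/}' \
    -e '{armv8_pmuv3/l1i_cache/,armv8_pmuv3/l1i_cache_refill/}' \
    cyclictest -l57600 -m -n -t1 -p80 -i50000 -h300 -q --histfile=666_data_50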

osgx