Using perf_event with the ARM PMU inside gem5

Question

I know from the gem5 source code and some publications that the ARM PMU is partially implemented.

I have a binary which uses perf_event to access the PMU on a Linux-based OS, on an ARM processor. Can it use perf_event inside a gem5 full-system simulation with a Linux kernel, under the ARM ISA?

So far, I haven't found the right way to do it. If someone knows, I would be very grateful!

Hi Pierre, are you hitting gem5-specific problems or is it a more general question on how to use them? If possible, try to first deal with the more general question on a more stable/fast simulator like QEMU/real hardware, and only then come to gem5 (which is slower and generally buggier). Not sure QEMU supports it though, but worth checking. — Ciro Santilli, Sep 21 '20 at 11:16
Hi, thanks for your answer. Maybe I should have been more precise. Currently, I have a binary (developed by myself) which uses perf_event on real ARM hardware, to get cache misses and mispredicted branches, and it works well. My perf_event_attr.type is configured with PERF_TYPE_HARDWARE and the .config field with PERF_COUNT_HW_CACHE_MISSES and another with PERF_COUNT_HW_BRANCH_MISSES. However, when i put this binary on a gem5 fs simulation, configured with the DerivO3CPU, ArmSystem, and RealView platform, I got the following error "ENOENT (2): No such file or directory" — Pierre Ayoub, Sep 21 '20 at 11:26
I don't know if we can access the PMU through perf_event into gem5. If so, maybe we have to use RAW events? In the gem5 example code under configs, I have found a snippet in devices.py which "Instantiates 1 ArmPMU per PE" (addPMUs()). However, after few tries, I don't understand how to use this and how it is related to perf_event. — Pierre Ayoub, Sep 21 '20 at 11:28
OK, thanks for clarifying. So, based on the ENOENT, the `perf_event` file is not being created by the Linux kernel, is that it? I'll ask around. — Ciro Santilli, Sep 21 '20 at 14:21
The `perf_event` file... descriptor, you mean? Yes, it is **not** created by the kernel (equal to `-1`). To be precise, this error occurs on return from the `perf_event_open()` _syscall_. Finally, this error is documented in the `perf_event_open.2` _manpage_, and also discussed [here](https://stackoverflow.com/questions/47829826/perf-event-open-can-not-open-more-than-7-fd) — Pierre Ayoub, Sep 21 '20 at 14:27
Just a temporary reply hopefully until I get better replies, could you try to just patch the scripts to call `addPMUs`/do exactly what it does on CPUs? And pass a PPI interrupt (<31 and free according to the RealView.py file interrupt map). The `events` argument is optional, and all architectural events should be available once `addArchEvents` gets called. — Ciro Santilli, Sep 21 '20 at 16:26
@Ciro I used code similar to `addPMUs()` in `devices.py`, with interrupt numbers 20, 21, 22, and 23 (one per core) according to the _RealView_ interrupt mapping, with the `ArmPPI` class. However, `perf_event_open()` still returns the same error. Note that I got this message during boot, from `src/arch/arm/pmu.cc:293`: `warn: Not doing anything for write to miscreg pmuserenr_el0`. This register is documented in the _ARMv8-A architecture manual_. Do you know if `perf_event` is supposed to be initialized with `PERF_TYPE_HARDWARE` or `PERF_TYPE_RAW`, to be used with _gem5_? — Pierre Ayoub, Sep 22 '20 at 11:32
@Ciro With `--debug-flags=PMUVerbose`, I get the following: `0: system.cpu_cluster.cpus0.isa.pmu: Initializing the PMU.` [...] `0: system.cpu_cluster.cpus0.isa.pmu: PMU: Adding Probe Driven event with id '0x2'as probe system.cpu_cluster.cpus0.itb:Refills` [...] `8687351673751: system.cpu_cluster.cpus0.isa.pmu: Assigning PMU to ContextID 0.` [...] `8687351673751: system.cpu_cluster.cpus0.isa.pmu: updateCounter(31): Disabling counter` [...] — Pierre Ayoub, Sep 22 '20 at 12:26
OK. I'm afraid I don't know much about the PMU events and how they are exposed to Linux :-( I'll let you know if anyone replies to me, and if you manage to progress, do make an answer. It would also be amazing if you could share a minimal C program for reproduction, even though it is supposedly not hard to find one online. — Ciro Santilli, Sep 22 '20 at 20:51
Thanks for your help anyway, I appreciate. I will try to ask on the _gem5_ mailing list. If I find something that works, I will post it here for sure. ;) — Pierre Ayoub, Sep 22 '20 at 22:14

score 2 · Accepted Answer · edited Nov 18 '20 at 14:41

Context

I was not able to use the Performance Monitoring Unit (PMU) because of a feature that gem5 did not implement. The reference on the mailing list can be found here. After a personal patch, the PMU is accessible through perf_event. Fortunately, a similar patch will be included in an upcoming official gem5 release; it can be seen here. The patch is described in another answer, because of the limit on the number of links in a single message.

How to use the PMU

C source code

This is a minimal working example in C that uses perf_event to count the number of branches mispredicted by the branch predictor unit during a specific task:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(int argc, char **argv) {
    /* File descriptor used to read mispredicted branches counter. */
    static int perf_fd_branch_miss;
    
    /* Initialize our perf_event_attr, representing one counter to be read. */
    static struct perf_event_attr attr_branch_miss;
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;
    /* On a real system, the generic hardware event can be used: */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
    /* On a gem5 system, the raw architectural event ID must be used
       instead (these two assignments override the two above): */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;
    
    /* Open the file descriptor corresponding to this counter. The counter
       should start at this moment. */
    if ((perf_fd_branch_miss = syscall(__NR_perf_event_open, &attr_branch_miss, 0, -1, -1, 0)) == -1) {
        fprintf(stderr, "perf_event_open fail %d %d: %s\n", perf_fd_branch_miss, errno, strerror(errno));
        /* Without a valid file descriptor there is nothing to read. */
        return 1;
    }
    
    /* Workload here, that means our specific task to profile. */

    /* Get and close the performance counters. */
    uint64_t counter_branch_miss = 0;
    read(perf_fd_branch_miss, &counter_branch_miss, sizeof(counter_branch_miss));
    close(perf_fd_branch_miss);

    /* Display the result. */
    /* The counter is a uint64_t, so "%d" would be the wrong format. */
    printf("Number of mispredicted branches: %llu\n", (unsigned long long)counter_branch_miss);
}

I will not go into the details of how to use perf_event; good resources are available here, here, here, and here. However, just a few notes about the code above:

On real hardware, when using perf_event with common events (events that are available on many architectures), it is recommended to use the perf_event macro PERF_TYPE_HARDWARE as the type, together with macros like PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is the best practice for portable code.
On a gem5 simulated system, currently (v20.0), C code has to use the PERF_TYPE_RAW type and the architectural event ID to identify an event. Here, 0x10 is the ID of the event "0x0010, BR_MIS_PRED, Mispredicted or not predicted branch", described in the ARMv8-A Reference Manual (here). The manual describes all events available in real hardware, but not all of them are implemented in gem5. For the list of events implemented in gem5, refer to the src/arch/arm/ArmPMU.py file, where the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) declares the counter described in the manual. Having to use raw events is not the expected behavior, so gem5 should eventually be patched to allow PERF_TYPE_HARDWARE.
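To make the raw-event note concrete, here is a small Python sketch of a lookup table from architectural event names to the value that goes into perf_event_attr.config with PERF_TYPE_RAW. The event numbers are the ones from the ARMv8-A PMUv3 common event list; the names ARMV8_PMUV3_EVENTS and raw_config are made up for this illustration, and whether gem5 actually implements a given event still has to be checked in src/arch/arm/ArmPMU.py.

```python
# Architectural event numbers from the ARMv8-A PMUv3 common event list.
# This table says nothing about which events gem5 implements: check
# src/arch/arm/ArmPMU.py for that.
ARMV8_PMUV3_EVENTS = {
    "SW_INCR": 0x00,
    "L1I_TLB_REFILL": 0x02,
    "L1D_TLB_REFILL": 0x05,
    "INST_RETIRED": 0x08,
    "BR_MIS_PRED": 0x10,
    "CPU_CYCLES": 0x11,
    "BR_PRED": 0x12,
}


def raw_config(event_name):
    """Return the perf_event_attr.config value to use with PERF_TYPE_RAW."""
    try:
        return ARMV8_PMUV3_EVENTS[event_name]
    except KeyError:
        raise ValueError("unknown event: %s" % event_name)


if __name__ == "__main__":
    for name in ("BR_MIS_PRED", "CPU_CYCLES"):
        print("%-12s config = 0x%02x" % (name, raw_config(name)))
```

In the C example above, raw_config("BR_MIS_PRED") corresponds to the 0x10 assigned to attr_branch_miss.config.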

gem5 simulation script

This is not a complete MWE script (it would be too long!), only the portion that must be added to a full-system script to use the PMU. We use an ArmSystem as the system, with the RealView platform.

For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with an interrupt number and the already implemented architectural events. An example of such a function can be found in configs/example/arm/devices.py.

To choose an interrupt number, pick a free PPI interrupt in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPI interrupts are local to each Processing Element (PE, which corresponds to a core in our context), the interrupt number can be the same for all PEs without any conflict. To know more about PPI interrupts, see the GIC guide from ARM here.

Here, we can see that interrupt n°20 is not used by the system (from RealView.py):

Interrupts:
      0- 15: Software generated interrupts (SGIs)
     16- 31: On-chip private peripherals (PPIs)
        25   : vgic
        26   : generic_timer (hyp)
        27   : generic_timer (virt)
        28   : Reserved (Legacy FIQ)

We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them, so that the PMU uses the internal counters (called probes) of these components as the counters exposed to the system.

for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the architectural events implemented in gem5. We can
        # discover which events are implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))

Pierre Ayoub · Answer 2 · 2020-11-12T06:35:06.047

As of September 2020, gem5 needs to be patched in order to use the ARM PMU.

Edit: As of November 2020, gem5 has been patched upstream, and the fix will be included in the next release. Thanks to the developers!

How to patch gem5

This is not a clean patch (it is very straightforward), and it is mainly intended to show how the fix works. Nonetheless, this is the patch, to be applied with git apply from the gem5 source repository:

diff --git i/src/arch/arm/ArmISA.py w/src/arch/arm/ArmISA.py
index 2641ec3fb..3d85c1b75 100644
--- i/src/arch/arm/ArmISA.py
+++ w/src/arch/arm/ArmISA.py
@@ -36,6 +36,7 @@
 from m5.params import *
 from m5.proxy import *

+from m5.SimObject import SimObject
 from m5.objects.ArmPMU import ArmPMU
 from m5.objects.ArmSystem import SveVectorLength
 from m5.objects.BaseISA import BaseISA
@@ -49,6 +50,8 @@ class ArmISA(BaseISA):
     cxx_class = 'ArmISA::ISA'
     cxx_header = "arch/arm/isa.hh"

+    generateDeviceTree = SimObject.recurseDeviceTree
+
     system = Param.System(Parent.any, "System this ISA object belongs to")

     pmu = Param.ArmPMU(NULL, "Performance Monitoring Unit")
diff --git i/src/arch/arm/ArmPMU.py w/src/arch/arm/ArmPMU.py
index 047e908b3..58553fbf9 100644
--- i/src/arch/arm/ArmPMU.py
+++ w/src/arch/arm/ArmPMU.py
@@ -40,6 +40,7 @@ from m5.params import *
 from m5.params import isNullPointer
 from m5.proxy import *
 from m5.objects.Gic import ArmInterruptPin
+from m5.util.fdthelper import *

 class ProbeEvent(object):
     def __init__(self, pmu, _eventId, obj, *listOfNames):
@@ -76,6 +77,17 @@ class ArmPMU(SimObject):

     _events = None

+    def generateDeviceTree(self, state):
+        node = FdtNode("pmu")
+        node.appendCompatible("arm,armv8-pmuv3")
+        # gem5 uses GIC controller interrupt notation, where PPI interrupts
+        # start to 16. However, the Linux kernel start from 0, and used a tag
+        # (set to 1) to indicate the PPI interrupt type.
+        node.append(FdtPropertyWords("interrupts", [
+            1, int(self.interrupt.num) - 16, 0xf04
+        ]))
+        yield node
+
     def addEvent(self, newObject):
         if not (isinstance(newObject, ProbeEvent)
             or isinstance(newObject, SoftwareIncrement)):
diff --git i/src/cpu/BaseCPU.py w/src/cpu/BaseCPU.py
index ab70d1d7f..66a49a038 100644
--- i/src/cpu/BaseCPU.py
+++ w/src/cpu/BaseCPU.py
@@ -302,6 +302,11 @@ class BaseCPU(ClockedObject):
             node.appendPhandle(phandle_key)
             cpus_node.append(node)

+        # Generate nodes from the BaseCPU children (and don't add them as
+        # subnode). Please note: this is mainly needed for the ISA class.
+        for child_node in self.recurseDeviceTree(state):
+            yield child_node
+
         yield cpus_node

     def __init__(self, **kwargs):

What the patch resolves

The Linux kernel uses a Device Tree Blob (DTB), which is a regular file, to describe the hardware on which the kernel is running. This makes the kernel portable across different hardware without a recompilation for each hardware change. The DTB follows the Device Tree Reference, and is compiled from a Device Tree Source (DTS) file, a regular text file. You can learn more here and here.

The PMU is supposed to be declared to the Linux kernel via the DTB. You can learn more here and here. In a simulated system, because the system is specified by the user, gem5 has to generate a DTB itself and pass it to the kernel, so that the kernel can recognize the simulated hardware. The problem was that gem5 did not generate a DTB entry for our PMU.

What the patch does

The patch adds an entry to the ISA and the CPU files so that the DTB generation recurses down the object hierarchy until it finds the PMU. The hierarchy is the following: CPU => ISA => PMU. Then, it adds a generation function to the PMU, which emits a single DTB entry declaring the PMU, with the interrupt notation that the kernel expects.

After running a simulation with our patch, we can inspect the DTS decompiled from the DTB like this:
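The recursion can be illustrated with a toy model in plain Python. This is only a sketch: the classes Node, Pmu, Isa and Cpu below are hypothetical stand-ins and not gem5's SimObject classes. What it borrows from the patch is the idea that an object without its own generator stops the walk, while the ISA forwards the walk to its children and the CPU yields its children's nodes before its own.

```python
class Node:
    """Hypothetical stand-in for a SimObject; not a gem5 class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def generate_device_tree(self):
        # Default behaviour: this object contributes no device tree node.
        return iter(())

    def recurse_device_tree(self):
        # Collect the nodes generated by every child.
        for child in self.children:
            yield from child.generate_device_tree()


class Pmu(Node):
    def generate_device_tree(self):
        yield {"name": "pmu", "compatible": "arm,armv8-pmuv3"}


class Isa(Node):
    # Mirrors `generateDeviceTree = SimObject.recurseDeviceTree` in the patch.
    generate_device_tree = Node.recurse_device_tree


class Cpu(Node):
    def generate_device_tree(self):
        # Children first (not as subnodes), then the CPU's own node.
        yield from self.recurse_device_tree()
        yield {"name": "cpus"}


cpu = Cpu("cpu0", [Isa("isa", [Pmu("pmu")])])
nodes = [node["name"] for node in cpu.generate_device_tree()]
print(nodes)
```

If Isa is replaced by a plain Node, the walk stops at the ISA and the pmu entry never appears, which is the situation before the patch.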

cd m5out
# Decompile the DTB to get the DTS.
dtc -I dtb -O dts system.dtb > system.dts
# Show the beginning of the DTS, which contains the PMU entry.
head -n 30 system.dts

dtc is the Device Tree Compiler, which can be installed with sudo apt-get install device-tree-compiler. We end up with this pmu DTB entry, under the root node (/):

/dts-v1/;

/ {
    #address-cells = <0x02>;
    #size-cells = <0x02>;
    interrupt-parent = <0x05>;
    compatible = "arm,vexpress";
    model = "V2P-CA15";
    arm,hbi = <0x00>;
    arm,vexpress,site = <0x0f>;

    memory@80000000 {
        device_type = "memory";
        reg = <0x00 0x80000000 0x01 0x00>;
    };

    pmu {
        compatible = "arm,armv8-pmuv3";
        interrupts = <0x01 0x04 0xf04>;
    };

    cpus {
        #address-cells = <0x01>;
        #size-cells = <0x00>;

        cpu@0 {
            device_type = "cpu";
            compatible = "gem5,arm-cpu";

[...]

In the line interrupts = <0x01 0x04 0xf04>;, 0x01 indicates that the number 0x04 is the number of a PPI interrupt (the one declared with number 20 in gem5; the difference of 16 is explained inside the patch code). The 0xf04 combines a flag (0x4, in the low bits) indicating that it is an "active high level-sensitive" interrupt and a bit mask (0xf, in bits 8 to 15, one bit per PE) indicating which PEs attached to the GIC the interrupt is wired to. You can learn more here.
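The arithmetic behind these three cells can be written as a short Python sketch. The function names ppi_to_cells and cells_to_ppi are made up for the illustration; the layout (type cell, number minus 16, then PE mask in bits 8 to 15 and trigger type in the low 4 bits) is the one described above and used by the patch.

```python
GIC_PPI = 1                 # first cell: 1 marks a PPI
IRQ_TYPE_LEVEL_HIGH = 0x4   # active high, level-sensitive


def ppi_to_cells(gic_num, cpu_mask=0xF, trigger=IRQ_TYPE_LEVEL_HIGH):
    """Convert a gem5/GIC PPI number (16 to 31) to device tree cells."""
    if not 16 <= gic_num <= 31:
        raise ValueError("PPIs use GIC interrupt IDs 16 to 31")
    return [GIC_PPI, gic_num - 16, (cpu_mask << 8) | trigger]


def cells_to_ppi(cells):
    """Decode device tree cells back into the GIC number, mask and trigger."""
    kind, number, flags = cells
    if kind != GIC_PPI:
        raise ValueError("not a PPI specifier")
    return {
        "gic_num": number + 16,
        "cpu_mask": (flags >> 8) & 0xFF,
        "trigger": flags & 0xF,
    }


if __name__ == "__main__":
    print([hex(cell) for cell in ppi_to_cells(20)])
    print(cells_to_ppi([0x01, 0x04, 0xF04]))
```

With the PPI number 20 chosen earlier, ppi_to_cells(20) gives the same three values as the interrupts property in the decompiled DTS.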

If the patch works and your ArmPMU is declared properly, you should see this message at boot time:

  [    0.239967] hw perfevents: enabled with armv8_pmuv3 PMU driver, 32 counters available
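To check for this message automatically, for example in a batch of simulations, a small Python sketch like the one below can scan the captured boot log. The function name find_pmu_line is made up for the illustration, and where the log text comes from is left to you (reading it from the simulated terminal output saved by gem5 is an assumption about your setup).

```python
import re

# Matches the kernel message shown above and captures the driver name
# and the number of counters.
PMU_LINE = re.compile(
    r"hw perfevents: enabled with (\S+) PMU driver, (\d+) counters available")


def find_pmu_line(log_text):
    """Return (driver, counters) if the PMU message is present, else None."""
    match = PMU_LINE.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


sample = ("[    0.239967] hw perfevents: enabled with armv8_pmuv3 "
          "PMU driver, 32 counters available")
print(find_pmu_line(sample))
```

A result of None means the kernel did not register the PMU, which usually points back to a missing pmu node in the DTB.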

Thank you very much for the valuable information, and thanks to the gem5 developers. When will the patched version be released? I have the stable version (v20.1.0.0 / October 1, 2020), but I ran into the same problem and have not managed to solve it so far, i.e., I can't access HPCs in simulated ARM in gem5. — husin alhaj ahmade, Nov 24 '20 at 17:47

score 0 · Answer 3 · answered Nov 29 '20 at 08:29

Two quick additions to Pierre's awesome answers:

for fs.py as of gem5 937241101fae2cd0755c43c33bab2537b47596a2, all that is missing is to apply the following change to fs.py, as shown at: https://gem5-review.googlesource.com/c/public/gem5/+/37978/1/configs/example/fs.py

for cpu in test_sys.cpu:
    if buildEnv['TARGET_ISA'] == "arm":
        for isa in cpu.isa:
            isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
            isa.pmu.addArchEvents(
                cpu=cpu, dtb=cpu.mmu.dtb, itb=cpu.mmu.itb,
                icache=getattr(cpu, "icache", None),
                dcache=getattr(cpu, "dcache", None),
                l2cache=getattr(test_sys, "l2", None))

a C example can also be found in man perf_event_open