I want to read certain performance counters. I know that tools like perf can do this from user space, but I want the code to live inside the Linux kernel.

I want to write a mechanism to monitor performance counters on an Intel(R) Core(TM) i7-3770 CPU. I am running kernel 4.19.2 on Ubuntu. I got the following method from easyperf.

Here's part of my code to read instructions.

  struct perf_event_attr pe;
  int fd;

  memset(&pe, 0, sizeof(struct perf_event_attr));
  pe.type = PERF_TYPE_HARDWARE;
  pe.size = sizeof(struct perf_event_attr);
  pe.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
  pe.disabled = 0;        /* start counting immediately */
  pe.exclude_kernel = 0;  /* count in kernel mode too */
  pe.exclude_user = 0;
  pe.exclude_hv = 0;
  pe.exclude_idle = 0;

  /* pid, cpu, grp and flags are set elsewhere (not shown) */
  fd = syscall(__NR_perf_event_open, &pe, pid, cpu, grp, flags);

  uint64_t perf_read(int fd) {
    uint64_t val;
    int rc;
    rc = read(fd, &val, sizeof(val));
    assert(rc == sizeof(val));
    return val;
  }

I want to put the same lines in the kernel code (in the context switch function) and check the values being read.

My end goal is to read the performance counters for a process every time it is switched out for another, from within the kernel (4.19.2) itself.

To achieve this I checked out the code behind the system call number __NR_perf_event_open. It can be found here. To make it usable, I copied that code into a separate function in the same file, named it perf_event_open(), and exported it.

Now the problem is that whenever I call perf_event_open() in the same way as above, the descriptor returned is -2. Checking the error codes, I found that this is ENOENT. The perf_event_open() man page describes the cause of this error as an invalid type setting.

Since file descriptors are associated with the process that opened them, how can one use them from the kernel? Is there an alternative way to configure the PMU to start counting without involving file descriptors?

  • You don't need inline asm; gcc has a `__builtin_rdpmc(int)`. But your inline asm looks correct, so that's not going to change anything. (And beware of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87550 : before gcc6.5 / 7.4 / 8.3, that builtin left out `volatile`.) – Peter Cordes Mar 11 '19 at 11:36
  • I don't think the error is due to gcc because of two reasons. First, I am getting no such error when I compile the kernel. Second, I am using the same gcc to compile both the codes(the user-space C program and the kernel). The result from 'gcc --version' : "gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609" – Nikhilesh Singh Mar 11 '19 at 17:05
  • You have to show the whole code you're using in user mode and kernel mode. I suspect that the code you're using in user mode enables the instructions retired fixed function counter, but the code you're using in kernel mode doesn't. – Hadi Brais Mar 11 '19 at 18:20
  • You might find it helpful to look at how it is done in [NanoBench](https://github.com/andreas-abel/nanoBench). – Andreas Abel Mar 12 '19 at 00:59
  • @HadiBrais I have added the code that I am using to give a better insight. – Nikhilesh Singh Mar 12 '19 at 05:21
  • You're still only showing the code that uses `rdpmc`. You haven't shown any code that programs the PMU. You linked https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214 in a comment on my answer, but you still haven't even *mentioned* in your question doing anything to make the counter count anything. Just that you got the `rdpmc` code itself from there. – Peter Cordes Mar 12 '19 at 05:24
  • According to the man page of rdpmc, setting particular values in the ecx register programs the PMU to count a corresponding event and give the output in the eax and edx registers. That is what the given code does in the line c = (1<<30); a change in this value can reprogram it to count some other event. https://www.felixcloutier.com/x86/rdpmc – Nikhilesh Singh Mar 12 '19 at 05:44
  • You have to first enable or program the counter that you want to read using `rdpmc`. Even your user mode code doesn't work; it will just print zero. The reason that you think it's working is because you're using `%ld` format to print a `double` value, which basically reinterprets zero into a big integer. The correct code is the one from the comment posted on "Thu, 11/17/2016 - 17:41" by Kumar C on the Intel forum. – Hadi Brais Mar 12 '19 at 05:51
  • @HadiBrais Thanks for the insight. I will go through these details and get back. – Nikhilesh Singh Mar 12 '19 at 06:03
  • @HadiBrais: it's not reinterpreting `0`, it's looking at an integer register instead of xmm0 and getting some non-zero bit-pattern. – Peter Cordes Mar 12 '19 at 09:53
  • @HadiBrais I checked out the code that you suggested. I have trouble replicating it in the kernel space because of the line ioctl (fd, PERF_EVENT_IOC_RESET, 0); It is taking the file descriptor returned by perf_event_open and resetting the macro. This is seemingly unavailable in the kernel space – Nikhilesh Singh Mar 12 '19 at 15:44
  • @HadiBrais I have edited the question to make it more precise. Thank you for your insights. – Nikhilesh Singh Mar 12 '19 at 16:51
  • The whole point of `ioctl` is to enable user code to call custom system calls provided by kernel modules or device drivers. If you are already in kernel mode, then you can just directly call whatever function you want to call. You only need to include the required kernel header files. – Hadi Brais Mar 12 '19 at 18:36
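
The format-string problem raised in the comments above (a `double` printed with `%ld`) is easy to avoid in isolation. A minimal sketch, assuming the counter value is held in a `uint64_t`; the helper name `format_count` is hypothetical and not part of the original code:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a 64-bit counter value using the conversion
 * specifier that matches its type. Passing a double to "%ld" is undefined
 * behavior, so the printed number says nothing about the counter. */
static inline int format_count(char *buf, size_t len, uint64_t count)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%" PRIu64, count);
}
```

Keeping the counter as an integer type from the read to the print avoids the mismatch entirely.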

1 Answer

You probably don't want the overhead of reprogramming a counter inside the context-switch function.

The easiest thing would be to make system calls from user-space to program the PMU (to count some event, probably setting it to count in kernel mode but not user-space, just so the counter overflows less often).
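
A minimal sketch of that user-space setup, assuming the `perf_event_open()` interface is used to program the counter. The helper names `init_kernel_only_attr` and `open_kernel_only_counter` are hypothetical, and opening a system-wide event (`pid = -1`) typically needs root or a permissive `kernel.perf_event_paranoid` setting:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill attr so the event counts retired instructions in ring 0 only. */
static inline void init_kernel_only_attr(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_INSTRUCTIONS;
    attr->exclude_user = 1;   /* do not count in user space */
    attr->exclude_kernel = 0; /* do count in kernel mode */
    attr->exclude_hv = 1;     /* do not count in a hypervisor */
}

/* Open the event for every task on one CPU (pid = -1, cpu >= 0).
 * Returns a file descriptor, or -1 with errno set. */
static inline int open_kernel_only_counter(int cpu)
{
    struct perf_event_attr attr;

    init_kernel_only_attr(&attr);
    return (int)syscall(SYS_perf_event_open, &attr, (pid_t)-1, cpu, -1, 0UL);
}
```

Which hardware counter the kernel assigns to the event is not fixed, so the index has to be discovered before `rdpmc` can read it; one way is the `index` field of the event's mmap'ed `perf_event_mmap_page`.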

Then just use rdpmc twice (to get start/stop counts) in your custom kernel code. The counter will stay running, and I guess the kernel perf code will handle interrupts when it wraps around. (Or when its PEBS buffer is full.)
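
The start/stop idea can be sketched as follows. This is an illustration under assumptions, not tested kernel code: `read_pmc` and `pmc_delta` are hypothetical names, the counter index must match whatever the PMU was actually programmed with, and the counter width is assumed to be 48 bits (CPUID leaf 0AH reports the real value).

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Read performance counter `index` with RDPMC. The instruction is allowed
 * in ring 0; in user space it faults unless the kernel has enabled it. */
static inline uint64_t read_pmc(uint32_t index)
{
    uint32_t lo, hi;

    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Difference between two raw reads of a counter that is `width` bits wide.
 * The masking tolerates one wraparound between the two reads. */
static inline uint64_t pmc_delta(uint64_t start, uint64_t stop, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);

    return (stop - start) & mask;
}
```

In the context-switch path this would be one `read_pmc()` when a task is switched in, one when it is switched out, and `pmc_delta(start, stop, 48)` to get the count for that time slice.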

IDK if it's possible to program a counter so it just wraps without interrupting, for use-cases like this where you don't care about totals or sample-based profiling, and just want to use rdpmc. If so, do that.


Old answer, addressing your old question which was based on a buggy printf format string that was printing non-zero garbage even though you weren't counting anything in user-space either.

Your inline asm looks correct, so the question is what exactly that PMU counter is programmed to count in kernel mode in the context where your code runs.

perf virtualizes the PMU counters on context-switch, giving the illusion of perf stat counting a single process even when it migrates across CPUs. Unless you're using perf stat -a to get system-wide counts, the PMU might not be programmed to count anything, so multiple reads would all give 0 even if at other times it's programmed to count a fast-changing event like cycles or instructions.


Are you sure you have perf set to count user + kernel events, not just user-space events?

perf stat will show something like instructions:u instead of instructions if it's limiting itself to user-space. (This is the default for non-root if you haven't lowered sysctl kernel.perf_event_paranoid to 0 or something from the safe default that doesn't let user-space learn anything about the kernel.)

There's HW support for programming a counter to only count when CPL != 0 (i.e. not in ring 0 / kernel mode). Higher values for kernel.perf_event_paranoid restrict the perf API to not allow programming counters to count in kernel+user mode, but even with paranoid = -1 it's possible to program them this way. If that's how you programmed a counter, then that would explain everything.
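
To make the hardware side concrete, here is a sketch of how the privilege-level filter is encoded in an Intel `IA32_PERFEVTSELx` register: bit 16 (USR) counts when CPL != 0, bit 17 (OS) counts when CPL == 0, and bit 22 (EN) enables the counter. The function name `evtsel_encode` is hypothetical; the example event is the architectural "instructions retired" event (event 0xC0, umask 0x00).

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in IA32_PERFEVTSELx (architectural performance monitoring). */
#define EVTSEL_USR (UINT64_C(1) << 16) /* count when CPL != 0 */
#define EVTSEL_OS  (UINT64_C(1) << 17) /* count when CPL == 0 */
#define EVTSEL_EN  (UINT64_C(1) << 22) /* enable the counter */

/* Build an event-select value: event code in bits 7:0, unit mask in 15:8. */
static inline uint64_t evtsel_encode(uint8_t event, uint8_t umask,
                                     int count_user, int count_kernel)
{
    /* With neither flag set the counter would never increment. */
    assert(count_user || count_kernel);

    uint64_t value = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;

    if (count_user)
        value |= EVTSEL_USR;
    if (count_kernel)
        value |= EVTSEL_OS;
    return value;
}
```

For instructions retired, user+kernel gives `0x4300C0`, user-only gives `0x4100C0`, and kernel-only gives `0x4200C0`. A counter programmed with the user-only value reads as unchanged from code running in ring 0.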

We need to see your code that programs the counters. That doesn't happen automatically.

The kernel doesn't just leave the counters running all the time when no process has used a PAPI function to enable a per-process or system-wide counter; that would generate interrupts that slow the system down for no benefit.

Peter Cordes
  • Currently, I have kernel.perf_event_paranoid set to -1. Does this affect reading counters from the kernel? I thought it was just a way to allow non-root users to use perf. I'll just check it and update in a while. – Nikhilesh Singh Mar 11 '19 at 17:13
  • @NikhileshSingh: It affects how the kernel programs the PMU. Your code for reading the PMU is correct, so the question is what that counter index is programmed to count at the time when you're running `rdpmc` in kernel mode. – Peter Cordes Mar 11 '19 at 17:15
  • @NikhileshSingh: how are you programming the PMU counters in the first place to get them to be counting anything when your code runs? Are you using `perf -a` in user-space? Are you using `papi` system calls? Read the 2nd paragraph of my answer, and update your question with the info Hadi asked for. – Peter Cordes Mar 12 '19 at 04:19
  • I tested with kernel.perf_event_paranoid at all the values, (1,2 and 3), still I am getting 0 as output. – Nikhilesh Singh Mar 12 '19 at 04:40
  • @NikhileshSingh: you already said that. Don't spam me by reposting the same comment. But what *else* are you doing to program the PMU to count something? Are you running `perf stat -a`? If not, then the kernel won't have the HW performance counters counting anything (because that would create extra interrupts when they overflow). Just setting `perf_event_paranoid` *allows* user-space to ask the kernel to have counting enabled in kernel mode, but it doesn't actually do it unless you make a system call. (Or run `perf` to make a system call.) – Peter Cordes Mar 12 '19 at 04:44
  • I am using the same piece of code with the rdpmc instruction from user space as well. I am not using perf or PAPI. This thread describes the mechanism https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214 – Nikhilesh Singh Mar 12 '19 at 04:44
  • @NikhileshSingh: Then edit your question with what exactly you're doing to make a MCVE, and ping Hadi about it. It's almost certain that you got something wrong. e.g. maybe you only have it programmed to count events in user-space, not in ring 0. – Peter Cordes Mar 12 '19 at 04:46
  • I have edited the question and tried to make a bit more descriptive, by adding the actual code. – Nikhilesh Singh Mar 12 '19 at 05:20
  • @PeterCordes: There's hardware support for not counting while in kernel. Specifically, each counter's config has a pair of flags (one for "enable counting CPL=0" and another for "enable counting CPL !=0"), plus possibly additional flags (e.g. "from all logical CPUs in core or from one logical CPU in the core and which one", and "bogus/not bogus", etc) depending on the type of event being used by malicious attackers for timing side-channels... ;-) – Brendan Mar 12 '19 at 08:41
  • @Brendan how do we toggle this feature for a counter? Can you provide some resources? – Nikhilesh Singh Mar 12 '19 at 15:47
  • @PeterCordes edited the question to make it more precise. Thanks for your insights. – Nikhilesh Singh Mar 12 '19 at 16:52
  • @NikhileshSingh: I only know about the low level hardware and don't know about the abstract/portable interface Linux provides (that you're using, at least when you're not violating the contract by using `rdpmc` directly); but it only took me 60 seconds to find out that the `perf_event_attr` structure passed to `perf_event_open()` includes an `exclude_user` flag and an `exclude_kernel` flag. – Brendan Mar 12 '19 at 18:47
  • Thanks @Brendan As per the man pages, exclude_user and exclude_kernel flags, if set, then the count excludes events that happen in user space or kernel space respectively. This should not necessarily mean that one can't count from user/kernel space. – Nikhilesh Singh Mar 13 '19 at 04:52
  • @NikhileshSingh: if you want RDPMC to work in kernel space, make sure your programming of the PMU is done without `exclude_kernel`, or the equivalent for however you actually are programming the PMU (since you said you aren't using `perf`). If you *did* set that flag, the kernel would program the PMU in a way that didn't count in kernel mode. This seems obvious. You probably don't want the overhead of reprogramming a counter inside the context-switch function, just program the counters (e.g. from user-space), then use `rdpmc` in your custom kernel code. – Peter Cordes Mar 13 '19 at 05:18
  • @PeterCordes: The hardware feature is "global" (and inaccessible from user-space) and the Linux abstraction provides "per-process counters" by switching/reprogramming the counters during context switches; and most of the low level hardware is "CPU model specific" (including the fixed number of counters supported by the CPU). To make it possible for kernel to (safely/portably) use PMCs would involve significant modification of pre-existing code (e.g. adding support for reserving counters as "kernel use only" during boot and taking that into account everywhere including in context switch code). – Brendan Mar 14 '19 at 11:06
  • Mostly what I'm saying is that (to avoid breaking everything) it's not "just do these 3 simple things" and it is "read and understand all the existing code and modify a relatively large amount of it". – Brendan Mar 14 '19 at 11:11
  • @Brendan: The simple way, if you control the *whole* system like the OP here, is to run `perf stat -a` (from userspace) or something to do system-wide profiling. I *think* that will just leave some counters programmed at all times. Find the right counter to read, e.g. by experiment and use it. Yes in the general case, you need to get `perf` / PAPI to *not* touch some or all PMCs, like you would if you were using one of the alternative PMU libraries such as https://github.com/obilaniu/libpfc. But usually just not using `perf` at the same time as you're hacking around with the PMU is fine. – Peter Cordes Mar 14 '19 at 11:17
  • @PeterCordes: In that case (temporary unsafe and non-portable hack for one specific set of circumstances to get information that is likely useless because it doesn't apply to other CPUs in other circumstances); it'd be much easier and equally ineffective to just use user-space tools in the first place. – Brendan Mar 14 '19 at 11:27
  • @Brendan: I think (hope?) the OP just wants to use RDPMC similarly to RDTSC, to take start/end count deltas within one function. (i.e. to `perf stat` individual runs of a tiny block). I don't think you can program a PMU to only count events when RIP is in a certain range, so you need a `rdpmc` or other signpost, right? RDPMC has lower overhead than RDTSC, last I read, so with an event like `cycles` it's maybe a good choice for timing a short block of code. (modulo OoO exec / `lfence`). I haven't tried it; I've always just used repeat loops in a static executable + `perf stat` :P – Peter Cordes Mar 14 '19 at 11:32
  • @PeterCordes: I have no idea why OP wants this - I assumed (hoped?) it was some kind of Spectre or Rowhammer mitigation (e.g. kernel monitoring cache misses or branch mispredictions and taking evasive action if a threshold is exceeded). Performance measurements to improve software is a huge waste to make performance worse - there's plenty of examples of this idiocy (e.g. people adding "branch hints" into Linux only to find that several years later they're all wrong and hurt performance on newer CPUs, people avoiding shifts on Netburst/Willamette, people thinking "rep movsb" is slow, ..). – Brendan Mar 14 '19 at 11:40
  • @Brendan: Yeah, I usually only use microbenchmarks to learn more about the microarchitecture (perf counting a test case I created to test something, although I will microbench a long-running loop I'm tuning), not for trying to profile tiny parts of large systems. Microbenchmarking is hard. Usefully using the results is even harder! – Peter Cordes Mar 14 '19 at 13:01
  • @PeterCordes I have updated the question based on some readings that I did. – Nikhilesh Singh Mar 27 '19 at 09:42