3

I'm developing low level routines for binary search in C and x64 assembly, and trying to measure the exact execution time for searches of uncached arrays (data in RAM). Searching the same array for different targets takes widely varying amounts of time depending on how "lucky" the branch prediction is. I can accurately measure the minimum and median execution times, but I'm finding it difficult to measure the maximum execution time.

The problem is that the worst-case scenario for branch prediction is comparable in time to an average-case scenario plus a processor interrupt. Both the worst-case scenario and the interrupt are rare, but I haven't come up with a good way to distinguish one rare event from the other. The standard approach is simply to filter out all "anomalously" high measurements, but that only works if there is a clear line between the two.

So the question becomes, "How can I distinguish between a measurement that was interrupted and one that legitimately took much longer than the rest?"

Or more generally, "How do I measure the full distribution of execution times without assuming in advance a hard maximum?"

Does the kernel store any information I could query about whether an interrupt has occurred? Something that I could query before and after the measurement that would tell me if the measurement was interrupted? Ideally it would tell me how long in cycles the interruption took, but just knowing that the measurement is affected would be a great start.

Maybe in addition to (or instead of) RDTSC, I can use RDPMC to read a counter that measures the number of cycles spent in Ring 0 (kernel) instead of Ring 3 (user)? Is there maybe a counter already set up to do this, or would I need to set up my own? Do I need to create my own kernel module to do this, or can I use existing ioctls?

Some background:

I'm mostly running Ubuntu 14.03 with Linux 4.2.0 on an Intel Skylake i7-6700, but am also testing on Intel Sandy Bridge and Haswell. I've already done my best to reduce the jitter on the system as much as possible. I've recompiled a tickless kernel with CONFIG_NOHZ_FULL, no forced preemption, transparent hugepage support off, and the timer frequency at 100 Hz.

I've stopped most unnecessary processes, and removed most unnecessary kernel modules. I'm using cpuset / cset shield to reserve one of the NoHZ cores for a single process, and have used kernel/debug/tracing to verify that I'm getting very few interrupts. But I'm still getting just enough that exact measurements are difficult. Maybe more importantly, I can envision future long-tail circumstances (a hash table that rarely needs to be resized) where being able to distinguish between valid and invalid measurements would be very helpful.

I'm measuring execution time with RDTSC/RDTSCP using the techniques that Intel suggests in their whitepaper, and generally getting the accuracy I expect. My tests involve searching for 16-bit values, and I repeatedly and individually time each of the 65536 possible searches on random arrays of varying lengths. To prevent the processor from learning correct branch prediction, the searches are repeated in a different order each time. The searched array is removed from cache after each search with "CLFLUSH".
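
To make that concrete, here is a stripped-down sketch of the timing harness (not my exact code): it serializes with CPUID around RDTSC/RDTSCP as the whitepaper suggests, and flushes the array with CLFLUSH before each timed search. The `binary_search` routine is just a placeholder for whatever implementation is under test.

    /* Simplified timing harness: CPUID serializes before RDTSC and after
     * RDTSCP (per Intel's whitepaper); CLFLUSH evicts the array so the
     * search always runs against uncached data.  binary_search() is a
     * placeholder for the routine being measured. */
    #include <stddef.h>
    #include <stdint.h>
    #include <cpuid.h>         /* __get_cpuid */
    #include <x86intrin.h>     /* __rdtsc, __rdtscp, _mm_clflush, _mm_mfence */

    extern size_t binary_search(const uint16_t *arr, size_t n, uint16_t key);

    static inline void serialize(void)
    {
        unsigned a, b, c, d;
        __get_cpuid(0, &a, &b, &c, &d);    /* CPUID is fully serializing */
    }

    static void flush_array(const uint16_t *arr, size_t n)
    {
        const char *p = (const char *)arr;
        const char *end = (const char *)(arr + n);
        for (; p < end; p += 64)           /* one CLFLUSH per cache line */
            _mm_clflush(p);
        _mm_mfence();                      /* wait for the flushes to complete */
    }

    uint64_t time_one_search(const uint16_t *arr, size_t n, uint16_t key)
    {
        unsigned aux;

        flush_array(arr, n);

        serialize();
        uint64_t start = __rdtsc();

        volatile size_t result = binary_search(arr, n, key);
        (void)result;

        uint64_t stop = __rdtscp(&aux);    /* waits for the search to retire */
        serialize();

        return stop - start;
    }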

This is a research project, and my goal is to learn about these issues. As such, I'm willing to consider approaches that might otherwise be considered silly and extreme. Custom kernel modules, protected mode x64 assembly, untested kernel modifications, and processor specific features are all fair game. If there's a way to get rid of the few remaining interrupts so that all measurements are "real", that might be a viable solution as well. Thanks for suggestions!

Nathan Kurz
  • I can't tell what your question is. All the detail doesn't add clarity here. – Jeff Hammond Feb 04 '16 at 03:55
  • Hi Jeff -- Sorry if I failed on clarity. The questions are the two parts in bold. I'm looking for a way to determine retroactively if the processor was interrupted during an RDTSC measurement. For background, consider this thread: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/405642 where Patrick says "Note that, in the ring3 (user land), you may also get interrupts right in the middle of your code, which may mess up your counts." I'd like to figure out how to determine when that has happened. – Nathan Kurz Feb 04 '16 at 05:10
  • Can you disable interrupts? – Jeff Hammond Feb 04 '16 at 05:47
  • I would assume first you need to be testing the fastest possible version of binary search. Check Bentley "Programming Pearls" p. 87. Example: `j=i+8; if (a[j] < key) i=j;` is one line in an unrolled list of statements, where in place of the number 8 there are descending powers of 2. Then for timing purposes, just run it 10^9 times and use a stopwatch. – Mike Dunlavey Feb 05 '16 at 18:19
  • Hi Mike -- No, the goal isn't to test the fastest search, but to characterize different search approaches. Separately, Bentley's approach doesn't produce the fastest search on modern processors. Because of the decoded instruction cache and loop cache, unrolling doesn't make much of a difference anymore. Because of SIMD and multiscalar execution, counting instructions doesn't match up to execution time. What really matters is memory prefetching and branch prediction. This paper focuses primarily on different layouts, but covers the issues I'm interested in: http://arxiv.org/abs/1509.05053. – Nathan Kurz Feb 06 '16 at 00:23

2 Answers

4

I assume that you have shielded your benchmarking thread to the extent possible:

  • It has exclusive access to its CPU core (not just a hyperthread); see here on how to manage this easily.
  • Interrupt affinities have been moved away from that core, see here.
  • If possible, run a nohz kernel in order to minimize timer ticks.

Furthermore, you should not jump into kernel space from within your benchmarked code path: upon return, your thread might get scheduled away for some time.

However, you simply can't get rid of all interrupts on a CPU core: on Linux, the local APIC timer interrupt, interprocessor interrupts (IPIs) and others are used for internal purposes, and you can't get rid of them. For example, timer interrupts are used to ensure that threads eventually get scheduled. Likewise, IPIs are used to trigger actions on other cores, such as TLB shootdowns.

Now, thanks to the Linux tracing infrastructure, it is possible to tell from userspace whether a hardirq has happened during a certain period of time.

One minor complication is that Linux treats two classes of interrupts differently with respect to tracing:

  1. First, there are the "real" external hardware interrupts used by actual devices like network adapters, sound cards and the like.
  2. Second, there are the interrupts Linux uses internally.

Both are hardirqs in the sense that the processor asynchronously transfers control to an interrupt service routine (ISR) as dictated by the interrupt descriptor table (IDT).

Usually, in Linux, the ISR is simply a stub written in assembly which transfers control to a high-level handler written in C.

For details, refer to arch/x86/entry/entry_64.S in the Linux kernel sources. For the Linux-internal interrupts, a tracing stub is defined for each, while for the external interrupts, tracing is left to the high-level interrupt handlers.

This is why there is one tracing event for each internal interrupt:

# sudo perf list | grep irq_vectors:
  irq_vectors:call_function_entry                    [Tracepoint event]
  irq_vectors:call_function_exit                     [Tracepoint event]
  irq_vectors:call_function_single_entry             [Tracepoint event]
  irq_vectors:call_function_single_exit              [Tracepoint event]
  irq_vectors:deferred_error_apic_entry              [Tracepoint event]
  irq_vectors:deferred_error_apic_exit               [Tracepoint event]
  irq_vectors:error_apic_entry                       [Tracepoint event]
  irq_vectors:error_apic_exit                        [Tracepoint event]
  irq_vectors:irq_work_entry                         [Tracepoint event]
  irq_vectors:irq_work_exit                          [Tracepoint event]
  irq_vectors:local_timer_entry                      [Tracepoint event]
  irq_vectors:local_timer_exit                       [Tracepoint event]
  irq_vectors:reschedule_entry                       [Tracepoint event]
  irq_vectors:reschedule_exit                        [Tracepoint event]
  irq_vectors:spurious_apic_entry                    [Tracepoint event]
  irq_vectors:spurious_apic_exit                     [Tracepoint event]
  irq_vectors:thermal_apic_entry                     [Tracepoint event]
  irq_vectors:thermal_apic_exit                      [Tracepoint event]
  irq_vectors:threshold_apic_entry                   [Tracepoint event]
  irq_vectors:threshold_apic_exit                    [Tracepoint event]
  irq_vectors:x86_platform_ipi_entry                 [Tracepoint event]
  irq_vectors:x86_platform_ipi_exit                  [Tracepoint event]

while there is only a single general tracing event for the external interrupts:

# sudo perf list | grep irq:
  irq:irq_handler_entry                              [Tracepoint event]
  irq:irq_handler_exit                               [Tracepoint event]
  irq:softirq_entry                                  [Tracepoint event]
  irq:softirq_exit                                   [Tracepoint event]
  irq:softirq_raise                                  [Tracepoint event]

So, trace all these IRQ *_entries for the duration of your benchmarked code path and you know whether your benchmark sample has been poisoned by an IRQ or not.

Note that there is a third kind of hardware interrupt on x86: exceptions. At the very least, I would check for page faults, too, and for NMIs, which are missed by the events above (check nmi:nmi_handler).
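
As a bare-bones illustration of the idea (much less complete than the code linked below), the following opens a handful of these *_entry tracepoints as a single perf event group around the timed section; a nonzero total afterwards means the sample was disturbed. The event list here is only an example, and resolving the tracepoint ids requires access to /sys/kernel/debug/tracing plus suitable perf_event permissions.

    /* Sketch: detect IRQs/exceptions during a benchmark sample by counting
     * tracepoint events with perf_event_open().  The event list below is an
     * illustrative subset; extend it as discussed above. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long tracepoint_id(const char *event)   /* e.g. "irq_vectors/local_timer_entry" */
    {
        char path[256];
        long id = -1;
        snprintf(path, sizeof path,
                 "/sys/kernel/debug/tracing/events/%s/id", event);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &id) != 1)
                id = -1;
            fclose(f);
        }
        return id;
    }

    static int open_tracepoint(long id, int group_fd)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.size = sizeof attr;
        attr.config = id;
        attr.disabled = (group_fd == -1);          /* only the leader starts disabled */
        if (group_fd == -1)
            attr.read_format = PERF_FORMAT_GROUP;  /* read all counts in one go */
        return syscall(SYS_perf_event_open, &attr, 0 /* this thread */,
                       -1 /* any cpu */, group_fd, 0);
    }

    int main(void)
    {
        const char *events[] = {                   /* illustrative subset */
            "irq_vectors/local_timer_entry",
            "irq_vectors/call_function_entry",
            "irq/irq_handler_entry",
            "exceptions/page_fault_user",
            "nmi/nmi_handler",
        };
        enum { NEVENTS = sizeof events / sizeof events[0] };
        int leader = -1;

        for (int i = 0; i < NEVENTS; i++) {
            int fd = open_tracepoint(tracepoint_id(events[i]), leader);
            if (fd == -1) { perror(events[i]); return 1; }
            if (leader == -1) leader = fd;
        }

        ioctl(leader, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
        ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

        /* ... the RDTSC/RDTSCP timed section goes here ... */

        ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

        /* PERF_FORMAT_GROUP layout: { nr, value[0], ..., value[nr-1] } */
        uint64_t buf[1 + NEVENTS];
        uint64_t total = 0;
        if (read(leader, buf, sizeof buf) >= 0)
            for (unsigned i = 0; i < buf[0] && i < NEVENTS; i++)
                total += buf[1 + i];
        printf("%s\n", total ? "sample poisoned by IRQ/exception" : "sample clean");
        return 0;
    }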

For your convenience, I've put together a little piece of code for tracking IRQs during your benchmarked code path. See the included example.c for usage. Note that access to /sys/kernel/debug is needed in order to determine tracepoint IDs.

Nicolai Stange
  • Thanks, I think this is a very promising direction. Yes, I think I've done everything you suggest to reduce the interruptions, and the goal at this point is just to detect the remaining few. I've been reading lots about the dtrace interface, but hadn't seen examples using perf_event for this. I'll play with your code, and then respond more fully. I think this approach may work well for me with just minor modification to cover the 1 Hz hrtimer housekeeping events like Frederic shows here: https://kernel.googlesource.com/pub/scm/linux/kernel/git/frederic/dynticks-testing/+/master/trace.1.example – Nathan Kurz Feb 06 '16 at 00:42
  • If hrtimers aren't covered by my code yet, simply add `timer:hrtimer_expire_entry` to the `event_names` list in `lxdetectirq.h`. – Nicolai Stange Feb 06 '16 at 08:50
  • `timer/hrtimer_expire_entry`, of course: these are filenames. – Nicolai Stange Feb 06 '16 at 08:59
  • Two things: first of all, I believe that a `hrtimer_expire_entry` can only get triggered as a consequence of a `local_timer_entry`. Second: I've spotted some untraceable interrupts. You can find a more detailed discussion here: http://nicst.de/bench-user-irq-detect.html – Nicolai Stange Feb 06 '16 at 15:42
  • This is fabulous work and a great example, but I think I've concluded that 'perf_event' isn't the best abstraction here unless there is some 'wildcard' approach I haven't seen yet. If I were to use it as-is, I don't think there is any way to know if I'd missed some interrupt that wasn't explicitly on the list. From what I can tell, 'perf_event' forces a 'black list' approach, whereas the problem seems like a better fit for a 'white list' (starting with an empty white list). I think an approach that works with file I/O directly on the 'ftrace' level is more natural here. Thoughts? – Nathan Kurz Feb 06 '16 at 22:07
  • Well, the blacklist has been assembled by reading `arch/x86/entry_64.S` which contains all entry points into kernel space. If there is no entry into kernel space, then user space runs uninterrupted. I claim that the blacklist is exhaustive (assuming you eliminated the untraceable interrupts and exceptions), so no missed interrupts. And to answer your question: yes, there is no wildcard approach with perf_events. Regarding ftrace: what exactly would you trace? – Nicolai Stange Feb 06 '16 at 22:45
  • You make a fairly compelling case that the list is sufficient. With the 'ftrace' interface, though, I think I can just write "1" into "tracing/events/enable" at the beginning, and then write to "tracing/tracing_on" before and after the RDTSC. To test if any interrupts occurred, I'd see if anything can be read from "tracing/trace". I think this would be easier than having a file-descriptor per event? Although I'm not sure exactly how the read from "tracing/trace" will fail --- will I need to reset it after EOF? I also like that it leaves a "trace" file that can be viewed as desired. – Nathan Kurz Feb 07 '16 at 03:08
  • Ok, it *might* be possible to use the ftrace interface through `/sys/kernel/debug/` here. I doubt that it is easier though, for several reasons. Tracing the predefined tracepoint events can be enabled through `tracing/events/enable` (all), `tracing/events/<subsystem>/enable` or `tracing/events/<subsystem>/<event>/enable`. The first one is impractical since your mere read from `trace` will trigger a whole lot of events by itself. Thus, you're left with "blacklisting" (enabling) subsystems again. In your particular case, you'd choose `irq`, `irq_vectors`, `exceptions` and `nmi`. See `tracing/README`. – Nicolai Stange Feb 07 '16 at 11:27
  • Thus, you need to know what you want to trace with either approach. What's more, the `/sys/kernel/debug/tracing/` interface is far more complicated than `perf_event_open(2)`: you'd have to restrict tracing to your benchmarking thread by writing to `tracing/set_event_pid`. Switching tracing on and off would have to be done by writing to `tracing/tracing_on`. You'd have to read from `tracing/trace` and truncate it afterwards. And what's most important: you'd have to verify the guarantees reading from `tracing/trace` gives in order to never ever miss an event. I can't say that all this sounds like fun... – Nicolai Stange Feb 07 '16 at 11:38
  • I appreciate your thoughts. I've marked your answer as 'accepted' (because it's really solid) but I'm still thinking the 'ftrace' approach might come out cleaner. I don't particularly like the interface, but I feel like it's less complicated rather than more. Perf's one-fd-per-event feels awkward to me. For 'ftrace', it's only the 'noise' of stopping that tracing that concerns me. There are lots of workarounds, but it does add to the complexity. I'll try out both ways to see what works best, and return to post the 'ftrace' solution for comparison. Thanks for your assistance and ideas! – Nathan Kurz Feb 08 '16 at 10:24
  • Thank you very much, Nathan :) However you do it in the end, may I ask what exactly concerns you about perf's "one-fd-per-event"? – Nicolai Stange Feb 08 '16 at 12:30
  • Hi Nicolai. My concerns may be irrational---mostly the perf approach just feels wrong. One issue is that I'm going to be measuring very small things (100-1000 cycles), and the overhead of multiple system calls after each reading may limit how many repetitions I can do. The other (stronger but amorphous) reason is that by looking at the kernel source, I feel I can understand how ftrace works. But with perf_event_open(), I get lost in the layers of abstraction. The interface itself is OK, but I don't feel that I can trust the results unless I understand what's happening under the hood. – Nathan Kurz Feb 11 '16 at 01:17
  • Ok, just to make it explicit: there is one fd per event, but all events are controlled and read through the group leader's fd. Thus, you get 2 ioctls (start, stop) and 1 read per iteration. The other, per-event fds are only there to "assist" the kernel in housekeeping. – Nicolai Stange Feb 11 '16 at 03:15
  • Explicit is good. I think I understand that your lxdetectirq code has 3 syscalls per iteration: start, stop, read. My uncertainty is what happens underneath that level to "multiplex" the individual events. I've since traced the execution, and realize that the situation isn't as bad as I thought. On the other hand, it's not great either. Here are traces of one iteration https://gist.github.com/nkurz/9cda8281bc2179e68ec1 and https://gist.github.com/nkurz/45824bfdd95a756dcbe6 . I'm trying to understand better how the changes to the "control" fd are propagated to the individual events. – Nathan Kurz Feb 11 '16 at 21:05
  • Ok, so you measured a total of approximately 40us of overhead per iteration due to the perf_event controlling and reading (the user space <-> kernel space crossings not included, btw). Let's be generous and say 100us per iteration. Plus 1000 cycles for your actual measurement is another 1us on a 1GHz CPU and thus negligible. That's 1s/100us=10,000 samples per second. Should be enough to get some degree of statistical significance ;) Regarding how it works under the hood: the entry point for an ioctl is here: http://lxr.free-electrons.com/source/kernel/events/core.c#L4324 (`perf_ioctl`). – Nicolai Stange Feb 12 '16 at 01:34
1

I am aware of two "fast" ways to observe interrupts on x86, the first of which I have used myself.

  1. You can use userspace rdpmc to read the hw_interrupts.received event directly, before and after your tested section, to determine whether any interrupts occurred. To program the counter in the first place and to handle the read, I've listed some libraries in this answer. If I were starting a new project now I'd probably use pmu-tools, or perhaps just use perf_event_open directly, since it isn't terribly hard to implement (see the sketch after this list).

  2. You can set either %fs or %gs to a non-zero value before your timed region and then check that the value is unchanged afterwards. If it has been reset to zero, an interrupt occurred, because the iret instruction resets these registers. On x86-64 you are best off using %gs, since %fs is used for thread-local storage. Full details are in this blog post, which is where I learned about it.
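
For option 1, here is a minimal sketch (not my exact code) of the perf_event_open route: the kernel programs the counter, and the user page exported through mmap tells you which PMC to read with RDPMC. The raw encoding 0x01cb for hw_interrupts.received is the Skylake one, so treat it as an assumption and check the event tables for other chips; a production version would also use the lock/seqcount and offset fields from the mmap page, as described in the perf_event_open man page.

    /* Sketch: read hw_interrupts.received with RDPMC around the timed
     * section.  The raw config 0x01cb (event 0xCB, umask 0x01) is the
     * Skylake encoding and is assumed here. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>
    #include <x86intrin.h>                 /* __rdpmc */

    static uint64_t read_pmc(const volatile struct perf_event_mmap_page *pc)
    {
        uint32_t idx = pc->index;          /* 0 => event not currently on a PMC */
        return idx ? __rdpmc(idx - 1) : 0;
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof attr;
        attr.config = 0x01cb;              /* HW_INTERRUPTS.RECEIVED (Skylake, assumed) */

        int fd = syscall(SYS_perf_event_open, &attr, 0 /* this thread */,
                         -1 /* any cpu */, -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        struct perf_event_mmap_page *pc =
            mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
        if (pc == MAP_FAILED || !pc->cap_user_rdpmc) {
            fprintf(stderr, "userspace rdpmc not available\n");
            return 1;
        }

        uint64_t before = read_pmc(pc);

        /* ... the RDTSC/RDTSCP timed section goes here ... */

        uint64_t after = read_pmc(pc);

        printf("%s\n", after != before ? "interrupt(s) during sample" : "sample clean");
        return 0;
    }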

BeeOnRope