Is there a way to record every memory access of a given program, including timestamps? Can `perf` be used to do that?
-
Please describe specifically how you already use `perf` traces and exactly what's missing. This is highly hardware-specific, so it helps to know what CPU you use. – Zulan Aug 30 '18 at 14:17
-
Given the three excellent answers this question attracted, I tried to make the question more clear and different from https://stackoverflow.com/q/44080947/620382 (which asks about the entire system). It is still rather broad, but I think the answers show it is in fact answerable. – Zulan Aug 31 '18 at 07:55
3 Answers
If you are on Intel, I think the Intel PT feature mentioned in the other answers, combined with post-processing and analysis, is most likely to get you what you want at high speed (i.e., with something like a single-digit performance regression).
If you don't care about performance, you could use any number of binary instrumentation frameworks to get this information. For example, the valgrind framework has a cachegrind tool which captures every memory access and uses them to estimate cache behavior based on an idealized caching model.
You could pretty much modify the cachegrind tool to spit out the list of accesses you are after, along with a timestamp. Of course, the problem is that cachegrind probably runs something like 10 times slower than the native application, so your timestamps will both be "stretched out" and distorted (i.e., because various parts of the program might have different instrumentation overheads).
Whether that matters for your application is up to you.
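If you don't want to modify anything, Valgrind's bundled `lackey` tool can already print every instruction fetch, load, and store (though without timestamps), which is a quick way to see what such a trace looks like; `./myapp` is a placeholder for your program:

    # one line per access: I = instr fetch, L = load, S = store, M = modify
    valgrind --tool=lackey --trace-mem=yes ./myapp

Each line gives the access type, address, and size; you could timestamp the lines yourself, with the same caveat that lackey slows the program down enormously.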
The nice thing about Valgrind is that it doesn't rely on any particular hardware and works across different hardware architectures. It is probably also easier than getting an Intel PT-based analysis working - although I'm not 100% sure, since I haven't tried either myself.
If you don't care about the total runtime of the actual process while you are recording, but need mostly accurate timing figures, you could also consider running your process under a CPU simulator, such as the Sniper x86 simulator or gem5 that Peter mentions in the comments.
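As a very rough sketch (gem5's configuration scripts and flags vary a lot between versions, so treat this as an assumption about the classic setup), a syscall-emulation run of a single binary with caches enabled looks something like:

    # SE-mode run of one binary under gem5's classic memory system
    build/X86/gem5.opt configs/example/se.py --cmd=./myapp --caches --l2cache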
This site, which describes the CMP$im tool, may be very useful for you. It is able to produce a trace of accesses using Intel's Pin technology, which @Leeor also mentioned in the comments below. I recommend taking a look at the author's associated papers, linked from that site.
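If you go the Pin route, the kit ships a `pinatrace` example that writes exactly such a per-access trace. Assuming a Pin 3.x kit unpacked in `pin-dir` (adjust the paths to your kit), the usual workflow is roughly:

    # build the bundled memory-trace example, then run it on some program
    cd pin-dir/source/tools/SimpleExamples
    make obj-intel64/pinatrace.so
    ../../../pin -t obj-intel64/pinatrace.so -- /bin/ls
    # the address trace ends up in pinatrace.out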

-
The [gem5](http://www.gem5.org/Main_Page) cache simulator might also be useful, if you're considering `cachegrind`. But I haven't used either of those tools for anything; you'll probably have a hard time simulating the combination of CPU instruction latencies and out-of-order execution's ability to find parallelism out of your cache-miss pattern with simulators vs. just running real code on real hardware. (Good answer, just adding caveats for the OP.) – Peter Cordes Aug 30 '18 at 18:09
-
@PeterCordes - yes, could also use a simulator, which kind of gives you an intermediate point between listing the actual accesses in a full-speed run (maybe hard or impossible depending on the hardware), and listing the accesses exactly using instrumentation or emulation, but with different hardware characteristics. Simulation like gem5 or sniper gives you the flexibility to do whatever you want but still have approximate real-life performance as determined by the simulator (the actual speed of simulation is still really slow, of course). – BeeOnRope Aug 30 '18 at 19:19
-
I haven't used any of these simulators, but I've considered writing one :). I'm quite sure the combined knowledge from the x86 tag on SO could result in a more accurate simulator than what is out there today, at least for x86. An interesting use case for a simulator is deterministic performance regression testing: on every change you could run the simulator and flag any regressions beyond X% or whatever. This would be very difficult if you just timed the actual code, due to noise, non-determinism, varying underlying hardware, etc. – BeeOnRope Aug 30 '18 at 19:21
-
Simulators may be hard to use with arbitrary code, and gem5 is notoriously bad at x86 in my experience. I'd suggest emulation/binary-instrumentation tools like Pin; it comes with a simple built-in example that does exactly that. – Leeor Sep 01 '18 at 16:42
The closest hardware capability I can think of is Intel PT (Processor Trace), which can record timestamps on every (taken?) branch, so you can reconstruct execution down to the block containing the loads. I haven't used PT, and I'm not sure if `perf` can use it, or if you need different programs.
(Not exactly a "basic block", because there's no record when execution runs past the target of a branch from somewhere else.)
Those timestamps would probably only tell you when the load instructions were issued, not when out-of-order execution actually ran them or when the data arrived from memory/L1d cache. I don't think any existing x86 chips can record accurate timestamps for every load completion; that would be too much data.
If you're looking for memory hotspots, I'd suggest profiling with `perf record -e mem_load_retired.l3_miss,mem_load_retired.l2_miss` or similar counters, to look for loads that miss often in different levels of cache. There are some store events too, but the interesting ones are mostly for loads, because the CPU has to wait for load data to arrive before it can use it. Maybe also `dtlb_load_misses.miss_causes_a_walk` or other TLB-miss events. There's also a `cycle_activity.stalls_l3_miss` event, which counts cycles where execution is stalled while an L3 miss is outstanding, useful for finding cases where OoO exec couldn't hide cache-miss latency.
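A hedged example session (these event names exist on Skylake-era cores; `./myapp` is a placeholder):

    # sample L3-missing loads with precise (PEBS) attribution, then inspect hotspots
    perf record -e mem_load_retired.l3_miss:pp ./myapp
    perf report --stdio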
Use `perf list` to see the events `perf` knows about. If your `perf` is old, you might need the `ocperf.py` wrapper for it: https://github.com/andikleen/pmu-tools

Intel PT will record timestamps and track the control flow of a running application via various packets that the hardware logs. These packets can then be fed to a decoder to obtain a disassembled trace of the executed instructions. Intel PT has also been integrated into perf.
You can use `perf` with Intel PT as an event, like this:
perf record -e intel_pt//[uk] /bin/ls
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.384 MB perf.data ]
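Decoding is a separate step after recording. With a reasonably recent perf (the `--itrace` syntax may differ on older versions), you can turn the PT data into a timestamped per-instruction listing:

    # synthesize an event per instruction (period 0 ns), print ns-resolution timestamps
    perf script --itrace=i0ns --ns -F time,ip,sym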
However, what I would suggest is using PEBS (Precise Event Based Sampling). PEBS is a feature available for a subset of events which allows the hardware to collect additional information very close to the exact time the configured event overflowed. You can use PEBS with `perf` as well.
Say you want to record information related to memory loads. The PEBS counter will be initialized to a certain maximum value (which is effectively the sampling period). The counter then decrements by one with each memory load. As soon as the counter hits zero, the PEBS hardware gets armed: the next memory load event causes a PEBS record to be written into the PEBS buffer. Once this happens, the PEBS counter is automatically reset to its starting value. This is how a sample period of 2 causes the system to record every second memory load.
Anyway, one benefit of using PEBS is that it is very precise, as you might guess from how it works: unlike most other recording mechanisms, which essentially wait for a software interrupt to record the event details (so the recording happens hundreds of CPU cycles after the event), PEBS captures the machine state right next to the triggering event.
Use PEBS in conjunction with `perf` to record memory loads like this -
perf record -e r81d0:pp -c 1 -d <application_name> <application_params>
`r81d0:pp` is the raw numeric encoding of the "memory loads among retired instructions" event. In certain cases, a CPU architecture or perf version will not support an event by name, and one is forced to use raw numeric events like this.
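To actually inspect the sampled accesses afterwards, `perf script` can print the recorded timestamps and data addresses (the `-d` flag above is what makes the `addr` field available; exact field support varies by perf version):

    # one line per sample: timestamp, sampled instruction, and the data address
    perf script -F time,ip,sym,addr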
However, like Peter said, and as has been highlighted in many other questions and answers here, it is impossible to record 100% of memory load or store addresses without external hardware mechanisms and/or significant runtime overhead.
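Note also that newer perf versions wrap much of this PEBS memory sampling into a convenience command:

    # sample memory accesses (addresses, latencies), then summarize them
    perf mem record ./myapp
    perf mem report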
Want to read more about PEBS? Intel's Software Developer's Manual will be your best friend.

-
I'd recommend `mem_inst_retired.all_loads` instead of a numeric event value (with `ocperf.py` if your `perf` is too old for that event). Are you sure all CPUs use the same event number for that event? Anyway, finding how often each load instruction runs on average sounds pretty far from what the OP was asking for. – Peter Cordes Aug 30 '18 at 17:18
-
The OP was asking for point-in-time memory accesses. I think PEBS is as close as you can get to point-in-time accesses. But yes, it will not record all of the memory accesses - I have already mentioned that. – Arnabjyoti Kalita Aug 30 '18 at 17:22
-
Oh, right with `-c 1` then yeah, the counter should roll over every load, giving you a PEBS record. If `perf` lets you set such a low counter threshold. Yeah that's an interesting idea that might give you some chunks of useful data from between `perf` interrupts to collect the PEBS buffer. – Peter Cordes Aug 30 '18 at 17:34
-
I would be kind of surprised if `-c 1` gave you every event, but maybe. If it's actually interrupting every access though it will be incredibly slow, probably much slower even than instrumentation or emulation based approaches. – BeeOnRope Aug 30 '18 at 17:56
-
I don't disagree going with PEBS, but I think your arguments are slightly off. It seems that [in many cases](https://github.com/torvalds/linux/commit/3569c0d7c5440d6fd06b10e1ef9614588a049bc7), Linux still gets an interrupt after each PEBS event. Getting [timestamps from PEBS](https://github.com/torvalds/linux/blob/9f8f16c86e4d9e2afcbdcd6045981c4d9129450e/arch/x86/events/intel/ds.c#L1290) itself is possible since Skylake. So just from using PEBS there is no guarantee that the *timestamps* are precise for the event. – Zulan Aug 30 '18 at 19:53
-
No, Linux does not get an interrupt after each PEBS event. I am quoting the Intel software manual: "Only PEBS buffer overflows are the sources of an interrupt if we handle precise events (e.g. PEBS). However, for non-precise events, detection of counter overflows are usually the indications of an interrupt". I was thinking along the same lines as you, @Zulan, when I started studying PEBS - but I stand corrected now – Arnabjyoti Kalita Aug 31 '18 at 01:31
-
Yes @BeeOnRope, a period of 1 only theoretically allows you to record each and every memory access; in practice it is nearly impossible to record everything. And interrupts will not happen after every collection - only when the PEBS buffer gets full, as I mentioned in my earlier comment. – Arnabjyoti Kalita Aug 31 '18 at 01:35
-
@ArnabjyotiKalita - what the Intel manual says may be correct, but Linux may configure it to get an interrupt after each event (e.g., by setting the PEBS buffer size to 1). As I recall that's how it was initially implemented, since it made all the new PEBS events more or less "just work" within the existing interrupt-based perf framework. Newer kernels have fixed that though, IIRC. – BeeOnRope Aug 31 '18 at 01:49
-
@BeeOnRope that is true though. The PEBS threshold interrupt was indeed set to be just 1 PEBS record away from the PEBS buffer base in Linux kernels from about a year ago. This meant that the interrupt fired after the PEBS mechanism wrote one record to the buffer. I tuned the threshold to 300 IIRC and managed to record at least 20% of all memory access details (versus about 10% when the threshold was 1). – Arnabjyoti Kalita Aug 31 '18 at 01:56
-
For what it's worth, I tried `-c 1` with a test app that does 1 million loads in a tight loop. I only got ~18,000 samples on the load line. That's much more than the default count, which gave 4 or 5. `-c 300` gave only ~3000 samples, while `-c 3000` gave only ~300. I'm not sure `perf record` is actually reading all PEBS samples. – BeeOnRope Aug 31 '18 at 02:35
-
@ArnabjyotiKalita do you know how to verify if a `perf record` run really uses a threshold > 1? There are [quite some restrictions](https://github.com/torvalds/linux/blob/4658aff6eeaaf0b049ce787513abfc985c452e3a/arch/x86/events/intel/core.c#L3001). Also you [cannot *configure* the threshold](https://github.com/torvalds/linux/blob/4658aff6eeaaf0b049ce787513abfc985c452e3a/arch/x86/events/intel/ds.c#L908), it's always `PAGE_SIZE << 4`. – Zulan Aug 31 '18 at 07:46
-
@Zulan The `perf record` will use a threshold > 1 only if the total number of PEBS events != number of freerunning PEBS events. In my case, I tried a couple of things - 1) I increased the [PEBS buffer size](https://elixir.bootlin.com/linux/latest/source/arch/x86/events/intel/ds.c#L354) to `PAGE_SIZE << 18`. 2) I commented out the check for `if (cpuc->n_pebs == cpuc->n_large_pebs) {` in the threshold code and instead modified this line - `threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;`. I tried them separately and in both cases, observed superior recording. – Arnabjyoti Kalita Sep 02 '18 at 03:08