
I have been looking for a way to log all memory accesses of a process/execution on Linux. I know there have been questions asked on this topic previously, like this one:

Logging memory access footprint of whole system in Linux

But I wanted to know if there is any non-instrumentation tool that performs this activity. I am not looking at QEMU/Valgrind for this purpose, since they would be a bit slow and I want as little overhead as possible.

I looked at perf mem and PEBS events like cpu/mem-loads/pp for this purpose, but I see that they only collect sampled data, and I actually want a trace of all memory accesses without any sampling.

I wanted to know whether there is any way to collect all memory accesses without paying the kind of overhead a tool like QEMU imposes. Is it possible to use perf alone, but without sampling, so that I get all the memory access data?

Is there any other tool out there that I am missing? Or any other strategy that gives me all the memory access data?

Arnabjyoti Kalita
  • How long is the process you want to record? Does it run for 1 second? How often are there memory access instructions in it, around 1 of every 3 instructions? So (assuming a 3 GHz CPU with IPC ~1) you will have 1,000,000,000 memory accesses, each with around 8 bytes of metadata (type, target address of 48-52 bits) - that is 8 gigabytes! **You can't record all memory accesses of a program without overhead** (without very costly 20k USD+ hardware sniffers). You may sample 1/1000 or 1/100000 of them with PT/perf; or you may record all of them with a 10x-20x-50x slowdown with Valgrind or QEMU or some other simulator. – osgx May 20 '17 at 03:19
  • Hi @osgx, I was specifically looking at SPEC 2006 programs (the longest-running program is probably 8 minutes, which I think would be very large). This means that I cannot use any of those tools. I will need to use hardware sniffers if I want to record all memory accesses, right? Other than that, I do not have other options if I really want to avoid overhead. – Arnabjyoti Kalita May 20 '17 at 14:53
  • Hi @osgx, is there any way I can use perf/PT/PEBS to collect all the memory accesses without sampling? Or should I follow a different strategy altogether? – Arnabjyoti Kalita May 20 '17 at 15:50
  • **Why do you want to avoid any overhead?** (For a 10-minute program and 10x overhead you will wait 2 hours, which is faster than waiting for an answer here.) Do you have money to buy hardware sniffers (you need a JTAG XDP sniffer http://blog.asset-intertech.com/test_data_out/2016/07/the-three-types-of-jtag-access-on-intel-based-designs.html to get all accesses to the cache, or a bus/DDR sniffer to get only real memory accesses)? Can you estimate how many accesses there are and how long it will take to write them down? Why do you want every memory access to be logged? – osgx May 20 '17 at 15:57
  • I wanted to know if it actually is possible without any overhead. We are trying to see that. There is no precise reason why we want every memory access; rather, we are trying to measure the overhead of logging all memory accesses. There are a lot of memory access events in our programs (possibly to the tune of millions). We are not interested in using any hardware sniffers as of now. – Arnabjyoti Kalita May 29 '17 at 16:34
  • Kalita: the bandwidth of the memory subsystem is limited. Some programs (BLAS1, SpMV, STREAM, RandomAccess, memlat; listed in https://stackoverflow.com/a/44234636) saturate the memory subsystem fully (bandwidth- or latency-limited). Any in-system memory access tracing (PEBS, PT, ...) will double the necessary memory bandwidth (or triple it: for every memory access you will write tens of bytes of tracing data into the same subsystem). There will be overhead. If you have no reason to want "no overhead at all", just record with overhead and measure the overhead. Full memory logging is not for production servers. – osgx May 29 '17 at 16:53

1 Answer


It is just impossible to have both the fastest possible run of SPEC and all memory accesses (or cache misses) traced in that same run (using in-system tracers). Do one run for timing and another run (longer, slower), or even a recompiled binary, for memory access tracing.

You may start with a short and simple program (not the ref inputs of recent SPEC CPU, or the billions of memory accesses in your big programs) and use the perf Linux tool (perf_events) to find an acceptable ratio of memory requests recorded to all memory requests. There is the perf mem tool, or you may try some PEBS-enabled events of the memory subsystem. PEBS is enabled by adding a :p or :pp suffix to the perf event specifier: perf record -e event:pp, where event is one of the PEBS events. Also try pmu-tools' ocperf.py for easier Intel event name encoding and to find PEBS-enabled events.
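A minimal sketch of both approaches (`./a.out` is a placeholder for your workload; exact event names such as `cpu/mem-loads/` vary by CPU and kernel, and on some kernels the load event additionally needs an `ldlat=` latency threshold):

```shell
# perf mem wraps the PEBS mem-loads/mem-stores events for you:
perf mem record ./a.out       # sample data addresses of loads/stores
perf mem report --stdio       # show sampled addresses, sources, latencies

# Roughly equivalent explicit form: a PEBS-capable load event with
# precise sampling, recording every 1000th qualifying access (-c 1000):
perf record -e cpu/mem-loads/pp -c 1000 ./a.out
perf report --stdio
```

Lowering `-c` (the sample period) raises the recorded ratio and the overhead; `-c 1` is as close to "every access" as PEBS gets, and it is still sampling, not a complete trace.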

Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory performance tests. Check the worst case of memory-recording overhead on the left part of the Arithmetic Intensity scale of the [Roofline model](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/). Typical tests from this part are: STREAM (BLAS1), RandomAccess (GUPS), and memlat (almost SpMV); many real tasks are usually not so far left on the scale.
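One hedged way to sweep those ratios is to time the same bandwidth-bound benchmark at several sample periods and compare against an untraced baseline (`./stream` is a placeholder for a STREAM-like binary; `/usr/bin/time -f` is GNU time):

```shell
# Baseline run, no tracing:
/usr/bin/time -f "baseline: %e s" ./stream

# Denser sampling = smaller period = more overhead.
# Period 100 ~ 1% of accesses if ~1 in 3 instructions touches memory.
for period in 100000 10000 1000 100; do
  /usr/bin/time -f "period $period: %e s" \
    perf record -o perf.$period.data \
      -e cpu/mem-loads/pp -c "$period" ./stream
done
```

The gap between each run and the baseline is the recording overhead at that ratio; expect it to grow sharply as the period shrinks on bandwidth-saturated codes.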

Do you want to trace every load/store instruction, or do you only want to record the requests that missed all (or some) caches and were sent to the main RAM of the PC (or to L3)?

Why do you want no overhead and all memory accesses recorded? That is simply impossible, because every memory access generates several bytes of trace data (the memory address, and sometimes the instruction address) that must be written to the same memory. So having memory tracing enabled (at more than 10% of memory accesses) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticed, but its effect (overhead) is smaller.

Your CPU, an E5-2620 v4, is Broadwell-EP (14 nm), so it may also have an earlier variant of Intel PT (Processor Trace):

  • https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing
  • https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt
  • https://github.com/01org/processor-trace
  • and especially Andi Kleen's blog on PT, "Cheat sheet for Intel Processor Trace with Linux perf and gdb": http://halobates.de/blog/p/410

PT support in hardware: Broadwell (5th generation Core, Xeon v4). More overhead. No fine-grained timing.
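A sketch of a PT session with perf follows, with the caveat that PT records control flow (branches), not data addresses, so it gives you every instruction executed but not the load/store target addresses directly (`./a.out` is again a placeholder):

```shell
# Record a full control-flow trace with Intel PT (no sampling of the
# control flow itself; the trace buffer can grow very large):
perf record -e intel_pt// ./a.out

# Decode the trace into instruction-level samples; the i-period in
# --itrace controls decoded density (i0ns ~ every instruction):
perf script --itrace=i0ns -F ip,sym | head
```

Decoding is much slower than recording, and on a long SPEC run the raw PT data alone can reach many gigabytes, which is the same trade-off discussed above.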

PS: Scholars who study SPEC CPU memory accesses have worked with memory access dumps/traces, and those dumps were generated slowly:

Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code causes an application to spend extra time in executing the original application ... Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes at the lack of some fidelity

Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path ...

User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS.

osgx