I found mtrace by Dr. Clements. Although it is useful, it does not work in the situation I need. I intend to use its trace log to understand memory access patterns in different scenarios.

Can someone share related experience? Any suggestion will be appreciated.

Update (03/13): I'm trying to use qemu-mtrace to boot Ubuntu 16.04 with linux-mtrace (3.8.0), but it only shows several error messages and terminates. I hope some tool is able to log every access.

$ ./qemu-system-x86_64 -mtrace-enable -mtrace-file mtrace.out -hda ubuntu.img -m 1024
Error: mtrace_entry_ascope (exit, syscall:xx) with no stack tag!
mtrace_entry_register: mtrace_host_addr failed (10)
mtrace_inst_exec: bad call 140734947607728
Aborted (core dumped)
  • What is your CPU arch, is it x86? Please give more details about your scenario and explain why qemu-mtrace doesn't work for you. Do you want to log *every* memory access from the user program, or kernel accesses too? (You can't: there is not enough memory for logs covering more than a few seconds; you would need a $5000+ external proprietary JTAG debugger / DRAM logger.) Or is logging 1/100000 of all accesses enough to see patterns? In that case, check the `perf mem` tool http://man7.org/linux/man-pages/man1/perf-mem.1.html (supported on x86/x64 Intels). – osgx Mar 12 '17 at 15:25
  • Thanks for your reply. The CPU arch is x86, and I want to log the memory footprint of the whole system in scenarios such as playing streaming video, browsing the internet, etc. qemu-mtrace doesn't work with the mtrace kernel (linux-mtrace) on Ubuntu 16.04; it only shows the error messages and terminates, as in the question update. `perf mem` looks useful, but it seems it cannot log system-wide? – Kaniel Venson Mar 13 '17 at 11:29
  • You can't trace all memory accesses of the tracer when the tracer runs inside the system under tracing. mtrace needs a special kernel and a special mode of compilation. Do you need the whole log or only some statistics? There are the `-a` option of `perf mem` and `-C cpulist` for system-wide or per-CPU-core mem tracing (it can still be useful to recompile everything with frame pointers, or at least some debug information, for unwinding). – osgx Mar 13 '17 at 14:26
  • Thank you so much for helping out; I used `perf mem record -a` to achieve the effect I want. And, as you said, the log dump from perf is very large. Thanks again :) – Kaniel Venson Mar 14 '17 at 03:50

1 Answer

There is a perf mem tool implemented for some modern x86/EM64T CPUs (probably Intel-only; Ivy Bridge and newer desktop/server CPUs). The man page of perf mem is http://man7.org/linux/man-pages/man1/perf-mem.1.html, and the same text is in the kernel docs dir: http://lxr.free-electrons.com/source/tools/perf/Documentation/perf-mem.txt. The text is incomplete; the best docs are the sources: tools/perf/builtin-mem.c and, partially, tools/perf/builtin-report.c. There are no details in https://perf.wiki.kernel.org/index.php/Tutorial.

Unlike qemu-mtrace, it will not log every memory access, but only every Nth access, where N is on the order of 10000 or 100000. In exchange, it runs at native speed with low overhead. Use perf mem record ./program to record a program's access pattern; add -a or -C cpulist for system-wide sampling or sampling on specific CPU cores (see the sketch below). There is no way to log (trace) each and every memory access from inside the system (the tool would have to write the trace into memory and would then log that very write, which is infinite recursion with finite memory), but there are very costly, proprietary, system-specific external tracing solutions such as JTAG debuggers or SDRAM sniffers ($5k or more).
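
A minimal session, as a sketch (perf mem record, -a, and perf mem report are the subcommands/options mentioned above; ./program and the sleep duration are placeholders):

$ perf mem record ./program          # sample the loads/stores of one program
$ perf mem record -a -- sleep 10     # system-wide sampling for ~10 seconds
$ perf mem report                    # inspect the recorded samples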

The perf mem tools were added around 2013 (version 3.10 of the Linux kernel); searching for perf mem on LWN gives several results, e.g. https://lwn.net/Articles/531766/:

With this patch, it is possible to sample (not trace) memory accesses (load, store). For loads, the instruction and data addresses are captured along with the latency and data source. For stores, the instruction and data addresses are captured along with limited cache and TLB information.

The current patches implement the feature on Intel processors starting with Nehalem. The patches leverage the PEBS Load Latency and Precise Store mechanisms. Precise Store is present only on Sandy Bridge and Ivy Bridge based processors.

Physical address sampling support was added later: https://lwn.net/Articles/555890/ (perf mem --phys-addr -t load rec). There is also a somewhat related perf tool from 2016, c2c, "to track down cacheline contention": https://lwn.net/Articles/704125/, with examples at https://joemario.github.io/blog/2016/09/01/c2c-blog/ (a sketch follows below).
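
As a sketch of the c2c workflow from the linked posts (perf c2c record and perf c2c report are the subcommands they describe; the sleep duration is a placeholder):

$ perf c2c record -a -- sleep 10   # sample loads/stores system-wide
$ perf c2c report                  # summarize contended cachelines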

Some info on decoding the output of perf mem -D report:

 # PID, TID, IP, ADDR, LOCAL WEIGHT, DSRC, SYMBOL
 2054  2054 0xffffffff811186bf 0x016ffffe8fbffc804b0    49 0x68100842 /lib/modules/3.12.23/build/vmlinux:perf_event_aux_ctx

What do "ADDR", "DSRC", and "SYMBOL" mean?

(answered by the same user as in this answer)

  • IP - the PC of the load/store instruction;
  • SYMBOL - the name of the function containing this instruction (the IP);
  • ADDR - the virtual memory address of the data requested by the load/store (when the --phys-data option is not used);
  • DSRC - "Decoded Source".

There is also sorting, to get some basic stats: perf mem rep --sort=mem (see http://thread.gmane.org/gmane.linux.kernel.perf.user/1438).
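
As a hedged example (the mem sort key comes from the thread above; symbol and dso are standard perf report sort keys):

$ perf mem report --sort=mem         # group samples by memory access level
$ perf mem report --sort=symbol,dso  # group samples by function and binary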

Other tools: there is the (slow) cachegrind tool, based on the valgrind emulator, for simulating cache memory for userspace programs; see "7.2 Simulating CPU Caches" of https://lwn.net/Articles/257209/ (a sketch follows below). There should also be something for low-level (slowest) models related to DRAMsim/DRAMsim2: http://eng.umd.edu/~blj/dramsim/
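
A minimal cachegrind run might look like this (standard valgrind usage; ./program is a placeholder, and <pid> stands for the process id that valgrind appends to its output file name):

$ valgrind --tool=cachegrind ./program   # simulate I1/D1/LL caches, print a miss summary
$ cg_annotate cachegrind.out.<pid>       # per-function cache statistics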
