Questions tagged [intel-pmu]

Questions related to the use of the Intel Performance Monitoring Unit (PMU), which provides hardware counters that measure the performance of currently executing code.

The Intel Performance Monitoring Unit (PMU) provides hardware performance counters that track performance-related events, such as cycles, retired instructions, and cache misses, for the currently executing code.

The counters are useful for profiling code and are supported by Intel's VTune, Linux's perf command, and the Windows Performance Toolkit.
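
On Linux, how much of perf an unprivileged user can use depends on the perf_event_paranoid sysctl, which one of the questions below asks about. The following is a minimal sketch, assuming the level meanings given in the kernel's sysctl documentation; the helper names `restrictions_for` and `read_paranoid_level` are hypothetical and not part of any tool.

```python
from pathlib import Path

# Restrictions on unprivileged users, keyed by the lowest level at which
# each one starts to apply (wording follows the kernel sysctl documentation).
RESTRICTIONS = [
    (0, "no ftrace function tracepoint or raw tracepoint access"),
    (1, "no CPU event access"),
    (2, "no kernel profiling"),
]


def restrictions_for(level):
    """Return the restrictions that apply at a given perf_event_paranoid level."""
    return [text for threshold, text in RESTRICTIONS if level >= threshold]


def read_paranoid_level(path="/proc/sys/kernel/perf_event_paranoid"):
    """Read the current level, or return None if it is unavailable (e.g. not Linux)."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid is not available on this system")
    else:
        print(f"perf_event_paranoid = {level}")
        for line in restrictions_for(level):
            print(f"  unprivileged users: {line}")
```

A level of -1 yields an empty list, meaning (almost) all events are available to all users.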

The available counters and the details of how to program them vary by CPU microarchitecture; they are documented in Chapters 18 and 19 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3.
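
As a rough illustration of what "programming a counter" involves, the sketch below packs an event code and unit mask into the layout of an IA32_PERFEVTSELx register (event select in bits 0-7, unit mask in bits 8-15, USR/OS/EN flags in bits 16, 17 and 22, counter mask in bits 24-31) and also formats the same pair as a raw event string for perf. The function names are hypothetical, and the event codes in the usage note are the architectural ones; model-specific events must be looked up in the manual for the target CPU.

```python
# Bit positions of the flag fields in IA32_PERFEVTSELx.
USR_BIT = 16    # count while in user mode (CPL > 0)
OS_BIT = 17     # count while in kernel mode (CPL = 0)
EDGE_BIT = 18   # count rising edges instead of cycles/occurrences
EN_BIT = 22     # enable the counter
INV_BIT = 23    # invert the counter-mask comparison
CMASK_SHIFT = 24


def encode_perfevtsel(event, umask, usr=True, os=True, edge=False,
                      enable=True, invert=False, cmask=0):
    """Build the 32-bit value to write to an IA32_PERFEVTSELx MSR."""
    for name, field in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    value = event | (umask << 8) | (cmask << CMASK_SHIFT)
    value |= int(usr) << USR_BIT
    value |= int(os) << OS_BIT
    value |= int(edge) << EDGE_BIT
    value |= int(enable) << EN_BIT
    value |= int(invert) << INV_BIT
    return value


def perf_raw_event(event, umask):
    """Format an event/umask pair as a raw event for `perf stat -e`."""
    config = (umask << 8) | event
    return f"r{config:x}"


if __name__ == "__main__":
    # Architectural "LLC Misses" event: event select 0x2E, unit mask 0x41.
    print(hex(encode_perfevtsel(0x2E, 0x41)))
    print(perf_raw_event(0x2E, 0x41))
```

For example, the architectural "Instructions Retired" event (event 0xC0, unit mask 0x00) counted in both user and kernel mode encodes to 0x4300C0, and the corresponding raw perf event is rc0.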

Other libraries / tools for using the PMU include:

  • Likwid: Various performance-related tools, including a micro-benchmarking framework. Supports the Intel PMU, AMD performance counters, some ARM, POWER8/9, and some NVIDIA GPUs.

  • libpfc: A simple Linux kernel module and library that let user-space code program the counters and then read them with rdpmc directly from user space. Example usage in the author's answer to this SO question.

  • pmu-tools (https://github.com/andikleen/pmu-tools): Some wrappers around Linux perf. ocperf.py used to be more useful before perf itself gained symbolic names for more CPU-specific events, but the repository contains other tools as well.
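
Reading a counter from user space with rdpmc, as libpfc enables, requires putting a counter selector in ECX: a plain index selects a general-purpose counter, while setting bit 30 selects one of the fixed-function counters. The sketch below only computes and decodes that selector value; it does not execute rdpmc, and the helper names are hypothetical.

```python
FIXED_COUNTER_FLAG = 1 << 30  # bit 30 of ECX selects the fixed-function counters

# The three architectural fixed-function counters on Intel CPUs.
FIXED_COUNTERS = {
    0: "INST_RETIRED.ANY",
    1: "CPU_CLK_UNHALTED.THREAD",
    2: "CPU_CLK_UNHALTED.REF_TSC",
}


def rdpmc_selector(index, fixed=False):
    """Return the ECX value that makes rdpmc read the given counter."""
    if index < 0:
        raise ValueError("counter index must be non-negative")
    return (FIXED_COUNTER_FLAG | index) if fixed else index


def describe_selector(ecx):
    """Explain which counter a given rdpmc ECX value refers to."""
    index = ecx & (FIXED_COUNTER_FLAG - 1)
    if ecx & FIXED_COUNTER_FLAG:
        name = FIXED_COUNTERS.get(index, "unknown")
        return f"fixed-function counter {index} ({name})"
    return f"general-purpose counter {index}"


if __name__ == "__main__":
    for ecx in (0, 1, rdpmc_selector(0, fixed=True), rdpmc_selector(1, fixed=True)):
        print(f"ecx = {ecx:#x}: {describe_selector(ecx)}")
```

With this encoding, an ECX value of 1<<30 refers to fixed-function counter 0, which counts retired instructions.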

91 questions
31
votes
1 answer

What restriction is perf_event_paranoid == 1 actually putting on x86 perf?

Newer Linux kernels have a sysctl tunable /proc/sys/kernel/perf_event_paranoid which allows the user to adjust the available functionality of perf_events for non-root users, with higher numbers being more secure (offering correspondingly less…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
27
votes
0 answers

On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?

Consider the following simple code: #include #include #include #include #include int cpu_ms() { return (int)(clock() * 1000 / CLOCKS_PER_SEC); } int main(int argc, char** argv) { if (argc <…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
19
votes
2 answers

Haswell memory access

I was experimenting with AVX -AVX2 instruction sets to see the performance of streaming on consecutive arrays. So I have below example, where I do basic memory read and store. #include #include #include #include…
edorado
  • 275
  • 2
  • 10
14
votes
5 answers

Can the Intel performance monitor counters be used to measure memory bandwidth?

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
13
votes
2 answers

What is the overhead of using Intel Last Branch Record?

Last Branch Record refers to a collection of register pairs (MSRs) that store the source and destination addresses related to recently executed branches. http://css.csail.mit.edu/6.858/2012/readings/ia32/ia32-3b.pdf document has more information in…
user655617
  • 318
  • 3
  • 13
10
votes
1 answer

Hardware cache events and perf

When I run perf list I see a bunch of Hardware Cache Events, as follows: $ perf list | grep 'cache event' L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
10
votes
2 answers

Reliability of Xcode Instrument's disassembly time profiling

I've profiled my code using Instrument's time profiler, and zooming in to the disassembly, here's a snippet of its results: I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction to take virtually nothing. This causes…
yairchu
  • 23,680
  • 7
  • 69
  • 109
9
votes
2 answers

Why does the number of uops per iteration increase with the stride of streaming loads?

Consider the following loop: .loop: add rsi, OFFSET mov eax, dword [rsi] dec ebp jg .loop where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the…
Hadi Brais
  • 22,259
  • 3
  • 54
  • 95
9
votes
1 answer

Can the LSD issue uOPs from the next iteration of the detected loop?

I was investigating the capabilities of the branch unit on port 0 of my Haswell starting with a very simple loop: BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 10000000 .loop: dec ecx ;| jz .end ;| 1…
Margaret Bloom
  • 41,768
  • 5
  • 78
  • 124
7
votes
2 answers

Perf stat equivalent for Mac OS?

Is there a perf stat equivalent on Mac OS? I would like to do the same thing for a CLI command and googling is not yielding anything.
stk1234
  • 1,036
  • 3
  • 12
  • 29
7
votes
2 answers

rdpmc: surprising behavior

I'm trying to understand the rdpmc instruction. As such I have the following asm code: segment .text global _start _start: xor eax, eax mov ebx, 10 .loop: dec ebx jnz .loop mov ecx, 1<<30 ; calling rdpmc with ecx = (1<<30)…
user14717
  • 4,757
  • 2
  • 44
  • 68
7
votes
0 answers

Why does Linux perf use event l1d.replacement for "L1 dcache misses" on x86?

On Intel x86, Linux uses the event l1d.replacement to implement its L1-dcache-load-misses event. This event is defined as follows: Counts L1D data line replacements including opportunistic replacements, and replacements that require…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
6
votes
2 answers

How does one enable Intel Processor Tracing (IPT) in a virtualized environment?

I am attempting to run Alex Ionescu's WinIPT interface in a virtual machine, and having no success. (This is a Windows 10 Pro host running a Windows 10 VM and both are the 18363 update) I have successfully built and run Intel's driver as well as…
echosys
  • 61
  • 4
6
votes
1 answer

Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

Summary Consider the following loop: loop: movl $0x1,(%rax) add $0x40,%rax cmp %rdx,%rax jne loop where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to…
Hadi Brais
  • 22,259
  • 3
  • 54
  • 95
6
votes
2 answers

Can we measure successful store-forwarding with Intel's performance counters?

Is it possible to measure the number of successful store-forwarding operations using the performance counters on recent Intel x86 chips? I see events for ld_blocks.store_forward which measure failed store-forwarding, but it's not clear to me if the…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386