Questions tagged [intel-pmu]

Questions related to the use of the Intel Performance Monitoring Unit (PMU), which provides hardware performance counters that track performance-related metrics for the currently executing code.

The counters are useful for profiling code and are supported by Intel's VTune, Linux's perf command, and the Windows Performance Toolkit.

The available counters and how to program them vary by CPU microarchitecture. The details are in Chapters 18 and 19 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3.

Other libraries / tools for using the PMU include:

Likwid: Various performance-related tools, including a micro-benchmarking framework. Supports the Intel PMU, AMD performance counters, some ARM, POWER8/9, and some NVIDIA GPUs.
libpfc: A simple Linux kernel module and library that lets user-space code program the counters and then read them with rdpmc. Example usage in the author's answer to this SO question.
pmu-tools (https://github.com/andikleen/pmu-tools): Some wrappers around Linux perf. ocperf.py was more useful before perf itself gained symbolic event names for more CPU-specific events, but the repo contains other tools as well.
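
Tools built on Linux perf are subject to the kernel's perf_event_paranoid setting, which comes up in several questions under this tag. The following is a minimal Python sketch that reads the setting and summarizes what an unprivileged user can do. It assumes the level meanings described in the mainline kernel sysctl documentation. The function names are hypothetical, and distributions that patch in extra levels above 2 are treated here the same as level 2.

```python
from pathlib import Path
from typing import Optional

# Standard location of the tunable on Linux.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the current level, or None if it cannot be read or parsed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_paranoid_level(level: int) -> str:
    """Summarize the restrictions for users without CAP_PERFMON.

    Assumption: meanings follow the mainline kernel documentation;
    distribution-specific levels above 2 are not distinguished.
    """
    if level <= -1:
        return "almost all events are available to all users"
    if level == 0:
        return "raw tracepoint access is disallowed; CPU-wide events are allowed"
    if level == 1:
        return "CPU-wide event access is disallowed; per-process kernel profiling is allowed"
    return "kernel profiling is disallowed; only user-space measurement is allowed"


if __name__ == "__main__":
    current = read_paranoid_level()
    if current is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print(f"perf_event_paranoid = {current}: {describe_paranoid_level(current)}")
```

Keeping the description logic separate from the file read means it can be checked on any platform, including ones without /proc.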

91 questions

1 answer

What restriction is perf_event_paranoid == 1 actually putting on x86 perf?

Newer Linux kernels have a sysctl tunable /proc/sys/kernel/perf_event_paranoid which allows the user to adjust the available functionality of perf_events for non-root users, with higher numbers being more secure (offering correspondingly less…

asked Aug 18 '18 at 18:08

BeeOnRope

0 answers

On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?

Consider the following simple code: #include #include #include #include #include int cpu_ms() { return (int)(clock() * 1000 / CLOCKS_PER_SEC); } int main(int argc, char** argv) { if (argc <…

performance x86 cpu-cache perf intel-pmu

asked Sep 29 '18 at 05:09

BeeOnRope

2 answers

Haswell memory access

I was experimenting with the AVX and AVX2 instruction sets to see the performance of streaming on consecutive arrays. So I have the example below, where I do a basic memory read and store. #include #include #include #include…

performance x86 cpu-architecture avx2 intel-pmu

asked Oct 27 '13 at 18:08

edorado

5 answers

Can the Intel performance monitor counters be used to measure memory bandwidth?

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).

performance x86 intel-pmu memory-bandwidth

asked Dec 02 '17 at 21:37

BeeOnRope

2 answers

What is the overhead of using Intel Last Branch Record?

Last Branch Record refers to a collection of register pairs (MSRs) that store the source and destination addresses of recently executed branches. The document at http://css.csail.mit.edu/6.858/2012/readings/ia32/ia32-3b.pdf has more information in…

x86 intel trace branch-prediction intel-pmu

asked Feb 03 '13 at 08:07

user655617

1 answer

Hardware cache events and perf

When I run perf list I see a bunch of Hardware Cache Events, as follows: $ perf list | grep 'cache event' L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware…

linux performance x86 perf intel-pmu

asked Sep 04 '18 at 16:58

BeeOnRope

2 answers

Reliability of Xcode Instrument's disassembly time profiling

I've profiled my code using Instruments' Time Profiler, and zooming in to the disassembly, here's a snippet of its results: I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction takes virtually nothing. This causes…

xcode x86 profiling instruments intel-pmu

asked Jan 21 '18 at 16:58

yairchu

2 answers

Why does the number of uops per iteration increase with the stride of streaming loads?

Consider the following loop: .loop: add rsi, OFFSET mov eax, dword [rsi] dec ebp jg .loop where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the…

assembly x86 cpu-architecture intel-pmu

asked Sep 26 '18 at 23:25

Hadi Brais

1 answer

Can the LSD issue uOPs from the next iteration of the detected loop?

I was investigating the capabilities of the branch unit on port 0 of my Haswell starting with a very simple loop: BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 10000000 .loop: dec ecx ;| jz .end ;| 1…

assembly x86 cpu-architecture intel-pmu

asked Aug 28 '18 at 09:32

Margaret Bloom

2 answers

Perf stat equivalent for Mac OS?

Is there a perf stat equivalent on Mac OS? I would like to do the same thing for a CLI command and googling is not yielding anything.

macos profiling performancecounter perf intel-pmu

asked Apr 06 '20 at 21:15

stk1234

2 answers

rdpmc: surprising behavior

I'm trying to understand the rdpmc instruction. As such I have the following asm code: segment .text global _start _start: xor eax, eax mov ebx, 10 .loop: dec ebx jnz .loop mov ecx, 1<<30 ; calling rdpmc with ecx = (1<<30)…

performance assembly x86 performancecounter intel-pmu

asked May 17 '19 at 19:43

user14717

0 answers

Why does Linux perf use event l1d.replacement for "L1 dcache misses" on x86?

On Intel x86, Linux uses the event l1d.replacement to implement its L1-dcache-load-misses event. This event is defined as follows: Counts L1D data line replacements including opportunistic replacements, and replacements that require…

linux x86 profiling perf intel-pmu

asked Sep 04 '18 at 20:20

BeeOnRope

2 answers

How does one enable Intel Processor Tracing (IPT) in a virtualized environment?

I am attempting to run Alex Ionescu's WinIPT interface in a virtual machine, and having no success. (This is a Windows 10 Pro host running a Windows 10 VM and both are the 18363 update) I have successfully built and run Intel's driver as well as…

kernel intel trace virtualization intel-pmu

asked Feb 07 '20 at 21:24

echosys

1 answer

Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

Summary Consider the following loop: loop: movl $0x1,(%rax) add $0x40,%rax cmp %rdx,%rax jne loop where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to…

x86 intel performancecounter cpu-cache intel-pmu

asked Mar 05 '19 at 02:59

Hadi Brais

2 answers

Can we measure successful store-forwarding with Intel's performance counters?

Is it possible to measure the number of successful store-forwarding operations using the performance counters on recent Intel x86 chips? I see events for ld_blocks.store_forward which measure failed store-forwarding, but it's not clear to me if the…

performance x86 intel-pmu

asked Sep 09 '17 at 22:54

BeeOnRope
