PAPI (Performance Application Programming Interface) provides the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. PAPI enables software engineers to see, in near real time, the relation between software performance and processor events.
Questions tagged [papi]
86 questions
17
votes
2 answers
Increased number of cache misses when vectorizing code
I vectorized the dot product between 2 vectors with SSE 4.2 and AVX 2, as you can see below. The code was compiled with GCC 4.8.4 with the -O2 optimization flag. As expected the performance got better with both (and AVX 2 faster than SSE 4.2), but…

fc67
- 409
- 5
- 17
15
votes
1 answer
Why do perf and PAPI give different values for L3 cache references and misses?
I am working on a project where we have to implement an algorithm that is proven in theory to be cache friendly. In simple terms, if N is the input and B is the number of elements that get transferred between the cache and the RAM every time we have…

jsguy
- 2,069
- 1
- 25
- 36
13
votes
1 answer
papi_avail: No events available
I want to get into PAPI. I have Version 5.3.2.0 on Debian GNU/Linux. papi_avail just tells me that no hardware events are available:
$ papi_avail
Available events and hardware…

Monkey Supersonic
- 1,165
- 1
- 10
- 19
9
votes
1 answer
How to measure overall performance of parallel programs (with papi)
I asked myself what would be the best way to measure the performance (in flops) of a parallel program. I read about papi_flops. This seems to work fine for a serial program. But I don't know how I can measure the overall performance of a parallel…

Sebastian
- 153
- 1
- 8
8
votes
2 answers
Using rdmsr/rdpmc for branch prediction accuracy
I am trying to understand how a branch prediction unit works in a CPU.
I have used PAPI and also Linux's perf_events, but neither gives accurate results (for my case).
This is my code:
void func(int* arr, int sequence_len){
for(int i…
user12527223
7
votes
1 answer
Why does _mm_mfence() produce counts for the ALL_LOADS perf event?
I am testing the behavior of some intrinsics. I was surprised to notice that _mm_mfence() issues a load instruction from user space, yet the load is not counted as an L1 data cache miss, hit, or fill buffer hit. I am using PAPI's native events…

Ana Khorguani
- 896
- 4
- 18
7
votes
1 answer
Find out how many hardware performance counters a CPU has
On an Intel or AMD x86-64 system running Linux, where/how can I find out the number of hardware performance counters that my CPU has?
I would like to use the Linux perf tool to gather hardware performance counter data while executing some…

pjhsea
- 841
- 9
- 13
6
votes
1 answer
PAPI: what does Clock reference cycles mean?
I am using the PAPI library to tune and profile my application.
What does PAPI_REF_CYC (Reference clock cycles) actually mean?
Thanks in advance,

abdul
- 75
- 5
5
votes
0 answers
Measure L1 data cache miss with perf and papi
What is the difference between PAPI_L1_LDM in papi and L1-dcache-load-misses in perf?
I've used the same setting, like this post here.
So, as a result I get for papi:
PAPI_L1_DCM: 515 <- L1 data cache miss (probably L1D_READ_MISSES_ALL +…

boraas
- 929
- 1
- 10
- 24
4
votes
1 answer
How to use PAPI periodically for performance measurements
I want to analyze the system's performance for my application using the PAPI API in C. The general structure is:
-- Initialize PAPI
-- Initialize counters of interest
-- start counters
-- run main logic of the application
-- end counters…

marc
- 949
- 14
- 33
4
votes
1 answer
perf_event for multiple threads in a process
I'm trying to profile multiple threads within a given process using perf. With the code below, it appears that even though the pid argument to perf_event_open is 0 (which should result in profiling of the process as a whole?), the HW counter…

user3882729
- 1,339
- 8
- 11
4
votes
1 answer
counting L1 cache misses with PAPI_read_counters gives unexpected results
I am trying to use the PAPI library to count cache misses. A cache hit performance counter is not available on my hardware, which is why I am trying to determine cache hits from the absence of cache misses. I am trying a few things. The first version of my code is this:
…

Ana Khorguani
- 896
- 4
- 18
4
votes
3 answers
How can we know the exact number of hardware performance counters built into a CPU?
After doing some reading on hardware performance counters, I can claim that all Intel processors support them. So, in order to access these additional hardware registers, i.e. hardware performance…

M.Mrd
- 41
- 2
4
votes
0 answers
How to monitor the utilization of cores on Xeon Phi at 10Hz?
I've been trying to measure/monitor the utilization of all 60 cores on a Xeon Phi (Knights Corner, in-order processors) at a relatively high frequency, say, at least every 0.1 s, which corresponds to 10 Hz.
I tried the latest PAPI library. But it only…

thierry
- 217
- 2
- 12
4
votes
1 answer
Justify Memory Access in a cycle
I have the following function:
void ikj(float (*a)[N], float (*b)[N], float (*c)[N], int n) {
    int i, j, k;
    float r;
    papi_start();
    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {
            r = a[i][k];
            …

José Ricardo Ribeiro
- 709
- 9
- 23