3

I'm profiling a toy program (selection sort) using perf and I wonder what iterations correspond to in the perf report output. The addresses it show correspond to the inner loop and if statement. I hope somebody can help. Also, the basic block cycles column disappear when I use " -b --branch-history" with perf. I don't know why.

This is the portion of my code getting sampled (MAX_LENGTH is 500):

   35 // FROM: https://www.geeksforgeeks.org/selection-sort
   37 void swap(int *xp, int *yp)
   38 {
   39     int temp = *xp;
   40     *xp = *yp;
   41     *yp = temp;
   42 }
   43       
   44 void selection_sort(int arr[])
   45 {
   46     int i, j, min_idx;
   47
   48     // One by one move boundary of unsorted subarray
   49     for (i = 0; i < MAX_LENGTH-1; i++)
   50     {
   51         // Find the minimum element in unsorted array
   52         min_idx = i;
   53         for (j = i+1; j < MAX_LENGTH; j++)
   54           if (arr[j] < arr[min_idx])
   55             min_idx = j;
   56
   57         // Swap the found minimum element with the first element
   58         swap(&arr[min_idx], &arr[i]);
   59     }
   60 }

compiled using (clang version 5.0.0):

clang -O0 -g selection_sort.c -o selection_sort_g_O0

Here's how I invoke perf record:

sudo perf record -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=1009/pp -b -g ./selection_sort_g_O0

perf report and its output:

sudo perf report -b --branch-history --no-children

Samples: 376  of event 'br_inst_retired_near_taken', Event count (approx.): 37603384
  Overhead  Source:Line               Symbol                  Shared Object                                                                                                                                         ▒
+   51.86%  selection_sort_g_O0[862]  [.] 0x0000000000000862  selection_sort_g_O0                                                                                                                                   ▒
-   24.47%  selection_sort_g_O0[86e]  [.] 0x000000000000086e  selection_sort_g_O0                                                                                                                                   ▒
     0x873 (cycles:1)                                                                                                                                                                                               ▒
   - 0x86e (cycles:1)                                                                                                                                                                                               ▒
      - 23.94% 0x86e (cycles:3 iterations:25)                                                                                                                                                                       ▒
           0x862 (cycles:3)                                                                                                                                                                                         ▒
           0x83f (cycles:1)                                                                                                                                                                                         ▒
           0x87c (cycles:1)                                                                                                                                                                                         ▒
           0x873 (cycles:1)                                                                                                                                                                                         ▒
           0x86e (cycles:1)                                                                                                                                                                                         ▒
           0x86e (cycles:3)                                                                                                                                                                                         ▒
           0x862 (cycles:3)                                                                                                                                                                                         ▒
           0x83f (cycles:1)                                                                                                                                                                                         ▒
           0x87c (cycles:1)                                                                                                                                                                                         ▒
           0x873 (cycles:1)                                                                                                                                                                                         ▒
           0x86e (cycles:1)                                                                                                                                                                                         ▒
           0x86e (cycles:3)                                                                                                                                                                                         ▒
           0x862 (cycles:3)                                                                                                                                                                                         ▒
+   22.61%  selection_sort_g_O0[87c]  [.] 0x000000000000087c  selection_sort_g_O0                                                                                                                                   ▒
+    1.06%  selection_sort_g_O0[8a5]  [.] 0x00000000000008a5  selection_sort_g_O0

I used objdump for a mapping between addresses and source file lines:

objdump -Dleg selection_sort_g_O0 > selection_sort_g_O0.s

../selection_sort.c:53
 836:   8b 45 f4                mov    -0xc(%rbp),%eax
 839:   83 c0 01                add    $0x1,%eax
 83c:   89 45 f0                mov    %eax,-0x10(%rbp)
 83f:   81 7d f0 f4 01 00 00    cmpl   $0x1f4,-0x10(%rbp)
 846:   0f 8d 35 00 00 00       jge    881 <selection_sort+0x71>
../selection_sort.c:54
 84c:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 850:   48 63 4d f0             movslq -0x10(%rbp),%rcx
 854:   8b 14 88                mov    (%rax,%rcx,4),%edx
 857:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 85b:   48 63 4d ec             movslq -0x14(%rbp),%rcx
 85f:   3b 14 88                cmp    (%rax,%rcx,4),%edx
 862:   0f 8d 06 00 00 00       jge    86e <selection_sort+0x5e>
../selection_sort.c:55
 868:   8b 45 f0                mov    -0x10(%rbp),%eax
 86b:   89 45 ec                mov    %eax,-0x14(%rbp)
../selection_sort.c:54
 86e:   e9 00 00 00 00          jmpq   873 <selection_sort+0x63>
../selection_sort.c:53
 873:   8b 45 f0                mov    -0x10(%rbp),%eax
 876:   83 c0 01                add    $0x1,%eax
 879:   89 45 f0                mov    %eax,-0x10(%rbp)
 87c:   e9 be ff ff ff          jmpq   83f <selection_sort+0x2f>
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
drunk teapot
  • 303
  • 2
  • 13
  • 1
    Do you actually need `sudo` for this? I forget what sysctl `kernel.perf_even_paranoid = 0` actually does, but I normally don't have to use `sudo` to profile stuff. – Peter Cordes Aug 01 '18 at 20:18
  • What input are you providing to `selection_sort`? What is your CPU microarchitecture, perf version, kernel version? I'm not able to reproduce the same output. – Hadi Brais Aug 01 '18 at 21:29
  • I do need sudo. I haven't paid attention to "why" but my system is archlinux 4.13,12. perf version is 4.13 (looks like I need an update). machine is a zenbook pro: i7-6700HQ CPU @ 2.60GHz, 8 cores – drunk teapot Aug 01 '18 at 21:58
  • as input I generated random 500 numbers and put it into and array. the code does NOT use rand(), I generated the numbers simply by bash $RANDOM and put them into the code: an array with an initialization list. I tried to keep everything deterministic so that the profiling info might make more sense – drunk teapot Aug 01 '18 at 22:05

2 Answers2

2

I will try to reiterate and add some more information on top of Zulan's answer.

Last Branch Records (LBRs) allow finding the hot execution paths in an executable to examine them directly for optimization opportunities. In perf this is implemented by extending the call-stack display mechanism and adding the last basic blocks into the call stack, which is normally used to display the most common hierarchy of function calls.

This can be done by using the call graph (-g) and LBR (-b) options in perf record and the --branch-history option in perf report, which adds the last branch information to the call graph. Essentially it gives 8-32 branches extra context of why something happened.

The timed LBR feature in recent perf versions reports the average number of cycles per basic blocks.

What is Iterations ?

From what I can understand, the branch history code has a loop detection function. This allows us to obtain the number of iterations by calculating the number of removed loops. The removal of repeated loops was introduced only in perf report output (to display it in a histogram format) via a previous commit in the Linux kernel.

struct iterations is a useful C struct to be used to display the number of iterations in perf report.

This is where the number of iterations are being saved to be displayed in your perf report output. The save_iterations function is being called from inside the remove_loops function.

The loops are being removed at the time of resolving the callchain.

You can also read this commit which describes how the perf report displays the number of iterations and changes that have been introduced in the newer Linux kernel versions.

Arnabjyoti Kalita
  • 2,325
  • 1
  • 18
  • 31
  • Thank you! So, assuming that I managed to to get the period that I asked for, how can I estimate the trip count of my loop? I was seeing 25 for a period of 1009 and this loop actually executes for (499*500)/2 = 124750. I'm aware that I'm only sampling. I'll try with a loop having a constant trip count throughout its execution. – drunk teapot Aug 09 '18 at 04:15
  • I am not quite sure what "loop trip count" means. But what I can say is you are only sampling over a period of 1009. You generate a sample every 1009th event - so in effect you are not counting. However, using the branch history does allow you to see the number of cycles at a much lower granularity. – Arnabjyoti Kalita Aug 09 '18 at 10:07
  • ok, so my goal was to estimate the total number of iterations (a.k.a loop trip count) that a loop took using the sampling period and iterations reported by perf. – drunk teapot Aug 14 '18 at 16:46
1

Take this answer with a grain of salt as I pieced together information from quickly glancing at the perf source code and commits.

perf report --branch-history tries to build a control flow of the program based on the branch records in the trace. When doing so it also detects loops. However, the information from perf record is probably incomplete - so the loop detection will also be erroneous.

If your system is anything like mine (Haswell desktop, Linux 4.17.6), perf will probably record the LBR for each sample taken. If this LBR contains the 16 most recent branches, you would have to decrease the event period to 16 retired branches. Even with convincing the kernel not to throttle recording, I was not able to achieve any meaningful results.

Unfortunately I am unaware of a way to record a full branch trace with perf.

Zulan
  • 21,896
  • 6
  • 49
  • 109
  • 1
    thanks for the info. However, I didn't quite understand "If this LBR contains the 16 most recent branches, you would have to decrease the event period to 16 retired branches". About data collected being erroneous, I agree. First of all we're sampling and second, LBR is known to be erroneous: https://ieeexplore.ieee.org/document/8366953/. I honestly, couldn't figure out more to understand the perf report data that I'm looking at – drunk teapot Aug 01 '18 at 22:13
  • 1
    My point is that if you can collect the 16 most recent branches for each sample, you must take a sample every 16 branches. Take a look at `perf script -D` to see the raw data, that might give some insight as to what's behind the curtain. – Zulan Aug 01 '18 at 22:15
  • ok, I played with the period: 499, 277, 97 and 31. I still see iterations as 25, 24 around the same addresses. I was expecting it to change. Maybe I should I conclude that these "iterations" have nothing with the loop iterations. – drunk teapot Aug 02 '18 at 15:48
  • Check your `dmesg` for `perf: interrupt took too long`. If you find it, you likely didn't get the period you requested. – Zulan Aug 02 '18 at 18:39
  • thank you for the suggestion, I watched my dmesg and you were right. For the example I posted, I didn't observed this but for my other toy examples which I stressed on loop trip counts I did. So, I couldn't confirm the sanity of my experiments – drunk teapot Aug 09 '18 at 03:56