
I am working on a project that requires profiling the target applications first.

What I want to know is the exact time consumed by a loop body/function. The platform is a BeagleBone Black board running Debian, with perf_4.9 installed.

gettimeofday() can provide microsecond resolution, but I want more accurate results. It looks like perf can give cycle statistics and thus be a good fit for my purpose. However, perf seems to only analyze the whole application rather than individual loops/functions.

I tried the instructions posted in Using perf probe to monitor performance stats during a particular function, but it did not work well.

I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!

GeekTao
  • Use `perf record` / `perf report` instead of `perf stat`. For very short functions, "elapsed time" isn't meaningful because out-of-order exec means the time actually overlaps with the surrounding code. (And for larger functions, it's a bit fuzzy at the edges for the same reason, so words like "exact" aren't appropriate. You can get a high-resolution timestamp before/after the function, of course, and that will be meaningful if your function is more than a few hundred instructions executed. e.g. Skylake ROB size is 224 uops.) – Peter Cordes Jun 29 '21 at 20:02
  • Instead of `gettimeofday`, the cool kids use `clock_gettime(CLOCK_MONOTONIC, ...)`. It provides nanosecond resolution. I've used that with great success for measuring performance [and tuning] of mission-critical realtime commercial apps/systems: single process, multiprocess, and/or multithread. – Craig Estey Jun 29 '21 at 20:02
  • @PeterCordes Thanks for the suggestions. Do you have any example of perf record/report? I searched the web a bit but did not find an example use case for my purpose, i.e., measuring the time of a function. Some people use perf probe, but the example does not seem to work on the BeagleBone Black board or this Debian distribution, as I saw errors with the exact instructions they provided... – GeekTao Jul 01 '21 at 16:30
  • @CraigEstey Thanks a lot for suggesting a new function. I tried it but I'm a little unsure about its accuracy. Yes, it offers nanosecond resolution, but there is an inconsistency. First I use clock_gettime() to measure a loop body; the approximate elapsed time measured this way is around 1.4us. Then I put GPIO instructions, pulling the pin high and low at the beginning and end of the loop body respectively, and measure the signal frequency on that GPIO with an oscilloscope. To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao Jul 01 '21 at 16:32
  • `perf record ./a.out` uses the default event, `cycles`. `perf report` uses the default filename that record used, and is an interactive text-UI. See also https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report – Peter Cordes Jul 01 '21 at 16:33
  • If your function doesn't contain any branches (and maybe even if it does), you could use LBR to get exact cycle timings as described in this article: https://easyperf.net/blog/2019/04/03/Precise-timing-of-machine-code-with-Linux-perf. But there are other methods too. – nnnmmm Jul 01 '21 at 17:53

1 Answer


Caveat: This is more of a comment than an answer, but it's a bit too long for just a comment.

Thanks a lot for suggesting a new function. I tried it but I'm a little unsure about its accuracy. Yes, it offers nanosecond resolution, but there is an inconsistency.

There will be inconsistency if you use two different clock sources.

First I use clock_gettime() to measure a loop body; the approximate elapsed time measured this way is around 1.4us. Then I put GPIO instructions, pulling the pin high and low at the beginning and end of the loop body respectively, and measure the signal frequency on that GPIO with an oscilloscope.

A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.

In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers.

For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.

To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao

The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.

In tuning, one is less concerned with absolute values than relative ones. That is, clock_gettime is ultimately based on the number of high-res clock ticks (at 1 ns/tick or better, from the system's free-running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get a duration of X, then change some code and get X+n, that tells you whether the code got faster or slower.
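As a rough illustration (a minimal sketch of mine, not taken from your code), a relative measurement with `clock_gettime(CLOCK_MONOTONIC)` can look like this; the loop body and iteration count are placeholders for whatever you actually want to time:

```c
/* Sketch: time a loop body with clock_gettime(CLOCK_MONOTONIC).
 * The loop body and iteration count below are placeholders. */
#include <stdio.h>
#include <time.h>

static inline double ts_to_ns(const struct timespec *ts)
{
    return ts->tv_sec * 1e9 + ts->tv_nsec;
}

int main(void)
{
    struct timespec start, end;
    volatile unsigned long sink = 0;      /* keep the loop from being optimized away */
    const long iterations = 1000000;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        sink += i;                        /* stand-in for the real loop body */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double total_ns = ts_to_ns(&end) - ts_to_ns(&start);
    printf("total: %.0f ns, per iteration: %.1f ns\n",
           total_ns, total_ns / iterations);
    return 0;
}
```

Run it before and after a code change and compare the per-iteration numbers; the relative difference is what matters.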

500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
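If you do want GPIO toggling without per-toggle syscall overhead on the BeagleBone, one option is to map the GPIO controller registers into your process via `/dev/mem`. A rough sketch follows; the GPIO1 base address (0x4804C000) and the SETDATAOUT/CLEARDATAOUT offsets (0x194/0x190) are my assumption for the AM335x, the pin choice is arbitrary, and it assumes the pin is already muxed/configured as an output and that you run as root. Verify all of that against your SoC's TRM.

```c
/* Sketch: toggle a GPIO pin via memory-mapped registers (no syscall per toggle).
 * ASSUMPTIONS: AM335x GPIO1 bank at 0x4804C000, SETDATAOUT/CLEARDATAOUT at
 * 0x194/0x190, pin already configured as output -- check the TRM. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPIO1_BASE        0x4804C000UL   /* assumed AM335x GPIO1 base */
#define GPIO_CLEARDATAOUT 0x190
#define GPIO_SETDATAOUT   0x194
#define PIN_MASK          (1u << 28)     /* example: GPIO1_28 -- pick your own pin */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *gpio = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, GPIO1_BASE);
    if (gpio == MAP_FAILED) { perror("mmap"); return 1; }

    gpio[GPIO_SETDATAOUT / 4]   = PIN_MASK;   /* pin high: start of measured region */
    /* ... code under test ... */
    gpio[GPIO_CLEARDATAOUT / 4] = PIN_MASK;   /* pin low: end of measured region */

    munmap((void *)gpio, 0x1000);
    close(fd);
    return 0;
}
```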

In fact, just the overhead of calling clock_gettime could account for that.

Although clock_gettime is technically a syscall, Linux maps the code directly into the app's address space via the VDSO mechanism, so there is no syscall overhead. But even the userspace code has some calculations to do.

For example, I have two x86 PCs. On one system the overhead of the call is 26 ns. On the other, the overhead is 1800 ns. Both of these systems run at 2 GHz or more.

For your BeagleBone/ARM system, the base clock rate may be lower, so an overhead of 500 ns may be in the right ballpark.

I usually benchmark the overhead and subtract it out from the calculations.
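A sketch of that calibration: call `clock_gettime` back-to-back many times and average (the call count here is arbitrary); subtract the resulting per-call figure from your loop measurements.

```c
/* Sketch: estimate the per-call overhead of clock_gettime itself by calling
 * it back-to-back in a loop and averaging. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    const int calls = 1000000;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < calls; i++)
        clock_gettime(CLOCK_MONOTONIC, &end);   /* measure the measurer */
    /* end holds the timestamp from the final iteration */

    double total_ns = (end.tv_sec - start.tv_sec) * 1e9
                    + (end.tv_nsec - start.tv_nsec);
    printf("clock_gettime overhead: ~%.1f ns/call\n", total_ns / calls);
    return 0;
}
```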

And, on x86, the actual code just reads the CPU's TSC value (via the rdtsc instruction) and does some adjustment. ARM has a similar H/W register, but it requires special care to enable userspace access to it (it's read via a coprocessor instruction, IIRC).
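For illustration, on x86 the raw counter can be read directly with the `__rdtsc()` intrinsic (GCC/Clang, `<x86intrin.h>`). This sketch is x86-only; on ARM the analogous counters generally need the kernel's help before userspace can read them, so it won't port directly to the BeagleBone.

```c
/* x86-only sketch: read the raw TSC around a region of code.
 * The result is in TSC ticks, not nanoseconds. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    uint64_t t0 = __rdtsc();
    /* ... code under test ... */
    uint64_t t1 = __rdtsc();
    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}
```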

Speaking of arm, I was doing a commercial arm product (an nVidia Jetson to be exact). We were very concerned about latency of incoming video frames.

The H/W engineer didn't trust the TSC [or software in general ;-)] and was trying to use a scope, an LED [controlled by a GPIO pin], and the moment when the LED flash/pulse showed up inside the video frame (i.e. the coordinates of the white dot in the video frame were [effectively] a time measurement).

It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.

And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.

Craig Estey