Using inlined/expanded assembly to obtain a rough estimate of cpu cycle cost of a function?

Question

I'm afraid I couldn't come up with a better term for what I want to ask about, other than "inlined/expanded assembly" - but let me try to explain through an example. The example will be for the RP2040 MCU (using pico-sdk), so ARM architecture - though it would be great to know if the approach can extend to other architectures.

Ultimately I want to obtain a rough estimate of a CPU cycle cost (i.e. in a rough sense, a "profile") of an interrupt service routine function - but to simplify matters, let's say in the example, I just want to estimate the CPU cycle cost of main(). Also, I would like to explore this in two contexts: "compile-time" context, and "run-time" context.

So, consider this simple main.c (note I deliberately do not use an endless while loop at the end of the main() function, so in reality this is a useless program, even if it builds):

#include <inttypes.h>         // uint32_t
#include "pico/stdlib.h"      // set_sys_clock_pll
#include "hardware/clocks.h"  // clk_sys
//#include "hardware/gpio.h"  // gpio_get

uint32_t actual_clock_sys_hz = 0;
static const void* gpio_get_ptr;

void main(void) {
  gpio_get_ptr = &gpio_get;   // obtain pointer to function
  set_sys_clock_pll(1596000000, 6, 2);          // try change sys clk to 133 MHz
  actual_clock_sys_hz = clock_get_hz(clk_sys);  // read clock back
}

Compile-time

Let's say I've built this code, and obtained an executable main.elf file. Here I can do:

arm-none-eabi-objdump -S main.elf > listing.txt

... and obtain a listing of assembly instructions, interleaved with C source code. From there, I can obtain a listing of the main function:

void main(void) {
100002f4:       b510            push    {r4, lr}
  gpio_get_ptr = &gpio_get;   // obtain pointer to function
100002f6:       4b07            ldr     r3, [pc, #28]   ; (10000314 <main+0x20>)
100002f8:       4a07            ldr     r2, [pc, #28]   ; (10000318 <main+0x24>)
100002fa:       601a            str     r2, [r3, #0]
  set_sys_clock_pll(1596000000, 6, 2);          // try change sys clk to 133 MHz
100002fc:       2202            movs    r2, #2
100002fe:       2106            movs    r1, #6
10000300:       4806            ldr     r0, [pc, #24]   ; (1000031c <main+0x28>)
10000302:       f000 f80f       bl      10000324 <set_sys_clock_pll>
  actual_clock_sys_hz = clock_get_hz(clk_sys);  // read clock back
10000306:       2005            movs    r0, #5
10000308:       f001 fa12       bl      10001730 <clock_get_hz>
1000030c:       4b04            ldr     r3, [pc, #16]   ; (10000320 <main+0x2c>)
1000030e:       6018            str     r0, [r3, #0]
10000310:       bd10            pop     {r4, pc}
10000312:       46c0            nop                     ; (mov r8, r8)

As a "first approximation", I could manually go through each of these assembly instructions (push, ldr, str, ...), look up how many clock cycles each one takes (in a best or worst case, since I am aware that some branch instructions take a different number of clock cycles depending on whether the branch is taken), and then sum the clock cycles. That would be a "rough estimate" of "how many clock cycles would this function take to execute".
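To make the "sum the cycles" idea concrete, here is a minimal host-side sketch in C that adds up the instructions of the main() listing above. The per-instruction timings are an assumption: they are the zero-wait-state figures I recall from the Cortex-M0+ Technical Reference Manual (1 cycle for ALU operations, 2 for ldr/str, 1+N for push, 3+N for pop with pc, 3 for bl), and should be checked against the manual. The names insn_class, insn_cycles, sum_cycles and estimate_main_cycles are made up for this illustration. The total covers only the call overhead of each bl, not the called functions, and it ignores any stalls from fetching code out of flash.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that occur in the main() listing. */
enum insn_class { INSN_ALU, INSN_LOAD_STORE, INSN_PUSH, INSN_POP_PC, INSN_BL };

struct insn {
    enum insn_class cls;
    unsigned n_regs;   /* register-list length; used only by push/pop */
};

/* ASSUMPTION: Cortex-M0+ timings with zero wait states. */
static unsigned insn_cycles(struct insn i) {
    switch (i.cls) {
    case INSN_ALU:        return 1;             /* movs */
    case INSN_LOAD_STORE: return 2;             /* ldr, str */
    case INSN_PUSH:       return 1 + i.n_regs;  /* push {list} */
    case INSN_POP_PC:     return 3 + i.n_regs;  /* pop {list, pc} */
    case INSN_BL:         return 3;             /* call overhead only */
    }
    assert(!"unknown instruction class");
    return 0;
}

/* main() in program order. The trailing nop is padding that sits
   after the return, so it is left out. */
static const struct insn main_body[] = {
    { INSN_PUSH, 2 },        /* push {r4, lr}          */
    { INSN_LOAD_STORE, 0 },  /* ldr  r3, [pc, #28]     */
    { INSN_LOAD_STORE, 0 },  /* ldr  r2, [pc, #28]     */
    { INSN_LOAD_STORE, 0 },  /* str  r2, [r3, #0]      */
    { INSN_ALU, 0 },         /* movs r2, #2            */
    { INSN_ALU, 0 },         /* movs r1, #6            */
    { INSN_LOAD_STORE, 0 },  /* ldr  r0, [pc, #24]     */
    { INSN_BL, 0 },          /* bl   set_sys_clock_pll */
    { INSN_ALU, 0 },         /* movs r0, #5            */
    { INSN_BL, 0 },          /* bl   clock_get_hz      */
    { INSN_LOAD_STORE, 0 },  /* ldr  r3, [pc, #16]     */
    { INSN_LOAD_STORE, 0 },  /* str  r0, [r3, #0]      */
    { INSN_POP_PC, 2 },      /* pop  {r4, pc}          */
};

/* Sum the assumed cycle cost of a straight-line instruction sequence. */
unsigned sum_cycles(const struct insn *code, size_t n) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += insn_cycles(code[i]);
    return total;
}

unsigned estimate_main_cycles(void) {
    return sum_cycles(main_body, sizeof main_body / sizeof main_body[0]);
}
```

Under these assumed timings, estimate_main_cycles() returns 29 for the body of main() alone.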

Of course, this is an extremely simplistic view: for one, the bl (branch with link) instruction is a "function call", so it "hides the details" of the callee. Here, I could manually copy and paste the assembly listings that objdump has provided for the called functions:

void main(void) {
100002f4:       b510            push    {r4, lr}
  gpio_get_ptr = &gpio_get;   // obtain pointer to function
100002f6:       4b07            ldr     r3, [pc, #28]   ; (10000314 <main+0x20>)
100002f8:       4a07            ldr     r2, [pc, #28]   ; (10000318 <main+0x24>)
100002fa:       601a            str     r2, [r3, #0]
  set_sys_clock_pll(1596000000, 6, 2);          // try change sys clk to 133 MHz
100002fc:       2202            movs    r2, #2
100002fe:       2106            movs    r1, #6
10000300:       4806            ldr     r0, [pc, #24]   ; (1000031c <main+0x28>)
10000302:       f000 f80f       bl      10000324 <set_sys_clock_pll>
->
  void set_sys_clock_pll(uint32_t vco_freq, uint post_div1, uint post_div2) {
  10000324:       b5f0            push    {r4, r5, r6, r7, lr}
  10000326:       b083            sub     sp, #12
  10000328:       0004            movs    r4, r0
  1000032a:       000d            movs    r5, r1
  1000032c:       0016            movs    r6, r2
      if (!running_on_fpga()) {
  1000032e:       f000 f865       bl      100003fc <running_on_fpga>
  10000332:       2800            cmp     r0, #0
  10000334:       d001            beq.n   1000033a <set_sys_clock_pll+0x16>
  ...
  10000380:       9700            str     r7, [sp, #0]
  10000382:       003b            movs    r3, r7
  10000384:       2202            movs    r2, #2
  10000386:       2100            movs    r1, #0
  10000388:       2006            movs    r0, #6
  1000038a:       f001 f8cf       bl      1000152c <clock_configure>
  }
  1000038e:       e7d2            b.n     10000336 <set_sys_clock_pll+0x12>
<-
  actual_clock_sys_hz = clock_get_hz(clk_sys);  // read clock back
10000306:       2005            movs    r0, #5
10000308:       f001 fa12       bl      10001730 <clock_get_hz>
->
  10001730 <clock_get_hz>:
  /// \tag::clock_get_hz[]
  uint32_t clock_get_hz(enum clock_index clk_index) {
      return configured_freq[clk_index];
  10001730:       4b01            ldr     r3, [pc, #4]    ; (10001738 <clock_get_hz+0x8>)
  10001732:       0080            lsls    r0, r0, #2
  10001734:       58c0            ldr     r0, [r0, r3]
  }
  10001736:       4770            bx      lr
  10001738:       200007e0        .word   0x200007e0
<-
1000030c:       4b04            ldr     r3, [pc, #16]   ; (10000320 <main+0x2c>)
1000030e:       6018            str     r0, [r3, #0]
10000310:       bd10            pop     {r4, pc}
10000312:       46c0            nop                     ; (mov r8, r8)

So, with this manual copy-pasting, I'd say I had obtained an "expanded" assembly listing of the function, by "inlining" the assembly code for the given functions, where there is otherwise a call to these functions in the function I'm "profiling" (hence the wording in the title).

Of course, the story doesn't end here, because also the "first-level" "inlined" functions might end up calling other functions, and so on - recursively - so in the end, a manual copy-paste for this kind of thing will be unfeasible (especially since I'm not exactly an assembly expert, so I might not even recognize what I should or should not copy-paste as part of this "inlining"), and I'd much rather have a tool do it for me. So my first question is:

Is there a tool (like gcc, objdump ...) with an option, that could obtain, in a text file, such a "recursively" "expanded"/"inlined" listing of compiled assembly code for a function - and is there an option to calculate (best or worst) the total CPU cycle count utilization of the sum of commands in the resulting assembly listing?

Note that I'm aware that this kind of estimate might not correspond to reality well: e.g. you might end up with a function that realistically, 90% of the time, gets called in a way that it jumps to its end and thus uses, say, only 32 CPU cycles, even if the "expanded" listing results in thousands of assembly instructions (or you might end up in an endless loop, and then the whole estimate doesn't apply). But this would simply say "these are all the possible instructions this function may conceivably run through, and this is the sum of the (best or worst) CPU cycles they would otherwise use individually".
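The recursive "expansion" described above amounts to walking a call graph and adding each callee's cost to its caller's. A minimal host-side sketch in C follows; it is not an existing tool. The struct func table and the expanded_cycles function are hypothetical names for this illustration. The local costs 29 for main() and 7 for clock_get_hz are hand sums of the listed instructions under assumed Cortex-M0+ zero-wait-state timings (1 cycle ALU, 2 ldr/str, 1+N push, 3+N pop with pc, 3 bl, 2 bx). The body of set_sys_clock_pll is elided in the listing, so it is entered as a placeholder of 0 and the result is only a lower bound.

```c
#include <assert.h>

#define MAX_CALLEES 4
#define MAX_DEPTH   32

struct func {
    const char *name;
    unsigned local_cycles;     /* cost of the function's own instructions */
    int callees[MAX_CALLEES];  /* indices into funcs[]; -1 ends the list */
};

/* Call graph taken from the listing. A function that is called twice
   would simply be listed twice in callees[]. */
static const struct func funcs[] = {
    /* 0 */ { "main",              29, { 1, 2, -1 } },
    /* 1 */ { "set_sys_clock_pll",  0, { -1 } },  /* PLACEHOLDER: body elided */
    /* 2 */ { "clock_get_hz",       7, { -1 } },
};

/* Cost of funcs[idx] with every callee recursively "inlined". */
unsigned expanded_cycles(int idx, int depth) {
    /* Guard: a recursive call graph has no finite expansion. */
    assert(depth < MAX_DEPTH);

    unsigned total = funcs[idx].local_cycles;
    for (int i = 0; i < MAX_CALLEES && funcs[idx].callees[i] >= 0; i++)
        total += expanded_cycles(funcs[idx].callees[i], depth + 1);
    return total;
}
```

With the placeholder in place, expanded_cycles(0, 0) gives 36, i.e. 29 for main() plus 7 for clock_get_hz. A real tool would have to extract the table from the disassembly and would need a policy for loops and indirect calls, which this sketch does not attempt.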

Run-time

At runtime, the function will take one or the other path wherever there is branching, and in the end, we'd have one particular "thread" or "path" of execution that the function has executed - so it should be possible to find the actual CPU cycles that were used, to execute this particular "path".

I think here, it would be best if one could use the gdb record functionality; unfortunately, for my platform, when I set up a gdb session and stop at a breakpoint of a running program on the MCU, I get this:

(gdb) help record
record, rec
Start recording.

List of record subcommands:

record btrace, record b -- Start branch trace recording.
record delete, record del, record d -- Delete the rest of execution log and start recording it anew.
...

(gdb) record save trace.rec
No recording is currently active.
Use the "record full" or "record btrace" command first.

(gdb) record full
Process record: the current architecture doesn't support record function.

(gdb) record btrace
Target does not support branch tracing.

Oh well, that would have been nice, but no dice this time.

Then again, I'm aware that if you open the gdb tui, you'll get the assembly listing of the program wherever it has stopped, and if you step here, you will step through the assembly commands one by one.

So I could conceivably do the following here:

Set up a breakpoint at entry of main
Set up (somehow) an "exit point" at the return (or final instruction) of main
Tell gdb (somehow) to "step through each assembly instruction (as in the TUI) and record it in a text file, from here (the breakpoint) to the "exit point", then break again"

This should now get me a listing of actual execution (or at least, one possible path of it), so it would be known which paths branches have taken, and exactly how many CPU cycles those took - so it should be possible to obtain exact CPU cycle count for that particular function execution run.

Of course, I'm aware this will be slow (as the PC would have to "remotely" instruct the MCU to run, then stop, then run, etc.), to the point of unusability (especially if some branches depend on live GPIO input values, which might change by the time gdb gets to the instruction that needs them) - but still, you'd get one actual execution path, and you could calculate exactly how many CPU cycles it had taken. So my second question is:

Is there a tool (like gdb, others? ...) with an option, that could obtain, in a text file, a listing of assembly commands that live program code had gone through (i.e. a recording) at runtime for a given function call - and is there an option to calculate the total CPU cycle count utilization for that particular execution?

Note that I'm also aware that even without the "changing inputs" issue, the CPU cycle count sum obtained in this way might not correspond to the actual time taken in reality - since the function could be randomly pre-empted by an interrupt, and thus, "observed on an oscilloscope", it would take longer than what this kind of a CPU cycle count would suggest - but at least, this estimate would hold as a minimum duration due to the CPU cycle execution cost of that particular execution path.
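For comparison with the instruction-counting approach, the cycle cost of one particular execution can also be measured on the target by reading a free-running cycle counter before and after the call. The sketch below shows only the arithmetic, in portable C, assuming the Cortex-M0+ SysTick timer is used as a 24-bit down-counter with its reload value set to 0xFFFFFF and clocked from the processor clock. The name systick_elapsed is made up for this illustration. The register accesses in the comment are my assumption of how this would look with pico-sdk's systick_hw (from hardware/structs/systick.h) and are not tested here.

```c
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu  /* SysTick counts down over 24 bits */

/* Cycles elapsed between two reads of the SysTick current-value
   register. Correct as long as fewer than 2^24 cycles passed and the
   reload value is SYSTICK_MASK, so that one wrap-around is absorbed
   by the masking. */
uint32_t systick_elapsed(uint32_t start, uint32_t end) {
    return (start - end) & SYSTICK_MASK;
}

/* ASSUMED usage on the RP2040 (untested sketch):
 *
 *   systick_hw->rvr = SYSTICK_MASK;  // reload value
 *   systick_hw->cvr = 0;             // any write clears the counter
 *   systick_hw->csr = 0x5;           // enable, use processor clock
 *
 *   uint32_t t0 = systick_hw->cvr;
 *   function_under_test();           // hypothetical function
 *   uint32_t t1 = systick_hw->cvr;
 *   uint32_t cycles = systick_elapsed(t0, t1);
 */
```

The measured value includes the few cycles spent reading the counter itself, and any interrupt that fires in between, so interrupts would have to be masked around the measurement to get the cost of the execution path alone.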

You do not want to do this by hand. If you cannot measure on an actual processor, there are simulators for estimating performance, although I cannot speak to the availability of one for your particular circumstance. One reason for not doing it by hand is that the number of cycles an instruction takes, from its start of execution to its end of execution, generally differs from how many cycles it delays other instructions. E.g., an instruction may take four cycles but be fully pipelined so that one such instruction can start every cycle. So it only holds things up by one cycle… — Eric Postpischil, Jun 16 '23 at 14:03
… However, instructions whose inputs are outputs from that instruction have to wait for those results to be available, so they could be held up more cycles. And the instruction executes in a certain execution unit of the processor. So other instructions that execute in different units can start without waiting. So you need to classify each instruction and match them with execution units. Except then you need to check whether they use any common resources that would hold them up even if they are otherwise executing in separate units. Are all the rename registers in use?… — Eric Postpischil, Jun 16 '23 at 14:04
… Is instruction dispatch stalled? This is why we measure actual performance and/or use simulators. It is too complicated to do by hand. Experienced professionals working with short sequences of simplified code (e.g., a short loop of mostly floating-point arithmetic instructions and maybe some loads and a branch or two) can estimate because those circumstances eliminate or reduce some of the complications. But things much beyond that require computer help. And the above does not even discuss cache and memory latencies. — Eric Postpischil, Jun 16 '23 at 14:05
In addition to Eric's comments I'll say you don't want to use `gdb` for this kind of job. Modern embedded systems use monitoring task design to estimate the cpu utilization. The idea is to monitor the time spent in the idle task and extrapolate the cpu utilization from that. If you want more information I recommend [this article](https://www.embedded.com/how-to-calculate-cpu-utilization/). — panic, Jun 16 '23 at 14:42
Related: [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) except your MCU is probably in-order so a bit simpler. But also more complicated because it might depend on memory wait-states if it doesn't have caches (or there are cache misses), and low-end CPUs/MCUs have less hardware to hide delays in fetching code or data. If you had an *accurate* performance model of your MCU, you could simulate it by hand or with a simulator. — Peter Cordes, Jun 16 '23 at 19:20
