I'm afraid I couldn't come up with a better term for what I want to ask about, other than "inlined/expanded assembly" - but let me try to explain through an example. The example will be for the RP2040 MCU (using pico-sdk), so ARM architecture - though it would be great to know if the approach can extend to other architectures.
Ultimately I want to obtain a rough estimate of the CPU cycle cost (a rough "profile", if you will) of an interrupt service routine - but to simplify matters, let's say that in the example, I just want to estimate the CPU cycle cost of main(). Also, I would like to explore this in two contexts: a "compile-time" context, and a "run-time" context.
So, consider this simple main.c (note I deliberately do not use an endless while loop at the end of the main() function, so in reality this is a useless program, even if it builds):
#include <inttypes.h>        // uint32_t
#include "pico/stdlib.h"     // set_sys_clock_pll
#include "hardware/clocks.h" // clk_sys
//#include "hardware/gpio.h" // gpio_get

uint32_t actual_clock_sys_hz = 0;
static const void* gpio_get_ptr;

void main(void) {
  gpio_get_ptr = &gpio_get; // obtain pointer to function
  set_sys_clock_pll(1596000000, 6, 2); // try change sys clk to 133 MHz
  actual_clock_sys_hz = clock_get_hz(clk_sys); // read clock back
}
Compile-time
Let's say I've built this code and obtained an executable main.elf file. Here I can do:
arm-none-eabi-objdump -S main.elf > listing.txt
... and obtain a listing of assembly instructions, interleaved with C source code. From there, I can extract the listing of the main function:
void main(void) {
100002f4: b510 push {r4, lr}
gpio_get_ptr = &gpio_get; // obtain pointer to function
100002f6: 4b07 ldr r3, [pc, #28] ; (10000314 <main+0x20>)
100002f8: 4a07 ldr r2, [pc, #28] ; (10000318 <main+0x24>)
100002fa: 601a str r2, [r3, #0]
set_sys_clock_pll(1596000000, 6, 2); // try change sys clk to 133 MHz
100002fc: 2202 movs r2, #2
100002fe: 2106 movs r1, #6
10000300: 4806 ldr r0, [pc, #24] ; (1000031c <main+0x28>)
10000302: f000 f80f bl 10000324 <set_sys_clock_pll>
actual_clock_sys_hz = clock_get_hz(clk_sys); // read clock back
10000306: 2005 movs r0, #5
10000308: f001 fa12 bl 10001730 <clock_get_hz>
1000030c: 4b04 ldr r3, [pc, #16] ; (10000320 <main+0x2c>)
1000030e: 6018 str r0, [r3, #0]
10000310: bd10 pop {r4, pc}
10000312: 46c0 nop ; (mov r8, r8)
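(Incidentally, newer binutils - 2.32 or later, as far as I can tell - can restrict objdump output to a single symbol, which saves hunting through the full listing:

arm-none-eabi-objdump --disassemble=main main.elf

... though that of course does not follow the calls out of main either.)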
As a "first approximation", I could manually go through each of these assembly commands: push
, ldr
, str
...; then find how many clock cycles each of them takes (possibly in a best or worst case - I am aware that some branching instructions could take different amount of clock cycles depending on the condition value), then sum the clock cycles - and that would be a "rough estimate" of "how many clock cycles would this function take to execute".
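To illustrate, here is a minimal Python sketch of that kind of summation over the listing.txt produced above; note that the per-mnemonic cycle counts in it are my illustrative guesses, which would have to be checked against the ARM Cortex-M0+ Technical Reference Manual (and on the RP2040, flash/XIP wait states would complicate them further):

#!/usr/bin/env python3
# cycle_sum.py: naive cycle estimate from an "objdump -S" listing.
# The cycle counts below are illustrative guesses for the Cortex-M0+;
# verify them against the ARM Cortex-M0+ Technical Reference Manual.
import re

CYCLES = {
    'ldr': 2, 'str': 2, 'ldrb': 2, 'strb': 2, 'ldrh': 2, 'strh': 2,
    'push': 3, 'pop': 3,  # really 1+N for N registers; 3 is a placeholder
    'bl': 3, 'bx': 3,
    'b': 2, 'beq': 2, 'bne': 2,  # taken-branch cost; not-taken is less
}

# Instruction lines in the listing look like:
# "100002f6: 4b07       ldr r3, [pc, #28] ; (10000314 <main+0x20>)"
INSN_RE = re.compile(r'^\s*[0-9a-f]+:\s+(?:[0-9a-f]{4}\s?)+\s+([a-z][a-z0-9.]*)')

total = 0
with open('listing.txt') as f:
    for line in f:
        m = INSN_RE.match(line)
        if m:
            mnem = m.group(1).split('.')[0]  # fold b.n/beq.n width suffixes
            total += CYCLES.get(mnem, 1)     # default: assume 1 cycle
print('estimated cycles:', total)

(Interleaved C source lines and .word data lines don't match the regex, so only actual instructions are counted.)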
Of course, this is an extremely simplistic view - for one, the bl (branch with link) instruction is a "function call", so it "hides the details". Here, I could manually copy-paste the assembly listings that objdump has provided for the called functions:
void main(void) {
100002f4: b510 push {r4, lr}
gpio_get_ptr = &gpio_get; // obtain pointer to function
100002f6: 4b07 ldr r3, [pc, #28] ; (10000314 <main+0x20>)
100002f8: 4a07 ldr r2, [pc, #28] ; (10000318 <main+0x24>)
100002fa: 601a str r2, [r3, #0]
set_sys_clock_pll(1596000000, 6, 2); // try change sys clk to 133 MHz
100002fc: 2202 movs r2, #2
100002fe: 2106 movs r1, #6
10000300: 4806 ldr r0, [pc, #24] ; (1000031c <main+0x28>)
10000302: f000 f80f bl 10000324 <set_sys_clock_pll>
->
void set_sys_clock_pll(uint32_t vco_freq, uint post_div1, uint post_div2) {
10000324: b5f0 push {r4, r5, r6, r7, lr}
10000326: b083 sub sp, #12
10000328: 0004 movs r4, r0
1000032a: 000d movs r5, r1
1000032c: 0016 movs r6, r2
if (!running_on_fpga()) {
1000032e: f000 f865 bl 100003fc <running_on_fpga>
10000332: 2800 cmp r0, #0
10000334: d001 beq.n 1000033a <set_sys_clock_pll+0x16>
...
10000380: 9700 str r7, [sp, #0]
10000382: 003b movs r3, r7
10000384: 2202 movs r2, #2
10000386: 2100 movs r1, #0
10000388: 2006 movs r0, #6
1000038a: f001 f8cf bl 1000152c <clock_configure>
}
1000038e: e7d2 b.n 10000336 <set_sys_clock_pll+0x12>
<-
actual_clock_sys_hz = clock_get_hz(clk_sys); // read clock back
10000306: 2005 movs r0, #5
10000308: f001 fa12 bl 10001730 <clock_get_hz>
->
10001730 <clock_get_hz>:
/// \tag::clock_get_hz[]
uint32_t clock_get_hz(enum clock_index clk_index) {
return configured_freq[clk_index];
10001730: 4b01 ldr r3, [pc, #4] ; (10001738 <clock_get_hz+0x8>)
10001732: 0080 lsls r0, r0, #2
10001734: 58c0 ldr r0, [r0, r3]
}
10001736: 4770 bx lr
10001738: 200007e0 .word 0x200007e0
<-
1000030c: 4b04 ldr r3, [pc, #16] ; (10000320 <main+0x2c>)
1000030e: 6018 str r0, [r3, #0]
10000310: bd10 pop {r4, pc}
10000312: 46c0 nop ; (mov r8, r8)
So, with this manual copy-pasting, I'd say I have obtained an "expanded" assembly listing of the function, by "inlining" the assembly code of each called function at its call site in the function I'm "profiling" (hence the wording in the title).
Of course, the story doesn't end here, because the "first-level" "inlined" functions might themselves call other functions, and so on, recursively - so in the end, manual copy-pasting for this kind of thing is unfeasible (especially since I'm not exactly an assembly expert, so I might not even recognize what I should or should not copy-paste as part of this "inlining"), and I'd much rather have a tool do it for me (I sketch the kind of expansion I mean at the end of this section). So my first question is:
- Is there a tool (like gcc, objdump ...) with an option that could produce, in a text file, such a "recursively" "expanded"/"inlined" listing of compiled assembly code for a function - and is there an option to calculate the (best- or worst-case) total CPU cycle count of the instructions in the resulting assembly listing?
Note that I'm aware this kind of estimate might not correspond to reality well: e.g. you might end up with a function that realistically, 90% of the time, gets called in a way where it jumps straight to its end and thus uses, say, only 32 CPU cycles, even if the "expanded" listing results in thousands of assembly instructions (or you might end up in an endless loop, in which case the whole estimate doesn't apply). But this would simply say: "these are all the instructions this function may conceivably run through, and this is the sum of the (best- or worst-case) CPU cycles they would otherwise use individually".
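For reference, here is a rough Python sketch of the kind of recursive expansion I have in mind, parsing the objdump listing into per-function chunks and following bl targets (the script and its names are hypothetical - it is not an existing tool - and it only guards naively against recursion):

#!/usr/bin/env python3
# expand_asm.py: naively "inline" callee listings at their bl call sites,
# recursively, from a plain objdump disassembly. A sketch, not a real tool.
import re
import sys

FUNC_RE = re.compile(r'^[0-9a-f]+ <([^>+]+)>:')         # "10000324 <set_sys_clock_pll>:"
CALL_RE = re.compile(r'\bbl\s+[0-9a-f]+\s+<([^>+]+)>')  # "bl 10000324 <set_sys_clock_pll>"

def parse(path):
    funcs, current = {}, None
    with open(path) as f:
        for line in f:
            m = FUNC_RE.match(line)
            if m:
                current = m.group(1)
                funcs[current] = []
            elif current is not None and line.strip():
                funcs[current].append(line.rstrip())
    return funcs

def expand(funcs, name, depth=0, seen=()):
    indent = '    ' * depth
    for line in funcs.get(name, []):
        print(indent + line)
        m = CALL_RE.search(line)
        # recurse into the callee, unless it is already on the call chain
        if m and m.group(1) in funcs and m.group(1) not in seen:
            expand(funcs, m.group(1), depth + 1, seen + (name,))

if __name__ == '__main__':
    funcs = parse(sys.argv[1] if len(sys.argv) > 1 else 'listing.txt')
    expand(funcs, 'main')

The expanded output could then be fed through the same per-mnemonic cycle table as in the earlier summation sketch.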
Run-time
At runtime, the function will take one path or another wherever there is branching, and in the end we'd have one particular "thread" or "path" of execution that the function took - so it should be possible to find the actual CPU cycles that were used to execute this particular "path".
I think it would be best here if one could use the gdb record functionality; unfortunately, for my platform, when I set up a gdb session and stop at a breakpoint of a running program on the MCU, I get this:
(gdb) help record
record, rec
Start recording.
List of record subcommands:
record btrace, record b -- Start branch trace recording.
record delete, record del, record d -- Delete the rest of execution log and start recording it anew.
...
(gdb) record save trace.rec
No recording is currently active.
Use the "record full" or "record btrace" command first.
(gdb) record full
Process record: the current architecture doesn't support record function.
(gdb) record btrace
Target does not support branch tracing.
Oh well - that would have been nice, but no dice this time.
Then again, I'm aware that if you open the gdb TUI, you'll get the assembly listing of the program wherever it has stopped, and if you stepi (si) here, you will step through the assembly instructions one by one.
So I could conceivably do the following here:
- Set up a breakpoint at the entry of main
- Set up (somehow) an "exit point" at the return (or final instruction) of main
- Tell gdb (somehow): "step through each assembly instruction (as in the TUI) and record it in a text file, from here (the breakpoint) to the 'exit point', then break again"
This should now get me a listing of an actual execution (or at least, one possible path of it), so it would be known which way each branch went, and exactly how many CPU cycles each instruction took - so it should be possible to obtain the exact CPU cycle count for that particular run of the function.
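In fact, gdb's Python API looks like it could express exactly that loop; here is a minimal sketch (the trace.txt file name is mine, and I'm assuming the address of main's final pop from the listing above as the "exit point"), to be sourced while stopped at the main breakpoint:

# trace_main.py - inside a gdb session stopped at main: (gdb) source trace_main.py
# Minimal sketch: single-step instruction by instruction, logging each one,
# until execution reaches an assumed "exit point" address.
import gdb

EXIT_PC = 0x10000310  # address of main's final "pop {r4, pc}" in the listing

with open('trace.txt', 'w') as log:
    while True:
        frame = gdb.selected_frame()
        pc = int(frame.pc())
        insn = frame.architecture().disassemble(pc)[0]  # just the insn at pc
        log.write('%08x: %s\n' % (insn['addr'], insn['asm']))
        if pc == EXIT_PC:
            break
        gdb.execute('stepi', to_string=True)  # to_string hides per-step output

The resulting trace.txt could then be run through the same per-mnemonic cycle table as in the compile-time sketch, to get a cycle count for that particular path.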
Of course, I'm aware this will be slow (as the PC would have to "remotely" instruct the MCU to run, then stop, then run, etc.), to the point of unusability (especially if some branches depend on live GPIO input values, which might change by the time gdb gets to the instruction that reads them) - but still, you'd get one actual execution path, and you could calculate exactly how many CPU cycles it took. So my second question is:
- Is there a tool (like gdb, or others?) with an option that could obtain, in a text file, a listing of the assembly instructions that live program code has gone through at runtime (i.e. a recording) for a given function call - and is there an option to calculate the total CPU cycle count for that particular execution?
Note that I'm also aware that, even without the "changing inputs" issue, the CPU cycle count obtained in this way might not correspond to the actual time taken in reality - the function could be randomly pre-empted by an interrupt, and thus, "observed on an oscilloscope", it would take longer than this kind of CPU cycle count would suggest. But at least the estimate would hold as a minimum duration, due to the CPU cycle execution cost of that particular execution path.