How to use QEMU to do profiling on an algorithm

Question

I have a program that runs well on Ubuntu. It is written purely in C and will eventually run on an embedded processor. I would like to know its execution speed on different targets, such as the Cortex-M3, Cortex-M4, or the Cortex-A series. Since it does a lot of double-precision arithmetic, the difference should be obvious. My current idea is to use QEMU to count the instructions executed for a given set of data. As the program only does data processing, the only resource it needs should be RAM.

I don't need a very accurate result, as it will only serve as a guide for choosing a CPU. Is there an easy guide for this task? I have little experience with QEMU. I saw there are two ways to invoke it: qemu-system-arm and qemu-user. I guess the most accurate simulation would come from qemu-system-arm. Also, the Cortex-M series should not support Linux due to the lack of an MMU, right?

Just to be sure, you want to know how your program will run on different types of CPUs but you only care about the RAM consumption? — Tzig, Aug 30 '21 at 14:01
There is nothing like ARM M. Do you mean Cortex-M? If so, it is possible to port Linux if you add some external RAM. Lack of an MMU is not a problem — 0___________, Aug 30 '21 at 14:01
I care about the instruction count for running the program once against some data. — Zhang Li, Aug 30 '21 at 15:36
Yes, I mean Cortex-M. The M3 lacks a floating-point unit, so I want to see how much that affects the result. — Zhang Li, Aug 30 '21 at 15:38
You can run the system and then run your program under gdb and do https://stackoverflow.com/questions/21628002/counting-machine-instructions-using-gdb . — KamilCuk, Aug 30 '21 at 17:38
Wow, GDB can do that. So that would be an easy way. Do you know if there is an easy way to set up an environment where I can cross-compile my program into the QEMU rootfs and use gdb to do the profiling? Especially for the Cortex-M3, M4 and A7. — Zhang Li, Aug 30 '21 at 23:20

score 6 · Answer 1 · answered Aug 30 '21 at 17:20

There's not a lot out there on how to do this because it is in general pretty difficult to do profiling of guest code on an emulated CPU/system and get from that useful information about performance on real hardware. This is because performance on real hardware is typically strongly dependent on events which most emulation (and in particular QEMU) does not model, such as:

branch mispredictions
cache misses
TLB misses
memory latency

as well as (usually less significantly than the above) differences in number of cycles between instructions -- for instance on the Cortex-M4 VMUL.F32 is 1 cycle but VDIV.F32 is 14.

For a Cortex-M CPU the hardware is simple enough (ie no cache, no MMU) that a simple instruction count may not be too far out from real-world performance, but for an A-class core instruction count alone is likely to be highly misleading.

The other approach people sometimes want to take is to measure run-time under a model; this can be even worse than counting instructions, because some things that are very fast on real hardware are very slow in an emulator (eg floating point instructions), and because the JIT process introduces extra overhead at unpredictable times.

On top of the conceptual difficulties, QEMU is not currently a very helpful environment for obtaining information like instruction counts. You can probably do something with the TCG plugin API (if you're lucky one of the example plugins may be sufficient).

In summary, if you want to know the performance of a piece of code on specific hardware, the easiest and most accurate approach is to run and profile the code on the real hardware.

Thank you for your thorough input. Real hardware would be too costly while I am still choosing a CPU. If I understand correctly, counting instructions should give a benchmark of the same order of magnitude as the real hardware. I think that is enough for me. — Zhang Li, Aug 30 '21 at 23:16
No, counting instructions quite possibly will not get you same-order answers as real hardware. Real hardware is pretty cheap these days, especially if you just want a "basically the right CPU" and you can live with doing performance testing on a cheap-and-cheerful development board. — Peter Maydell, Aug 31 '21 at 13:42
Why? Considering that the worst case is something like VDIV.F32 at 14 cycles, the average across all instructions should be somewhere between 1 and 2 cycles. That is the DMIPS/MHz for a typical program, right? — Zhang Li, Sep 01 '21 at 22:46
You mention the Cortex-A7 in another comment. That has a cache. Cache misses can be extremely expensive. Also, "typical program" is not "the program I want to profile"... — Peter Maydell, Sep 02 '21 at 09:29
Thank you for your thorough feedback. One more question: what is the best method for doing this on real hardware? Use an internal timer to measure the elapsed time? — Zhang Li, Sep 02 '21 at 13:38

score 1 · Answer 2 · answered Sep 03 '21 at 15:18

I am posting my solution here in case someone else just wants a rough estimate, as I did. Eclipse Embedded CDT provides a good starting point. You can start with a simple LED blink template. It currently supports only soft floating-point arithmetic. You can start QEMU with the built embedded program, and a picture of the STM32F407 board will appear. The LED in the picture blinks as the program runs.

The key point is that I can use the script from "Counting machine instructions using gdb" to count instructions on the QEMU target.

However, Eclipse Embedded CDT seems to get stuck when some library code is executed. My workaround is to start QEMU manually (I got the command by running 'ps' while Eclipse was running QEMU). In the first terminal:

qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG --gdb tcp::1235 -d unimp,guest_errors --semihosting-config enable=on,target=native --semihosting-cmdline blinky_c

Then in the second terminal:

arm-none-eabi-gdb blinky_c.elf

Below is the command history I entered in the gdb session:

(gdb) show commands
1  target remote :1235
2  load
3  info register
4  set $sp = 0x20020000
5  info register
6  b main
7  c

Then you can use gdb to count instructions as described in "Counting machine instructions using gdb".

One big problem with this method is that it is really slow, because gdb uses stepi to go through all the code to be counted before it produces a result. It took around 3 hours on my Ubuntu VMware machine to count 5.5M executed instructions.

score 0 · Answer 3 · answered Aug 31 '22 at 20:30

One thing that you can do is use a simulation setup like the one used in this sample: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/src/main.c

This may look like an ordinary embedded application, but the data structure vdev actually resides in a different application running on the computer (in this case a DC motor simulator), and all reads and writes to it are automatically done over the network by the simulator that runs this. The platform definition, which shows how the structure is mapped, is here: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/boards/custom_board.repl

From here it is not hard to implement advanced memory profiling by directly capturing reads and writes from the simulated application (which in this case is compiled for STM32 ARM).
