How to get CPU instruction count for a thread?

Question

I know that getrusage() can provide per-thread CPU utilization, but only as time spent on the CPU. Is there any way to get the number of CPU instructions a thread has executed, or the number of cycles it has spent on the CPU? Basically, I need a reproducible measure of how much work the thread does on the CPU. Any suggestions for doing this in C?

UPDATE (to respond to comments):

Ideally I'd need this in a platform independent way, but Linux would be the most useful.
Reproducibility is the most important for me, even if that means the actual runtime may be slightly different.
I know vTune (and have used it), but I'd like to have this info programmatically while my code is running. So vTune is out, as well as the suggestions made in the post linked by Craig Estey.
I did look at the Intel Intrinsics Guide, but did not find anything useful...

Look up the documentation for your processor and/or compiler. The processor docs can tell you what information is available, and the compiler docs will tell you whether there's an intrinsic to expose that, and if not how to write inline assembly to fetch the information yourself. — Useless, Jan 29 '19 at 19:25
NB. Number of cycles is probably more useful, but not exactly reproducible with respect to pipeline stalls, cache misses, etc. Number of instructions should be reproducible, but doesn't tell you much about speed, since instruction latency isn't uniform (except perhaps on older RISC or embedded chips) — Useless, Jan 29 '19 at 19:29
Does the CPU even track this information? How do you define a "cycle"? What about failed branch predictions or cycles skipped during memory access or because of pipeline contention? Instruction count isn't really useful anyway since "instructions" come in a variety of forms, some slow, some fast. The only reasonable measure is how many nanoseconds your thread gets on any given core. From that you can work backwards to approximate how many cycles if you can, reliably, compute the CPU speed during those times. CPU speed changes frequently. — tadman, Jan 29 '19 at 19:32
No suggestion for doing this in C but if you are using an Intel processor take a look at https://en.wikipedia.org/wiki/VTune — Support Ukraine, Jan 29 '19 at 19:43
See: https://stackoverflow.com/questions/54355631/how-do-i-determine-the-number-of-x86-machine-instructions-executed-in-a-c-progra It is a virtual duplicate of your question and gives a number of different methods — Craig Estey, Jan 29 '19 at 20:07
You can use PAPI, or Linux `perf_event_open()` directly, from inside your program. That is the same interface the Linux `perf` command uses. — Peter Cordes, May 21 '20 at 06:49
@tadman: "cycle" is obviously a core clock cycle during which the CPU was running your thread. Yes, CPUs do track this, even with varying CPU frequency, in the fixed counter for the `cycles` hardware event (Intel PMU). With `perf` to virtualize that counter across context switches (to make it per task instead of per core), this is totally doable. — Peter Cordes, May 21 '20 at 06:52
@PeterCordes Although you can measure this, I question the utility of such a number. On legacy machines that ran one thread at a fixed clock it'd be a useful metric, but now CPUs change frequencies, run multiple threads per core, and other weirdness that means a cycles count is fairly detached from actual times. — tadman, May 21 '20 at 16:38
@tadman: I find that in CPU-intensive code that doesn't sleep or wait for I/O, looking at cycle counts instead of time is a useful way to factor out CPU frequency variation when tuning code to be more efficient on a clock-for-clock basis. That might not hold up if waiting for data from other cores is a factor, though. That depends on uncore clock, not this core's clock. (Although on desktop CPUs, all cores and the uncore are locked to the same frequency.) — Peter Cordes, May 21 '20 at 16:40
As a matter of fact, I agree with Peter Cordes. When I evaluate an algorithm I want to filter out variable information. If you run the same code on the same problem instance over and over again, the variable stuff (like context switches, or the CPU going into turbo) tends to average out, so I do *not* want my evaluation to depend on those factors. Running the code hundreds of times is not feasible, which is why I was asking whether there is a repeatable measure I could use. @PeterCordes: can you recommend some sample code that I could look at? — LaszloLadanyi, May 22 '20 at 14:03
When I'm tuning a loop, I put it in a static executable by itself so I can just use `perf stat -e ...` on the whole binary. I haven't needed to use Linux `perf_event_open()` / PAPI for my own purposes. — Peter Cordes, May 22 '20 at 14:41

score 2 · Answer 1 · answered May 21 '20 at 06:33

Take a look at Google's Filament engine. They are doing exactly that; look at their profiler: https://github.com/google/filament/blob/master/libs/utils/src/Profiler.cpp You can also get more info from this video: https://www.youtube.com/watch?v=Lcq_fzet9Iw
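
The `perf_event_open()` route mentioned in the comments can be sketched in C roughly as follows. This is a minimal sketch, assuming Linux with hardware performance counters accessible to unprivileged users (which depends on the `perf_event_paranoid` setting and on the machine exposing a PMU); it is not taken from the linked Profiler.cpp. The helper names `thread_counter_open`, `thread_count_events`, and `spin` are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open one hardware counter for the calling thread.
 * Returns a file descriptor, or -1 if perf events are unavailable
 * (restricted by perf_event_paranoid, no PMU in a VM, and so on). */
static int thread_counter_open(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = config;      /* e.g. PERF_COUNT_HW_INSTRUCTIONS */
    attr.disabled = 1;         /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;   /* count user-space execution only */
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: the calling thread, on whichever CPU it runs. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Hypothetical helper: run work(arg) on the calling thread and store the
 * number of events it generated in *count. Returns 0 on success, -1 on failure. */
static int thread_count_events(uint64_t config, void (*work)(void *),
                               void *arg, uint64_t *count)
{
    assert(work != NULL && count != NULL);
    int fd = thread_counter_open(config);
    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    /* With the default read_format, read() returns a single 64-bit count. */
    ssize_t n = read(fd, count, sizeof *count);
    close(fd);
    return n == (ssize_t)sizeof *count ? 0 : -1;
}

/* Example workload: a loop the compiler cannot optimize away. */
static void spin(void *arg)
{
    uint64_t iterations = *(uint64_t *)arg;
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < iterations; i++)
        sum += i;
}
```

From the thread of interest, calling `thread_count_events(PERF_COUNT_HW_INSTRUCTIONS, spin, &n, &count)` fills `count` with the user-space instructions counted for that thread while `spin` ran; passing `PERF_COUNT_HW_CPU_CYCLES` instead yields core clock cycles. The kernel saves and restores the counter across context switches, so time spent descheduled is not counted. Instruction counts can still vary slightly between runs, so treat them as a nearly repeatable measure rather than an exact one.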
