2

A simple yet complicated question:

Which counter should I use to get perf to measure wall-clock time?

As a baseline, I think the first thing I need to measure when profiling code is plain wall-clock time, to get a first idea of where the code spends most of its time. I don't care whether it's limited by I/O, bandwidth, or something else; I just want to know where it is slow.

It sounds like a simple requirement, but with all the tricks modern CPUs use to work efficiently (such as frequency scaling) and the huge number of different (and not so well documented) performance counters available in perf, it's not easy to be sure I'm measuring the right thing.

Currently I do:

perf record -g -e ref-cycles -F 999 -- <cmd>

I think this counts at the unscaled CPU frequency and is thus proportional to the wall-clock time each part of the code spends running. But who the hell knows?

Peter Cordes
Peter
  • Yes, ref-cycles on a modern CPU ticks at a constant rate *always*, even when the core clock is halted. (The CPU feature is `constant_tsc` (and `nonstop_tsc` which is really the same feature bit: [How to get the CPU cycle count in x86\_64 from C++?](//stackoverflow.com/a/51907627)).) Of course there's also the software event `task-clock` based on kernel-measured CPU time. IDK if that would work well or not. – Peter Cordes Feb 12 '20 at 14:27
  • Oh, but **the `ref-cycles` *perf event* does stop when the core clock stops**. It's separate from the actual TSC. (The real HW event on modern Intel is `cpu_clk_unhalted.ref_tsc` or `cpu_clk_unhalted.ref_xclk_any`). Even clock halts to change CPU frequency affect it: [Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](//stackoverflow.com/q/45472147). And that's for a workload that doesn't sleep. So `ref-cycles` is fine for finding CPU hotspots, but not for overall profiles where I/O waits matter. – Peter Cordes Feb 12 '20 at 14:56
  • Do you have any recommendation for measuring the general WCT? Is there any event available that just reads the TSC? Or is that approach the wrong idea in general? – Peter Feb 12 '20 at 20:02
  • Ok. I think I misunderstood your comment. Did you say *cpu_clk_unhalted.ref_tsc* is what I’m looking for or did you say it’s affected by halts? – Peter Feb 12 '20 at 20:09
  • My first comment was part brain-fart, 2nd comment is a correction. I guess I should have deleted / reposted a corrected version. – Peter Cordes Feb 12 '20 at 23:00

1 Answer

3

You can use task-clock.

This is explicitly wall-clock time while the process is running on a CPU, and as a bonus it is portable because it doesn't rely on any PMU event.

BeeOnRope
  • Do you know of an authoritative source for this claim? A quick search on the internet turns up a lot of speculative and partly contradictory statements about what it is. – Peter Feb 12 '20 at 19:50
  • 1
    @Peter - I don't think there's any doubt that task-clock is wall clock time (per thread). It's also present in the default event list and used to calculate derived metrics like MHz, so you can be pretty sure it works sanely (unlike say cpu-clock). If you want exhaustive proof, you'll probably have to look at the source. – BeeOnRope Feb 12 '20 at 20:44
  • 1
    Are you sure this is correct? Because `perf record -g -e task-clock -F 100 -- sleep 3` collects no samples for me, suggesting it does not count wall time. Only if I add `-a` to collect all cores, it returns `1200` samples for my 4-core machine. – Quimby Aug 01 '22 at 11:40
  • 1
    I am not saying it is incorrect, just pointing out that this does not do off-CPU-time sampling. I found this question while searching for a solution that can do that. Later, I think [these answers](https://stackoverflow.com/questions/23098153/how-to-use-linux-perf-tool-to-generate-off-cpu-profile) might be it. – Quimby Aug 01 '22 at 11:50
  • 2
    @Quimby - that is correct, `task-clock` only samples the period of time where the process is "on CPU". Re-reading the OP's question, it does seem possible or even probable that they were asking about off-CPU sampling, in which case this answer is not appropriate. "Off CPU" profiling is considerably more complicated in the general case, since you have the problem of distinguishing threads that are "blocked but have useful work they could do if not blocked" from those which are simply "blocked with no useful work to do", and this is application specific. – BeeOnRope Aug 02 '22 at 00:58
  • 1
    @BeeOnRope I agree, thank you. I only commented because I googled this in hopes of finding an "off CPU" way. Anyway, your solution seems to be appropriate for compute-bound tasks, so I think the answer is just fine. – Quimby Aug 02 '22 at 06:03
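
The on-CPU versus off-CPU distinction from the comments above can be seen outside perf as well. The following is only a rough sketch in Python, not a description of how perf implements `task-clock`: it compares a wall-clock timer with the process CPU-time clock for a busy loop and for a sleep. The helper names `measure`, `busy`, and `blocked` are made up for this illustration, and `time.sleep` merely stands in for an I/O wait.

```python
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # monotonic wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process; excludes sleep
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def busy(duration=0.2):
    """Hypothetical CPU-bound work: spin until `duration` seconds have passed."""
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        pass


def blocked(duration=0.2):
    """Hypothetical off-CPU work: sleep, standing in for an I/O wait."""
    time.sleep(duration)


if __name__ == "__main__":
    for name, work in (("busy", busy), ("blocked", blocked)):
        wall, cpu = measure(work)
        # wall - cpu approximates the time spent off CPU in this
        # single-threaded sketch.
        print(f"{name:8s} wall={wall:.3f}s cpu={cpu:.3f}s off-cpu={wall - cpu:.3f}s")
```

For the busy loop the two clocks should roughly agree on an otherwise idle machine, while for the sleep the CPU clock barely advances. That is the same reason a CPU-time-based sampler collects no samples for `sleep 3`.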