
I need to calculate the running time of a portion of (C++) code and want to do this by finding the number of clock ticks elapsed during the execution of the code.

I want to find the processor time at the beginning of the code and the processor time at the end and then subtract them to find the number of elapsed ticks.

This can be done with the clock function. However, the time I'm measuring needs to be very precise, and using a function call proved to be very intrusive: the register allocator had to spill many variables held in caller-saved registers around each call.
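
For context, the clock-based version looks roughly like this (a sketch; work() is just a placeholder for the code being timed), and it is the call overhead around clock() that I want to get rid of:

#include <ctime>

void work();  // placeholder for the code under measurement

long long measure_ticks()
{
    std::clock_t start = std::clock();  // function call -> caller-saved registers spilled
    work();                             // code whose running time I want
    std::clock_t end = std::clock();    // another call, another spill
    return (long long)(end - start);    // elapsed ticks; CLOCKS_PER_SEC ticks per second
}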

Therefore, I cannot use any function calls and need to retrieve the processor time myself. Assembly code is fine.

I am using Debian and an i7 Intel processor. I can't use a profiler because it's too intrusive.

HoldOffHunger
Farhad
    Any particular reason you're not using a profiler? – Captain Obvlious Oct 29 '17 at 18:11
  • 1
    Do you care about context switches affecting the timing? That is, if your program reads the "clock time", runs for a bit, then is left "ready to run" for awhile while a higher priority process runs, then later continues running, which time do you want? The "wall clock time" or the actual cpu used time of your process? – wallyk Oct 29 '17 at 18:11
  • 2
    If the time taken for pushing a few vars is so significant, I suspect that you cannot do what you want on a general-purpose OS. – Martin James Oct 29 '17 at 18:12
  • 1
    @wallyk The actual CPU used time for my process. – Farhad Oct 29 '17 at 18:14
  • @CaptainObvlious I can't use a profiler because I need to time various portions of my program, not the runtime of the program as a whole. – Farhad Oct 29 '17 at 18:15
  • @Ron Isn't that a function call? – Farhad Oct 29 '17 at 18:17
  • @FarhadYusufali _"because I need to time various portions of my program"_ Isn't that exactly what a profiling tool does? – user0042 Oct 29 '17 at 18:20
  • @user0042 Would a profiling tool allow me to measure the run time of a specific loop for instance? Also, I think a profiler would use many function calls which would again create a lot of noise in my data. – Farhad Oct 29 '17 at 18:22
  • @FarhadYusufali Usually any modern profiling tools allow you to break down measurements to line numbers in your code. So yes, you could determine the average or total runtime for a specific loop. – user0042 Oct 29 '17 at 18:25
  • If there is so much code noise around the `for` loop that it cannot be isolated then chances are the function it resides in is too large to begin with and should be refactored into more manageable chunks. – Captain Obvlious Oct 29 '17 at 18:31
  • Please **edit your question** to explain why you absolutely need to avoid function calls. What does that mean exactly? That you don't want to depend on any `libc`? Why? – Basile Starynkevitch Oct 29 '17 at 18:33
  • Your best bet is to run bare metal (no OS) or at least in kernel mode with interrupts disabled, and access your clock hardware directly in assembly language. Failing that, figure out how much exactly a call to `clock_gettime` adds to your run time including register spillage and all, and subtract that. Alternatively, measure total time of a loop that runs many iterations of the same thing so that relative effect of `clock_gettime` on each individual iteration is negligible. – n. m. could be an AI Oct 29 '17 at 18:36
  • @BasileStarynkevitch What do you mean? My question already explains why I need to avoid function calls. – Farhad Oct 29 '17 at 18:38
  • 2
    Unfortunately, you're not going to avoid the overhead -- measuring the time will require some registers, so some other data will need to be spilled out of registers. The overhead of that and a function call is less than the general uncertainty of measuring an OOO pipeline, so it generally doesn't matter, but it does mean that any way you measure time on the CPU, it will have an error of at least a few nanoseconds. Doing better will require hooking up a high-speed scope to the hardware somehow. – Chris Dodd Oct 29 '17 at 19:03
  • No, your explanation is not good enough. – Basile Starynkevitch Oct 29 '17 at 19:04
  • 2
    How old is this research paper you are working from? Modern multi-core CPUs with prediction and pipe-lining and all the other goodies that make them fast can frustrate the smurf out of timing. Trying to get hyper precise timing while using a desktop OS will make it even harder. – user4581301 Oct 29 '17 at 19:07
  • 1
    You should probably add a link to the paper you are citing in the comments. – jww Oct 29 '17 at 19:29
  • What kind of code are you writing? Consider showing some [MCVE] explaining your concerns and the kind of routine you want to optimize or benchmark. – Basile Starynkevitch Oct 29 '17 at 19:45
  • 1
    You might also take a look at some of the search results from [microbenchmark site:stackoverflow.com](https://www.google.com/search?q=microbenchmark+site:stackoverflow.com). I have never seen the need for that type of granularity, though. For completeness, I added ARMv8-a and Power8 AES implementations to Crypto++, and I did not need them to verify the implementation was performing. One thing you should do is change the [governor settings to performance](https://github.com/weidai11/cryptopp/blob/master/TestScripts/governor.sh) (instead of on-demand or powersave).. – jww Oct 29 '17 at 20:22

4 Answers

6

You should read time(7). Be aware that even if written in assembler, your program will be rescheduled at arbitrary moments (perhaps a context switch every millisecond; look also into /proc/interrupts and see proc(5)). At that point any raw hardware timer reading is misleading. Even using the RDTSC x86-64 machine instruction to read the hardware timestamp counter is useless (since after any context switch it would be wrong, and the Linux kernel does preemptive scheduling, which can happen at any time).

You should consider clock_gettime(2). It is really fast (about 3.5 to 4 nanoseconds on my i5-4690S when measuring thousands of calls to it) thanks to vdso(7). BTW it is a system call, so you could code the assembler instructions for it directly yourself. I don't think that is worth the trouble (and it could even be slower than the vDSO call).
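
For example, a minimal sketch (the helper name cpu_time_ns is mine; adapt the clock id to what you want to measure):

#include <stdint.h>
#include <time.h>

// Per-process CPU time in nanoseconds (use CLOCK_MONOTONIC for wall-clock time instead).
static inline uint64_t cpu_time_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

// usage:
//   uint64_t t0 = cpu_time_ns();
//   ... code to measure ...
//   uint64_t elapsed_ns = cpu_time_ns() - t0;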

BTW, any kind of profiling or benchmarking is somewhat intrusive.

At last, if your benchmarked function runs very quickly (much less than a microsecond), cache misses become significant and even dominant (remember that an L3 cache miss requiring effective access to DRAM modules lasts several hundred nanoseconds, enough to run hundreds of machine instructions in L1 I-cache). You might (and probably should) try to benchmark several (hundreds of) consecutive calls. But you won't be able to measure precisely and accurately.
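
Something along these lines (a sketch using the cpu_time_ns helper above; benchmarked_function and the iteration count are placeholders):

uint64_t t0 = cpu_time_ns();
for (int i = 0; i < 1000; ++i)
    benchmarked_function();                           // placeholder for the code under test
uint64_t t1 = cpu_time_ns();
double avg_ns_per_call = (double)(t1 - t0) / 1000.0;  // an average only; it hides variance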

Hence I believe that you cannot do much better than using clock_gettime, and I don't understand why it is not good enough for your case... BTW, clock(3) calls clock_gettime with CLOCK_PROCESS_CPUTIME_ID, so IMHO it should be enough, and simpler.

In other words, I believe that avoiding any function calls is a misconception on your part. Remember that function call overhead is a lot cheaper than cache misses!

See this answer to a related question (as unclear as yours); consider also using perf(1), gprof(1), oprofile(1), time(1). See this.

Finally, you should consider asking your compiler for more optimization. Have you considered compiling and linking with g++ -O3 -flto -march=native (to enable link-time optimization)?

If your code is of numerical and vectorial nature (so obviously and massively parallelisable), you could even consider spending months of your development time to port its core code (the numerical compute kernels) on your GPGPU in OpenCL or CUDA. But are you sure it is worth such an effort? You'll need to tune and redevelop your code when changing hardware!

You could also redesign your application to use multi-threading, JIT compilation and partial evaluation and metaprogramming techniques, multiprocessing or cloud-computing (with inter-process communication, such as socket(7)-s, maybe using 0mq or other messaging libraries). This could take years of development. There is No Silver Bullet.

(Don't forget to take development costs into account; prefer algorithmic improvements when possible.)

Basile Starynkevitch
  • Aren't these function calls? – Farhad Oct 29 '17 at 18:27
  • So what? Why do you need to avoid function calls? – Basile Starynkevitch Oct 29 '17 at 18:32
  • What do you mean? The entire point of the question is to avoid function calls. I even bolded it in my description... – Farhad Oct 29 '17 at 18:37
  • 4
    @FarhadYusufali: That is because avoiding function calls is impractical and probably unnecessary. Most of us have tried to do something similar but found it is "good enough" to use a crude timer (like a stopwatch) while altering the program to repeat the code under scrutiny many times (like a million). Almost all modern CPUs have many kinds of timing variability: memory caching, register lookahead, branch prediction, etc. – wallyk Oct 29 '17 at 18:40
  • @wallyk Unfortunately, it's not good enough in this case. I need very precise timing, and the research paper explicitly emphasizes the importance of avoiding function calls. – Farhad Oct 29 '17 at 18:41
  • 4
    @FarhadYusufali You're trying to use a micrometer to measure something cut with an axe. The results will not have the precision you think they will. – Andrew Henle Oct 29 '17 at 19:33
  • You can solve the CPU-migration problem with `taskset -c 3 ./a.out`. Or `taskset -c 3 perf stat ./a.out`. – Peter Cordes Oct 30 '17 at 04:42
  • That does not stop the scheduler. – Basile Starynkevitch Oct 30 '17 at 06:30
  • @AndrewHenle - that's a common myth, but modern x86 CPUs have very fine-grained measurement capabilities. As long as what you are trying to measure is at least a few 10s of cycles you can get a fairly accurate benchmark with user-space `rdpmc`. If you can isolate the code you are interested in into a benchmark, you can probably get cycle-accurate measurements. – BeeOnRope Oct 30 '17 at 07:56
  • @BeeOnRope Great. So you can measure your instruction execution down to the nanosecond. Nice micrometer measurements. Those are almost useless numbers when the scheduler is lopping execution times for your process with an axe. "This tree I just chopped down is 14.1248157131132 meters long!" you exclaimed. "That's great. Do it again." That's not a useful precision for the tools being used here. – Andrew Henle Oct 30 '17 at 10:24
  • @AndrewHenle I don't understand your analogy, it makes no sense here. What do you mean by "the scheduler is lopping execution times for your process with an axe"? – BeeOnRope Oct 30 '17 at 10:25
  • 1
    @BeeOnRope When the scheduler is handing out timeslices on the order of perhaps up to tens of milliseconds and the overhead of unpredictable context switches is added in, there's absolutely no point in measuring with a precision of nanoseconds. If you expect that amount of precision to be meaningful, you can't run on the type of general-purpose OS under discussion here. Just because you *can* measure a chopped-down tree with a micrometer doesn't make such a precise measurement useful for improving how many trees you can harvest. – Andrew Henle Oct 30 '17 at 10:33
  • 1
    Of course there is a point. If your operation takes 100ns, the chance the scheduler interrupts it is minuscule. Furthermore, the interruptions are entirely detectable by examining the interrupt counter (via PMC) or just when you see a large outlier. If you are really obsessed with the scheduler, just enable tickless kernel and use RT priorities. In a large program with thousands of context switches and a huge number of interesting functions, measuring individual intervals accurately probably doesn't matter: I agree! That doesn't describe all programs, however. @AndrewHenle – BeeOnRope Oct 30 '17 at 10:39
2

Would a profiling tool allow me to measure the run time of a specific loop for instance?

Put your loop in an executable by itself and run perf stat ./a.out to time it and count cache-misses, branch-misses, and whatever else. Wrap a repeat loop around it if it's too short. Then you have a microbenchmark that you can profile with hardware performance counters to analyze how it decodes to uops, whether they schedule as efficiently as you thought, and so on.

Check the asm output to make sure it doesn't optimize anything away, or optimize across iterations of the outer repeat loop. In C you can use asm volatile("" ::: "memory"); as a compiler memory barrier, or you can use asm("" : "+g"(my_variable)) to force the compiler to have a C variable in a register at each step, with an empty asm statement as part of the dependency chain.
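
For instance, a sketch of an outer repeat loop using those barriers (work() and the iteration count are placeholders):

int work(int i);  // placeholder for the code under test

int repeat_benchmark()
{
    int result = 0;
    for (int i = 0; i < 100000; ++i) {
        asm volatile("" ::: "memory");  // compiler barrier: memory ops can't be reordered/merged across it
        result += work(i);
        asm("" : "+g"(result));         // result is "used" by the empty asm, so the work can't be optimized away
    }
    return result;
}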

See my answer on Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example using pure asm to make a static binary. (Negligible startup overhead so perf counters for the whole executable have less noise. But if you run it for 1000 ms and startup only takes a couple microseconds, you have very high precision and repeatability in your measurements).

It's still a microbenchmark though, measuring your loop in isolation with caches and branch prediction primed, not like what happens when your loop runs occasionally in the context of your whole program. Don't let this fool you into unrolling it too much; that wins on microbenchmarks but can hurt in practice.

Also, I think a profiler would use many function calls which would again create a lot of noise in my data.

Profilers that use HW performance counters are very non-intrusive.

Using perf record ./main-program / perf report -Mintel will show you a statistically sampled map of hot asm instructions in your code. (But keep in mind that cycles usually get charged to the instruction waiting for an input, rather than the instruction that was slow to produce it.)

Peter Cordes
  • One problem with running your program for longer to amortize the startup overhead is that you'll inevitably get context switches or at least timer interrupts. So there is a tension between longer measurements and shorter measurements. The most accurate measurements are those that fall under the tick time and yet still cancel out the startup overhead, either by only timing the relevant code or by running back-to-back runs with pure startup and startup + code under test and taking the difference. – BeeOnRope Oct 30 '17 at 08:01
2

On x86, you can do this using the rdpmc instruction, which allows you to read performance monitor counters from user space¹. You don't need any function calls to use this: you can insert the rdpmc instruction inline with the code you want to measure.

In particular, one of the counters you can use is CPU_CLK_UNHALTED.CORE which will allow you to measure small segments of code down to cycle accuracy.
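
A minimal sketch of what that inline read can look like (my own helper name; it assumes CR4.PCE is set, the fixed-function counters are programmed and enabled, and the usual fixed-counter encoding applies, i.e. ECX bit 30 set with index 1 for CPU_CLK_UNHALTED.CORE):

#include <cstdint>

// Read fixed performance counter 1 (unhalted core clock cycles) with rdpmc.
// Bit 30 of ECX selects the fixed-counter space; the low bits select the counter index.
static inline uint64_t rdpmc_core_cycles()
{
    uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"((1u << 30) | 1));
    return (uint64_t)hi << 32 | lo;
}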

If you can isolate the part of the program you want to measure into a small benchmark, you can use a Linux-compatible benchmark program I am writing, uarch-bench, to time your code and it will take care of most of the complexity of setting up the counters, running the benchmark repeatedly and excluding outliers.

The benchmark program described above uses libpfc to read the Intel PMU from userspace. It was created specifically to address scenarios like this. If you want to measure the cycles of some portion of code directly inside your existing process, you can use libpfc by itself, using the PFCSTART and PFCEND macros around the code you want to measure. These macros inline the measurement code directly: they don't involve any function call.


¹ At least as long as CR4.PCE is set, which is relatively straightforward on Linux.

BeeOnRope
1

You have been given a lot of reasons that what you want to do isn't going to give you good results, but no one has mentioned how to do what you are asking, so you can see for yourself.

#include <cstdint>

// rdtsc returns the low 32 bits of the time-stamp counter in EAX and the high
// 32 bits in EDX; combine them into one 64-bit value.
static inline unsigned long long rdtsc()
{
    uint32_t a, d;
    asm volatile("rdtsc" : "=a"(a), "=d"(d));
    return (unsigned long long)d << 32 | a;
}

rdtsc is not serializing, so to prevent it from reordering in hardware (and executing before previous instructions are complete), you may want to use rdtscp or lfence; rdtsc.

lfence; rdtsc on your i7 also stops later instructions from starting before it runs. See Difference between rdtscp, rdtsc : memory and cpuid / rdtsc?.
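
For example, a fenced variant of the function above (a sketch; whether you also want a fence after rdtsc depends on what exactly you're timing):

static inline unsigned long long rdtsc_fenced()
{
    uint32_t a, d;
    // lfence: rdtsc doesn't execute until all earlier instructions have completed (on Intel).
    asm volatile("lfence\n\trdtsc" : "=a"(a), "=d"(d) :: "memory");
    return (unsigned long long)d << 32 | a;
}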

Peter Cordes
prl
  • 1
    You should use `asm volatile`, otherwise gcc is allowed to CSE and reuse the result of an earlier call. [The manual uses x86 `rdtsc` as an example of when you need `volatile`](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile) :P – Peter Cordes Oct 30 '17 at 04:34