Using C time functions to measure time: are they code-reordering resistant?

Question

When i add a piece of code to measure CPU execution time, like:

int main()
{
   clock_t init_time = clock(); // (1)
   your_fun(); // (2)
   printf("Seconds: %f", (double)(clock() - init_time) / CLOCKS_PER_SEC); // (3)
}

If I compile that piece of code with full optimization, can the code be reordered in such a way that the init_time initialization is executed after your_fun and so you measure nothing? In other words, does clock have any memory-barrier mechanism to protect a measuring block from reordering?

In case clock is not reordering-resistant: if I move (1) and (3) to new functions implemented in a different compilation unit so the compiler, when compiling main, cannot see what is inside them, can such reordering be prevented?

I have that question for a while because i usually see contradictory execution times between the time I was waiting for an execution (seconds) versus the printed execution time (ms or even 0).

What about the C++ clocks or local time functions? Do they suffer similar problems?

Micro-benchmarking is really hard to do correctly. Personally, I would use something that's already built to do this for you like Google's benchmark library. You can use a live version of that here: https://quick-bench.com/ — NathanOliver, May 03 '21 at 14:20
"If I move (1) and (3) to new functions implemented in a different compilation unit so the compiler, when compiling main, cannot see what is inside them, can such reordering be prevented?" This suggests, that It actually gets reordered with your code. Did you check that? — lulle2007200, May 03 '21 at 14:22
@lulle That was just an example. I want to understand the risk of using such approach and not fixing a particular use case. — ABu, May 03 '21 at 14:23
1) no; 2) no. C forbids reordering of statements (but not expressions). There is a **sequence point** at the end (at the `;`) of every statement. See, for example [C11 5.1.2.3](http://port70.net/~nsz/c/c11/n1570.html#5.1.2.3p3). *I have no idea about C++.* — pmg, May 03 '21 at 14:25
How about putting some explicit barriers in there and calling it a day? :-) — oakad, May 03 '21 at 14:27
Compile the function and look at the code it produces. The compiler may do whatever as long as the result matches the code you gave it. E.g. it may omit a function call that has zero effect on the program (void foo(void){return)). — lulle2007200, May 03 '21 at 14:30
Also, @Peregring-lk are you talking about what the language standard dictates here? If so, wouldn't hurt to throw a [language-lawyer] tag in there. — Marco Bonelli, May 03 '21 at 14:35
Did you know about [Compiler Explorer](https://godbolt.org/)? — gkhaos, May 03 '21 at 14:44
C and C++ are defined exactly the same way. In fact, the whole MT (threaded) semantics is called the C/C++ semantics. — curiousguy, May 03 '21 at 23:18
@oakad "_putting some explicit barriers in there_" How would one write those barriers? — curiousguy, May 03 '21 at 23:35
https://preshing.com/20120625/memory-ordering-at-compile-time/ (C++ now has convenient wrappers for those) — oakad, May 04 '21 at 04:32

Luis Colorado · Answer 1 · 2021-05-04T08:30:20.070

clock() call doesn't give you the cpu time used by your program. It gives you the number of ticks elapsed since boot (or last wraparound, if you have a large uptime) which was the only way to get subsecond timing in old legacy unix. But it is not process time, it is wall clock time.

As to illustrate this, I shall cite a paragraph from the clock(3) man page in the FreeBSD 13.0-STABLE distribution:

The clock() function conforms to ISO/IEC 9899:1990 (“ISO C90”). However, Version 2 of the Single UNIX Specification (“SUSv2”) requires CLOCKS_PER_SEC to be defined as one million. FreeBSD does not conform to this requirement; changing the value would introduce binary incompatibility and one million is still inadequate on modern processors.

Today, you can use the widespread gettimeofday() system call (also wall clock), which will give you the time of day from the unix epoch, with microsecond resolution (which is far more resolution than the tick of the clock call, and you don't need to know about the CLOCKS_PER_SEC constant) and this call is today the most portable way to do as almost all the unices available implement it, or better, the newer POSIX system calls clock_gettime(2) and friends, that allows you to use nanosecond resolution (if available) and allows you to select a process running clock (one that will give you your cpu time, and not the wall clock time)

This last system call is not available everywhere, but if your system claims to be posix, then you'll probably have a subset of the clocks specified.

       int clock_gettime(clockid_t clk_id, struct timespec *tp);

where

struct timespec

is filled with the clock time referred in clk_id

The fields of the struct timespec structure are:

tv_sec a time_t field representing the number of seconds since unix epoch (01/01/1970 00:00:00 UTC time)
tv_nsec a long value representing the number of nanoseconds since the last second tick. Its range goes from 0 to 999999999.

The values of the different clock ids have been taken from the linux online manual (Ubuntu release) The ones not marked as linux specific are POSIX, so their ids should be portable (although probably not as precise or quick as the ones implemented by the linux kernel)

CLOCK_REALTIME wall clock.
CLOCK_REALTIME_COARSE (linux specific) faster than the previous, but less precise.
CLOCK_MONOTONIC Wall clock but ensures that different calls will always give you ascending time (whatever this can mean). As the clock_gettime(2) manpage says, this will grow even if you make a system clock adjust backwards to the master clock.
CLOCK_MONOTONIC_COARSE (Linux specific) similar to the above, but monotonic.
CLOCK_MONOTONIC_RAW (linux specific)
CLOCK_BOOTTIME (linux specific) similar to the clock you used, but in ns instead of clock ticks.
CLOCK_PROCESS_CPUTIME_ID (linux specific) process time (not wall clock time) used by the whole process.
CLOCK_THREAD_CPUTIME_ID (linux specific) thread time (not wall clock time) this is the time your thread has been using the cpu, and this is the clock I think you must read.

So, finally your snippet could get:

   struct timespec t0, t1;

   int res = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
   // check errors from res.
   your_fun();
   int res = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
   // we'll subtract t0 from t1, so we get the delay in t1.
   if (t1.tv_nsec < t0.tv_nsec) { // carry to the seconds part.
      t1.tv_nsec += 1000000000L - t0.tv_nsec;
      t1.tv_sec--;
   } else {
      t1.tv_nsec -= t0.tv_nsec;
   }
   t1.tv_sec -= t0.tv_sec;
   // no need to convert to double or do floating point arithmetic.
   printf("Seconds: %d.%09d\n", t1.tv_sec, t1.tv_nsec);

You can wrap those calls into a callback function, that gives you as a result the execution time:

struct timespec *time_wrapper(
        void (*callback)(), // function to be called.
        struct timespec *work) // provide a working space so you don't need to allocate it.
{
    struct timespec t0;
    int res = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
    // check errors from res.
    if (res < 0) return NULL; // check errno.
    callback();
    int res = clock_gettime(CLOCK_THREAD_CPUTIME_ID, work);
    if (res < 0) return NULL; // check errno
    // we'll subtract t0 from *work, so we get the delay in *work.
    if (work->tv_nsec < t0.tv_nsec) { // carry to the seconds part.
        work->tv_nsec += 1000000000L - t0.tv_nsec;
        work->tv_sec--;
    } else {
        work->tv_nsec -= t0.tv_nsec;
    }
    work->tv_sec -= t0.tv_sec;
    return work;
}

and you can call it as:

#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
   struct timespec delay;

   if (time_wrapper(your_fun, &delay) == NULL) { // some error
      fprintf(stderr, "Error: %s\n", strerror(errno));
      exit(EXIT_FAILURE);
   }

   printf("CPU Seconds: %lu.%09lu",
          (unsigned long) delay.tv_sec,
          (unsigned long) delay.tv_nsec);

   exit(EXIT_SUCCESS);
}

One final note: Compiler optimization cannot reorder the code in a way that makes the order you imposed in your statements to do a different thing than the one you expect. The only way your code can be affected by optimization is that you have incurred in some Undefined Behaviour in your program, and this means your code is incorrect.

If your code is correct, then optimization in the compiler cannot affect it.

score 1 · Answer 2 · answered May 03 '21 at 23:34

A call to a time reading function is a system call: the measurement will happen exactly at the right place, with respect to other system calls including I/O operations. That is not what you are asking for: you are timing a pure operation, a computation with no SE (side effect).

You want the computation to occur exactly at the right place, between two system calls; you need it to be ordered and not optimized out completely either.

As usual, to defeat optimizations, use volatile: all volatile operations are side effects, like a system call (like a call to read or write).

Note: You may count on truly separate compilation to achieve the same effect; that's if you're certain that global program optimization will not be applied.

By definition, side effects can never be reordered. Also, a volatile read has an indeterminate value, even of a constant:

const volatile int indeterminate_zero = 0;

Each instance of indeterminate_zero in any expression will yield a 0 that the compiler cannot assume is 0.

You need to bracket your computation with side effects, and make it dependent on unknown values, in order for the computation to occur precisely where you want.

Inserting volatile operations at the right place will do that for the trivial cost of a normal memory read or write: an operation on a volatile global integer type is exactly as costly as a regular (non volatile) similar operation with optimization disabled, on almost all compilers.

It's important that you see a volatile read as a special scanf and a write as a special printf where neither interacts with any STDIO stream and they both have minuscule cost, and are more portable than calling trivial system calls (like getpid, getppid, a write of 0 bytes...).

(I don't post an answer complete with compilable code because you have not presented a question complete with a runnable code that I could reformulate.)

If you don't want to bother will all that, just use separate compilation. All commonly used compilers support separate compilation (a compiler could legally mandate that you submit all your code at once, but none do - that I know of).

I don't think a volatile would suffice. A side-effect cannot be reordered respect to other side-effects because it would change the observable behaviour of the program. If I make `init_time` volatile, it won't be reordered respect to the `printf` statement reading such volatile, but I think the `your_fun` call can still be reordered above or below any of these two statements, so pretty much the same as without the use of the volatile. And now that I think about it, different compilation units won't be enough either; if I enable optimizations at linkage level, the problem would reappear. — ABu, May 04 '21 at 10:35

Alessandro Teruzzi · Answer 3 · 2021-05-05T08:45:30.797

In C++ you can leverage RAII.

Referring to this answer Is C++ compiler allowed to optimize out unreferenced local objects

This will guarantee to prevent the object to be optimized away. This will solve the OP issue:

I have that question for a while because i usually see contradictory execution times between the time I was waiting for an execution (seconds) versus the printed execution time (ms or even 0).

You can write a simple wrapper object that store the starting point in the constructor and the end point in the destructor (called when the object goes out of scope). You can simply allocate an object on the stack at the top of your function or using scope accordingly:

#include <chrono>
#include <iomanip>
#include <iostream>

class SimpleProfiler
{
    std::chrono::time_point<std::chrono::steady_clock> start;
public:
    SimpleProfiler() : start(std::chrono::steady_clock::now()) {}
    //copies make not much sense
    SimpleProfiler& operator=(const SimpleProfiler&) = delete;
    SimpleProfiler(const SimpleProfiler&) = delete;
    ~SimpleProfiler() 
    {
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << "Execution time" << std::setw(9) << ": " << diff.count() << " s\n";
    }
};


int main()
{
    SimpleProfiler sp; //goes out of scope at the end of the main
    your_func();
    { 
        SimpleProfiler inner; // this will go out of scope...
        another_one_here();
    } // ...here!
    return 0;
}

You can dress the class up a bit, for instance you can provide a label and better sink for the output than the standard output.

Two hints:

always remember to give a variable a name (otherwise the temp object will be destruct at the end of the statement)
prevent any exception to be thrown by the destructor

This does not seem to answer the question. I like the code snippet, sure... but it does not tell us if the compiler is allowed to reorder the calls to "clock". — André, May 04 '21 at 09:47
"can the code be reordered in such a way that the init_time initialization is executed after your_fun and so you measure nothing?" Maybe you are right, I thought the main issue was to have a guarantee method to avoid reordering. — Alessandro Teruzzi, May 04 '21 at 09:52

score -3 · Answer 4 · answered May 03 '21 at 14:42

-3

If I compile that piece of code with full optimization, can the code be reordered in such a way that the init_time initialization is executed after your_fun and so you measure nothing? In other words, does clock have any memory-barrier mechanism to protect a measuring block from reordering?

No... and no.

C forbids reordering of statements (though not expressions). There is a sequence point at the end (at the ;) of every statement.

See C11 6.8

[...] There is a sequence point between the evaluation of a full expression and the evaluation of the next full expression to be evaluated.

No barrier mechanism is needed.

answered May 03 '21 at 14:42

pmg

106,608
13
126
198

That alone is not sufficient. The compiler does not reorder the function calls because it is not able to prove that the `clock()` calls do not interfere with the `your_func()` call. – j6t May 03 '21 at 14:52
@j6t I was thinking about that, but my memory is fuzzy. I think sequence points are about what behaviour to expect from the point of view of the programmer, but later the compiler can reorder any sentence as far as the final output doesn't change the observable behaviour, which is defined in terms of IO operations, volatile variables and things like that. But the clock function is non-deterministic so I don't know how does it relate to "observable behaviour". I guess you are right about that. What surprise me is how many measurements in this world can be wrong for not knowing that. – ABu May 03 '21 at 15:06
@j6t so your last message implies that, if `your_fun` is inline, the compiler could totally apply an offending reordering to that piece of code if it sees an optimization opportunity right? – ABu May 03 '21 at 15:10
7

Sequencing rules, including sequence points, are specifications of behavior inside the abstract machine the C standard uses as a model. The C implementation is not required to implement operations in the order they are specified to be in in the abstract machine—it may implement any computation that produces the same *observable behavior*. The current time is not an observable behavior, and, if the compiler knows the observable behavior produced by `your_fun()` does not interact with the `clock()` calls, it may reorder them. – Eric Postpischil May 03 '21 at 15:24

Using C time functions to measure time: are they code-reordering resistant?

4 Answers4