
I have searched for and used many approaches to measuring elapsed time, and there are many questions on this topic. For example, this question is very good, but when you need an accurate time recorder I couldn't find a good method. So I want to share my method here, to be used and to be corrected if something is wrong.

UPDATE & NOTE: this question is about benchmarking below one nanosecond. It's completely different from using `clock_gettime(CLOCK_MONOTONIC, &start);`, which records times of more than one nanosecond.
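For comparison, here is a minimal sketch of the `clock_gettime(CLOCK_MONOTONIC, ...)` approach mentioned above (Linux-specific; the measured region and output format are placeholders, not part of my method):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* code under test goes here */
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("elapsed: %lld ns\n", ns);   // resolution/overhead is on the order of nanoseconds
    return 0;
}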

UPDATE: A common method to measure speedup is to repeat the section of the program which should be benchmarked. But, as mentioned in the comments, it might show different optimization when the researcher relies on auto-vectorization.

NOTE: It's not accurate enough to measure the elapsed time of a single repetition. In some cases my results show that the section must be repeated more than 1K or even 1M times to get the smallest time.

SUGGESTION: I'm not familiar with shell programming (I just know some basic commands...), but it might be possible to measure the smallest time without repeating inside the program.

MY CURRENT SOLUTION: In order to prevent the branches, I repeat the code section using a macro `#define REP_CODE(X) X X X... X X`, where `X` is the code section I want to benchmark, as follows:

//numbers
#include <x86intrin.h>    // _rdtsc

#define MAX1 4            // example value; MAX1 was not shown in the original snippet
#define FMAX1 (MAX1*MAX1)
#define COEFF 8
// Repeat the section without a loop so no branches are introduced.
// Shown here with 10 copies; in practice X X X ... X X is expanded many more times.
#define REP_CODE(X) X X X X X X X X X X
#define REPS 10           // must match the number of copies in REP_CODE

int __attribute__(( aligned(32))) input[FMAX1+COEFF];           //= {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17};
int __attribute__(( aligned(32))) output[FMAX1];
int __attribute__(( aligned(32))) coeff[COEFF] = {1,2,3,4,5,6,7,8};//= {1,1,1,1,1,1,1,1};            //= {1,2,1,2,1,2,1,2,2,1};

long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc[REPS];
int i, j, ii = 0;

int main()
{
    REP_CODE(
        t1_rdtsc=_rdtsc();
        //Code
        for(i = 0; i < FMAX1; i++){
            for(j = 0; j < COEFF; j++){//IACA_START
                output[i] += coeff[j] * input[i+j];
            }//IACA_END
        }
        t2_rdtsc=_rdtsc();
        ttotal_rdtsc[ii++]=t2_rdtsc-t1_rdtsc;
        )
    // The smallest element in `ttotal_rdtsc` is the answer
}

This does not impact the optimization, but it is restricted by code size, and in some cases the compile time gets too long.

Any suggestion and correction?

Thanks in advance.

Amiri
  • This is a needed question on Stack Overflow. Why didn't people like it? – Amiri Apr 27 '17 at 01:43
  • I can't help but wonder how little such microbenchmarks would mean... Real programs are a lot more complex than one small bit of instructions you call just once. Timing that seems unhelpful at best. – rubenvb Jul 15 '17 at 09:33
  • It's not for an instruction. BTW, for example, similarity measurement in a multimedia application uses an algorithm which finishes in below a nanosecond. For this I benchmarked gcc, clang, and icc, and don't forget that small things make up the largest one. In this case, me and some other people working on small things help to make the smaller parts faster, and then the whole program is fast enough. As an example, VLC is a real application; just think about its components and how important it is to process everything as fast as possible.. – Amiri Jul 15 '17 at 09:39
  • well, a real multimedia application executes the small things many times, in combination with a lot of other things, including memory access and lots of other instructions. It would seem to me that benchmarking a full pass of some full operation would be a lot more meaningful to see if your micro optimisations actually optimised something in the whole. – rubenvb Jul 15 '17 at 09:43
  • "this question is for Benchmarking, less than one nanosecond.", why, < 1 nanosecond is not enough for you ? Use of rdtsc is discouraged. By the way, choose one language tag please C is not C++ is not C. – Stargateur Jul 15 '17 at 13:47
  • @rubenvb, you are right. But, both are needed. If you are working to optimize a real application it's fine. – Amiri Jul 15 '17 at 16:31
  • @Stargateur, because most of my programs run in less than a nanosecond and I have to measure them with 3 different compilers. `rdtsc` is an accurate tool and I didn't find any alternative method for x86 – Amiri Jul 15 '17 at 16:35
  • You simply cannot accurately measure time at such a small interval without dedicated timing hardware. `rdtsc` is not reliable for this purpose. If you want to benchmark fast operations, do like the other 10,000s of benchmarks do: Run the operation many thousands or millions of times, then divide by the number of iterations. This isn't rocket science. – Jonathon Reinhart Jul 15 '17 at 16:44
  • @FackedDeveloper, simple rdtsc is not anywhere near accurate enough to measure 3-4 instructions (they are 1 nanosecond total on 3-4 GHz CPUs), because there are: out-of-order execution of commands (and rdtsc is not serialized, and serialization is > 10 ticks) and a long CPU pipeline (>12 stages, or > 3-4 ns). It is just impossible to have a "simple" time to execute some command, and you should not compare compilers but the machine code generated by compilers, with good knowledge of the microarchitecture, using simulators (intel IACA) and perf counters (pmu-tools - ocperf.py). (A serialized-rdtsc sketch follows this comment thread.) – osgx Jul 15 '17 at 16:46
  • @JonathonReinhart, it's the only reason that sometimes I cannot trust Agner Fog's documents, which measured like this. In a small test I got numbers between 4k and 400 cycles... do you think the best solution is the mean or the average? – Amiri Jul 15 '17 at 16:48
  • @osgx, IACA has been abandoned since `Haswell`; however, I still use it for my `SKL` because there are not many differences in what IACA counts. BTW, I use perf too. Valgrind, so-so! BTW, I think all these tools should be used, but measuring at runtime is a must. If you are interested, let me upload a piece of code and a test to see the differences. – Amiri Jul 15 '17 at 16:54
  • Yes, the average of running 1000s of iterations (against realistic workload) is best. – Jonathon Reinhart Jul 15 '17 at 17:03
  • perf is not suitable for a small program when you don't want to repeat a section... – Amiri Jul 15 '17 at 17:28
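Since several comments above touch on serialization and averaging, here is a hedged sketch (not the method from the question) that fences the measured region with `_mm_lfence()` so out-of-order execution cannot move work across the `rdtsc` reads, repeats the region many times, and reports both the minimum and the average; `NITER` and the dummy workload are only illustrative:

#include <stdio.h>
#include <x86intrin.h>   // _rdtsc, _mm_lfence

#define NITER 100000     // illustrative repetition count

static volatile int sink;            // keeps the dummy work from being optimized away

int main(void)
{
    long long best = -1, total = 0;

    for (long long n = 0; n < NITER; n++) {
        _mm_lfence();                // earlier instructions finish before the first read
        long long t1 = _rdtsc();
        _mm_lfence();

        sink = sink + 1;             // <-- code under test goes here

        _mm_lfence();                // the work must retire before the second read
        long long t2 = _rdtsc();
        _mm_lfence();

        long long d = t2 - t1;
        total += d;
        if (best < 0 || d < best)
            best = d;
    }

    printf("min = %lld cycles, avg = %.1f cycles over %d runs\n",
           best, (double)total / NITER, NITER);
    return 0;
}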

3 Answers


If you have a problem with the auto-vectorizer and want to contain it, just add an `asm("#something");` after your `begin_rdtsc`; it will separate the do-while loop from the measured code. I just checked, and with this change the compiler vectorized your posted code, which the auto-vectorizer was previously unable to vectorize. I changed your macro so you can use it (a usage sketch follows the code):

// do_while and OVERAL_TIME must be defined before this block (see the usage sketch below)
long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc[do_while], ttbest_rdtsc = 99999999999999999, elapsed,  elapsed_rdtsc=do_while, overal_time = OVERAL_TIME, ttime=0;
int ii=0;
    #define begin_rdtsc\
                    do{\
                        asm("#mmmmmmmmmmm");\
                        t1_rdtsc=_rdtsc();

    #define end_rdtsc\
                        t2_rdtsc=_rdtsc();\
                        asm("#mmmmmmmmmmm");\
                        ttotal_rdtsc[ii]=t2_rdtsc-t1_rdtsc;\
                    }while (++ii<do_while);/* pre-increment keeps ii inside the array */\
                    for(ii=0; ii<do_while; ii++){\
                        if (ttotal_rdtsc[ii]<ttbest_rdtsc){\
                            ttbest_rdtsc = ttotal_rdtsc[ii];}}\
                    printf("\nthe best is %lld in %lld iterations\n", ttbest_rdtsc, elapsed_rdtsc);
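A minimal usage sketch for these macros, assuming the block above is saved in a (hypothetical) header adms_timing.h and that `do_while` and `OVERAL_TIME` are defined before it is included:

#include <stdio.h>
#include <x86intrin.h>

#define do_while 1000            // repetition count used inside the macros
#define OVERAL_TIME 999999999
#include "adms_timing.h"         // hypothetical file containing the block above

int main(void)
{
    begin_rdtsc
        // code section to benchmark goes here
    end_rdtsc
    return 0;
}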
ADMS

I have developed my first answer further and got this solution. But I still want a better solution, because it is very important to measure the time accurately and with the least impact. I put this part in a header file and include it in the main program files.

//Header file header.h
#include <stdio.h>
#include <x86intrin.h>   // _rdtsc

#define count 1000       // number of repetitions
#define OVERAL_TIME 999999999
long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc[count], ttbest_rdtsc = 99999999999999999, elapsed,  elapsed_rdtsc=count, overal_time = OVERAL_TIME, ttime=0;
int ii=0;
#define begin_rdtsc\
                    do{\
                        t1_rdtsc=_rdtsc();

#define end_rdtsc\
                        t2_rdtsc=_rdtsc();\
                        ttotal_rdtsc[ii]=t2_rdtsc-t1_rdtsc;\
                    }while (++ii<count);/* pre-increment keeps ii inside the array */\
                    for(ii=0; ii<count; ii++){\
                        if (ttotal_rdtsc[ii]<ttbest_rdtsc){\
                            ttbest_rdtsc = ttotal_rdtsc[ii];}}\
                    printf("\nthe best is %lld in %lld iterations\n", ttbest_rdtsc, elapsed_rdtsc);

//Main program
#include "header.h"
.
.
.
int main()
{
    //before the section
    begin_rdtsc
       //put your code here to measure the clocks.
    end_rdtsc
    return 0;
}
Amiri
  • This solution is not a good one when you turn the auto-vectorizer on; it restricts auto-vectorization. Even a simple loop makes the auto-vectorizer confused. – Amiri Jul 15 '17 at 08:26
  • `do_while` is one of the worst variable names I've ever seen. Call it `reps` or `count` or something. And your sample `main` doesn't define any of the variables that your macros depend on. Putting that stuff in a macro just doesn't seem to gain you anything compared to having the repeat-loop right there in `main()` – Peter Cordes Jul 16 '17 at 01:03
  • Hey @PeterCordes. You and Paul taught me many points. They blocked me from asking questions on SO and I've created this account. You might understand me and help me again. Please, if you know any way to measure time, tell me. I think the smallest time is much more accurate than the mean or average. For this I should repeat the section, and the problem is that when I put my code inside a loop it limits the auto-vectorizer. – Amiri Jul 16 '17 at 02:08
  • I just changed the loop inside the main question and you can see how a simple loop around the program limits the autovectorizer in icc and clang. gcc works fine. – Amiri Jul 16 '17 at 02:18
  • That hasn't usually been my experience with making microbenchmarks, but I do usually use gcc, and usually I'm manually vectorizing with intrinsics. Try putting your code under test in its own function. You can make it `__attribute__((noinline))` to disable having it hoist its setup out of the repeat-loop, if that's a problem. – Peter Cordes Jul 16 '17 at 02:21
  • The main problem with your premise is that something that only takes a few clock cycles interacts with the surrounding code because of out-of-order execution. Measuring A, B, and C separately doesn't necessarily tell you how long it will take to run A+B+C, because OOO execution can overlap them. – Peter Cordes Jul 16 '17 at 02:23
  • But if I prevent the function from being inlined in my code, I have function-call overhead. Am I wrong? – Amiri Jul 16 '17 at 02:33
  • Right, of course. Profile a big enough function that it doesn't matter much. Like one that already loops over an array. Or don't use `noinline` in that case; putting it in a separate function may still help. – Peter Cordes Jul 16 '17 at 02:36
  • And I don't have any other part in my program; I only benchmark some multimedia kernels and analyze them with perf and IACA, but I couldn't find anything better than speedup to evaluate the performance of GCC, ICC, and Clang. – Amiri Jul 16 '17 at 02:36
  • I separated it, and I have many other versions with intrinsics (SSE4 and AVX2) besides this scalar one... and I am working to boost the auto-vectorizer. So I can't ignore this problem. – Amiri Jul 16 '17 at 02:38
  • I just checked, and putting it in a separate function does not help, but `__attribute__((noinline))` helps and doesn't stop the auto-vectorizer from vectorizing it (see the sketch after these comments). – Amiri Jul 16 '17 at 02:40
  • The compiler-generated code will be slightly different when it inlines into the real use-case so you should try to actually measure that. Those differences can be what's actually important, because compilers don't really model the CPU pipeline; the difference could actually be important. You can maybe tweak it slightly to reread the same memory to avoid cache misses. – Peter Cordes Jul 16 '17 at 02:41
  • Thanks @PeterCordes. I think I have no choice. Except that `REP_CODE(X) XXX...XXX` works fine. – Amiri Jul 16 '17 at 02:51
  • Aren't there any `#pragma` commands to tell the compiler not to think about the entire loop and to vectorize the inner loops? – Amiri Jul 16 '17 at 02:53
  • There's `#pragma omp SIMD` I think, but it's a slightly different auto-vectorizer than the regular one. – Peter Cordes Jul 16 '17 at 14:10
  • @PeterCordes, thank you. I'm familiar with this command. I meant: is there any command to tell the auto-vectorizer "don't try to vectorize this loop" or "don't include this loop in the auto-vectorization algorithm"? I want to tell it that the do-while is something else and should not take part; just vectorize the inner loops. – Amiri Jul 16 '17 at 22:11
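To illustrate the `__attribute__((noinline))` approach discussed in the comments above, here is a hedged sketch: the kernel is kept out of line so the timing/repeat loop does not affect how its inner loops are auto-vectorized. The array sizes, `NITER`, and the kernel body are only placeholders modeled on the question:

#include <stdio.h>
#include <x86intrin.h>

#define N     1024
#define COEFF 8
#define NITER 1000

int input[N + COEFF], output[N], coeff[COEFF] = {1,2,3,4,5,6,7,8};

// Kept out of line so the timing/repeat loop around the call
// does not change how this kernel itself gets vectorized.
__attribute__((noinline))
void kernel(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < COEFF; j++)   // candidate for the auto-vectorizer (or #pragma omp simd)
            output[i] += coeff[j] * input[i + j];
}

int main(void)
{
    long long best = -1;

    for (int n = 0; n < NITER; n++) {
        long long t1 = _rdtsc();
        kernel();
        long long t2 = _rdtsc();
        if (best < 0 || t2 - t1 < best)
            best = t2 - t1;
    }

    printf("best: %lld cycles\n", best);
    return 0;
}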

I recommend using this method for x86 micro-architecture.

NOTE:

  1. NUM_LOOP should be a number large enough that repeating your code lets you record the best (smallest) time accurately.
  2. The initial value of ttbest_rdtsc must be bigger than the worst time; I recommend making it as large as possible.

  3. I used OVERAL_TIME (you might not want it) as another stopping rule, because I use this for many kernels and in some cases NUM_LOOP was very big and I didn't want to change it. OVERAL_TIME limits the iterations and stops the measurement after a specific total time.

UPDATE: The whole program is this:

#include <stdio.h>
#include <x86intrin.h>

#define NUM_LOOP 100 //executes your code NUM_LOOP times and keeps the smallest time, to avoid overheads such as cache misses, etc.

int main()
{
    long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc, ttbest_rdtsc = 99999999999999999;
    int do_while = 0;
    do{

        t1_rdtsc = _rdtsc();
            //put your code here
        t2_rdtsc = _rdtsc();

        ttotal_rdtsc = t2_rdtsc - t1_rdtsc;

        //store the smallest time:
        if (ttotal_rdtsc<ttbest_rdtsc)
            ttbest_rdtsc = ttotal_rdtsc;

    }while (++do_while < NUM_LOOP);   // runs exactly NUM_LOOP times

    printf("\nthe best is %lld in %d repetitions\n", ttbest_rdtsc, NUM_LOOP );

    return 0;
}

I have changed it to the following and added it to a header for myself, so I can use it simply in my programs.

// header: define NUM_LOOP before including this file
#include <stdio.h>
#include <x86intrin.h>
#define do_while NUM_LOOP
#define OVERAL_TIME 999999999
long long t1_rdtsc, t2_rdtsc, ttotal_rdtsc, ttbest_rdtsc = 99999999999999999, elapsed, elapsed_rdtsc=do_while, overal_time = OVERAL_TIME, ttime=0;
#define begin_rdtsc\
                do{\
                    t1_rdtsc=_rdtsc();

#define end_rdtsc\
                    t2_rdtsc=_rdtsc();\
                    ttotal_rdtsc=t2_rdtsc-t1_rdtsc;\
                    if (ttotal_rdtsc<ttbest_rdtsc){\
                        ttbest_rdtsc = ttotal_rdtsc;\
                        elapsed=(do_while-elapsed_rdtsc);}\
                    ttime+=ttotal_rdtsc;\
                }while (elapsed_rdtsc-- && (ttime<overal_time));\
                printf("\nthe best is %lld in %lldth iteration and %lld repetitions\n", ttbest_rdtsc, elapsed, (do_while-elapsed_rdtsc));

How to use this method? Well, it is very simple!

#define NUM_LOOP 1000     // choose the repetition count before including the header
#include "header.h"       // the header shown above

int main()
{
    //before the section
    begin_rdtsc
       //put your code here to measure the clocks.
    end_rdtsc
    return 0;
}

Be creative; you can change it to measure the speedup in your program, etc. An example of the output is:

the best is 9600 in 384751th iteration and 569179 repetitions

My tested code took 9600 clock cycles; the best time was recorded in the 384751st iteration, and the code was tested 569179 times.

I have tested this on GCC and Clang.

Amiri
  • But your benchmarking code contains branches... Also, is there any reason why you aren't using inline functions? At the very least use \ to format the messy macro... – Lundin Apr 24 '17 at 08:54
  • What do you mean by `your benchmarking code contains branches`? It's OK with branches, I think, because it's just a simplified version. It measures the clocks correctly, I checked, and it works for functions, loops, etc. About using \, I tried to use it but I couldn't rely on it, because this implementation puts the statements exactly into the code; I think a macro might not be suitable. BTW, if you think you can develop this solution, please provide another answer based on my solution. I will accept it and appreciate it. – Amiri Apr 24 '17 at 10:46
  • No, it is not ok with branches since they create a tight coupling between the benchmarking code and the code to be clocked. Depending on the rest of the code, your benchmarking code might get optimized differently, or affect the optimization of the code to be clocked. So you'll have a non-linear, non-deterministic execution time overhead caused by the benchmarking code itself. Regarding \, there is no reason why you can't use it. It just splits a line in two. – Lundin Apr 24 '17 at 11:25
  • The code is completely unreadable because it is 10 lines long, crammed onto a single line. But even without reading it, it is clear that you have made this *way* too complicated. Make one call to `RDTSC` at the beginning, before the code you want to time, and save that value. Then, at the end of the code sequence you're timing, make a second call to `RDTSC`. Subtract the two values, and you have the result. No loops or branches required, which means your timing isn't disrupted. And the code is substantially more readable. Note that `RDTSC` requires a Pentium or later, so won't work on an 8088. – Cody Gray - on strike Apr 24 '17 at 11:28
  • @Lundin, thanks. I should be aware of the optimization issue. Regarding \, it's a lack of knowledge about \ on my part; I am studying it. – Amiri Apr 25 '17 at 04:37
  • BTW, to everyone: I am sure the first recorded time is not trustworthy; you should record more times and rely on the best one. However, when you call `RDTSC` two times back to back, the difference on my Skylake is 16 clocks. – Amiri Apr 25 '17 at 04:39
  • @Lundin, How should I correct this for branches? How should I measure the smallest time when I have some branches in my program and I should repeat to record the smallest time? – Amiri Apr 25 '17 at 05:21
  • @CodyGray, it is complicated, but if it works correctly I can use it for all my kernels. For example, I want to run between 10 and 1000000 times to record the best time. Some kernels take too much time, so I want to run them fewer times; I can set `OVERAL_TIME` to a number and test it. It will help me change the iteration count automatically. It helps me not to wait longer than a specific limit and not to rely only on `NUM_LOOP`. – Amiri Apr 25 '17 at 05:54
  • `RDTSC` is a slow instruction. The 16 clock cycles between the two calls to `RDTSC` is the overhead of calling `RDTSC`. In a good benchmark, you'd time that overhead and subtract it from the total. I have no idea what you mean by "kernels". Kernels are certainly a thing (the core components of an operating system), but don't have any relevance here. – Cody Gray - on strike Apr 25 '17 at 09:02
  • @FackedDeveloper Like Cody Gray says, simply read the timer twice, before and after the measurement. If several measurements need to be done to increase reliability, then handle that separately. Essentially what you are doing, just skip the logic part during the measurement and simply store all results in an array. Something like this pseudo: `loop { start = timer(); /* code to benchmark */ end=timer(); result[i] = end - start; }`. Then _afterwards_ you can go look for best/worse measurement, mean or median etc etc. – Lundin Apr 25 '17 at 09:35
  • Benchmarking on hosted desktop computers is always gonna be sketchy though. These are no RTOS and there may be a context switch at any time. Multi-core and multi-threading also complicates it further. – Lundin Apr 25 '17 at 09:38
  • @CodyGray, "kernels": For example, in multimedia application, matrix multiplication could be a kernel, etc. – Amiri Apr 25 '17 at 09:55
  • @Lundin, I use perf to know about overheads such as context switches, etc. I want to repeat, so what do you think about using a macro which repeats my benchmark section before compiling? A macro like `#define REP_CODE(X) X X X X X X X X X X` where X is the timing and benchmarking code? BTW, thanks; I think storing to an array might be better than my solution (see the sketch after these comments). – Amiri Apr 25 '17 at 10:00
  • For timing: I used to work with `clock_gettime`, but it can only measure times of more than 1 nanosecond (as I mentioned in the question title). Because sometimes I want to benchmark code where both the scalar and the SIMDized versions finish in less than one nanosecond, I use `RDTSC`. I couldn't find any function working better than this intrinsic. Moreover, `clock_gettime` makes the upper AVX registers dirty, and I have to use a `zeroupper` after the first call, before the benchmarking call. Like this: `clock_gettime(CLOCK_MONOTONIC,&start);` `__asm__ __volatile__ ( "vzeroupper" : : : );` – Amiri Apr 25 '17 at 10:09
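Following up on the last few comments, here is a hedged sketch of the approach Cody Gray and Lundin describe: estimate the back-to-back `RDTSC` overhead, record raw samples into an array with no branches between the two reads, and only afterwards subtract the overhead and pick the minimum. The repetition count and dummy workload are made up for illustration:

#include <stdio.h>
#include <x86intrin.h>

#define NITER 10000

static long long samples[NITER];
static volatile int sink;            // keeps the dummy work from being optimized away

int main(void)
{
    // 1. Estimate the overhead of the measurement itself (two back-to-back reads).
    long long overhead = -1;
    for (int n = 0; n < NITER; n++) {
        long long t1 = _rdtsc();
        long long t2 = _rdtsc();
        if (overhead < 0 || t2 - t1 < overhead)
            overhead = t2 - t1;
    }

    // 2. Record raw samples only; no branches or bookkeeping between the two reads.
    for (int n = 0; n < NITER; n++) {
        long long t1 = _rdtsc();
        sink = sink + 1;             // <-- code under test goes here
        long long t2 = _rdtsc();
        samples[n] = t2 - t1;
    }

    // 3. Analyze afterwards: subtract the overhead and take the minimum.
    long long best = samples[0];
    for (int n = 1; n < NITER; n++)
        if (samples[n] < best)
            best = samples[n];

    long long corrected = best > overhead ? best - overhead : 0;
    printf("overhead = %lld, best = %lld, best - overhead = %lld cycles\n",
           overhead, best, corrected);
    return 0;
}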