
Is it possible to get the exact number of CPU cycles taken by a piece of code in a C program?

I tried the C function clock and the assembly instruction rdtsc, but both gave only a very rough approximation, and even when repeating the code in a loop I couldn't get enough accuracy.

Below is the code I tried (unsuccessfully). For example, to get the number of cycles taken by an increment, I wanted to compute

clk("++foo") - clk("")

hoping to get "1".

#define __clk(x)    tmp=clock() ;\
x;\
return abs(tmp-clock());

inline int clk(char* x)
{
    __clk(x)
}

Do you know if there is a way to get what I want? I'm currently working in C on Debian, but I also have a Windows system if needed, and if a solution is only available in another language, that's not a problem.

Jongware
Maxime B.
  • Hi, I came across a similar question before that might help you solve your problem. You can find the post at this [link](http://stackoverflow.com/questions/5248915/execution-time-of-c-program). Hope it helps. –  Jul 31 '16 at 08:01
  • First define exactly what you mean by how long a piece of code takes in the context of OoOE and superscalar execution, because that doesn't really have a "natural meaning". This is especially important if you intend to measure really tiny pieces of code. – harold Jul 31 '16 at 08:03
  • The number of cycles is going to be different each time. What use is there in the exact number? – n. m. could be an AI Jul 31 '16 at 08:06
  • @harold I would run very short pieces of code (i.e. from 10 to 1000 cycles). – Maxime B. Jul 31 '16 at 08:18
  • @n.m If I run the program with a non-preemptive scheduler, I don't see why it would be different each time? – Maxime B. Jul 31 '16 at 08:18
  • @MaximeB. : the instructions can still be scheduled differently by the CPU, or the execution time might be different due to code/data being (or not being) in the cache. – Daniel Kamil Kozar Jul 31 '16 at 08:19
  • "i wanted to do clk("++foo") - clk("") hoping to find 1." You should stop hoping for that, because in C the order in which the operands of the `-` operator are evaluated is unspecified. – MikeCAT Jul 31 '16 at 08:23
  • @DanielKamilKozar That's why I'd like to have only the **user** time, if that's possible, obviously! – Maxime B. Jul 31 '16 at 08:23
  • Zillion reasons. Cache, paging, adaptive branch prediction... – n. m. could be an AI Jul 31 '16 at 08:25
  • @MikeCAT Well, look at my macro. I know it's far from perfect, but the order in which the operands of - are evaluated **outside** the macro shouldn't change anything. – Maxime B. Jul 31 '16 at 08:27
  • @Zillion And do you know if there is any way to disable it? – Maxime B. Jul 31 '16 at 08:28
  • `rdtsc` shouldn't be used alone since it can give wrong results because of instruction reordering. `rdtscp` is better if your CPU supports it, see e.g. http://stackoverflow.com/a/27697754/6600109 – Markus Laire Jul 31 '16 at 08:29
  • You can't disable how a modern processor fundamentally works. Even if you could, it would just give you a result that would normally be meaningless. If you intend to measure the latency of an `inc` instruction, you can do that, but not this way. – harold Jul 31 '16 at 08:30
  • @harold "If you intend to measure the latency of an inc instruction, you can do that, but not this way." So what is the right way to do such a thing? – Maxime B. Jul 31 '16 at 08:33
  • @MaximeB. prime the CPU with some useless work to get into turbo (or disable turbo), then approximately measure the time it takes to execute about a thousand *dependent* `inc`'s. Do this several times and take the shortest time. Multiply the time by the ratio of your turbo clock to your base clock (if you didn't disable turbo), then round. For example, I found that 1024 dependent `inc`'s take about 874 * (4.1/3.5) cycles on my Haswell processor, so I conclude `inc` has a latency of 1. – harold Jul 31 '16 at 08:42
  • @MaximeB. - Let's make a grand assumption that you could get the exact number of CPU cycles for a bit of code. What use would it be? – Ed Heal Jul 31 '16 at 08:46
  • @EdHeal The use of this would be to test the complexity of a piece of code in order to compare algorithms. I'd like to have something like 10n³ + 250n² + 1000n + O(n) rather than just O(n³)... I don't want to care about further machine-specific optimizations, just get something to see roughly the performance of any algorithm. – Maxime B. Jul 31 '16 at 09:16
  • You do not understand what complexity means. Please look up big O notation. It has nothing to do with actual time. It is a measurement of how the time grows with respect to the problem size. – Ed Heal Jul 31 '16 at 09:26
  • @EdHeal I know this... The goal of this project is to predict the behavior of big tasks without actually running them. It is related to complexity, even if that's not the exact word. – Maxime B. Jul 31 '16 at 09:49
  • But having these "numbers" on the micro level will tell you nothing about the macro level. – Ed Heal Jul 31 '16 at 10:00
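
The measurement harold describes in the comments (time a long chain of *dependent* increments several times and keep the shortest run) could be sketched roughly as follows. This is only an illustration under assumptions, not harold's actual code: the function names `dependent_increments`, `now_ns` and `shortest_chain_ns` are made up for this example, the empty `__asm__` statement is a GCC/Clang extension used to keep the compiler from folding the chain into one addition, and `clock_gettime(CLOCK_MONOTONIC)` stands in for whatever timer you prefer. The loop counter and branch add overhead of their own, so the result is an approximation, not an exact cycle count.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Run n increments in which each one depends on the result of the
   previous one. The empty asm statement (GCC/Clang extension) forces
   x to be materialised after every step, so the compiler cannot
   replace the whole loop with a single "x += n". */
uint64_t dependent_increments(uint64_t x, uint64_t n)
{
    for (uint64_t i = 0; i < n; i++) {
        x++;
        __asm__ __volatile__("" : "+r"(x));
    }
    return x;
}

/* Current monotonic time in nanoseconds. */
double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Time the chain `runs` times and return the shortest duration in
   nanoseconds. Slower runs are assumed to be disturbed by cache
   misses, interrupts or frequency changes, so only the minimum is
   kept. Returns -1.0 if runs <= 0. */
double shortest_chain_ns(uint64_t n, int runs)
{
    volatile uint64_t sink = 0; /* keeps the result observable */
    double best = -1.0;

    for (int r = 0; r < runs; r++) {
        double t0 = now_ns();
        sink = dependent_increments((uint64_t)r, n);
        double t1 = now_ns();
        if (best < 0.0 || t1 - t0 < best)
            best = t1 - t0;
    }
    (void)sink;
    return best;
}
```

To turn the result into cycles, divide `shortest_chain_ns(n, runs)` by `n` and multiply by the clock frequency in GHz that the core was actually running at during the measurement. That frequency is something you have to determine yourself (for example by disabling turbo), which is the same correction harold applies with his turbo/base ratio.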
