0

I want to measure the speed in which my PC can increment a counter N times (e.g., for N = 10^9).

I tried the following code:

using namespace std
auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i)
{
}
auto end = chrono::steady_clock::now();

However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.

How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).

I use Windows 10 and Visual Studio 15.9.7.


A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)

R B
  • 543
  • 9
  • 25
  • 1
    Lookup instruction timing in the processor manual. Any other way is nonsense. And given caching of tight loops, the instruction timing will be off too. – Paul Ogilvie Jul 29 '19 at 13:58
  • @PaulOgilvie- this would probably not factor in the overheads of running it in VS? I'm trying to optimize the inner code of a loop and am wondering what is the "baseline" timing that running the loop itself take. – R B Jul 29 '19 at 14:00
  • As your loop is empty, perhaps your compiler is optimizing it out? – Fred Larson Jul 29 '19 at 14:02
  • It is hard to do this portably, but see this answer https://stackoverflow.com/a/11485388/576911 for how it works on one platform. Specifically see the `test_empty_loop` function. – Howard Hinnant Jul 29 '19 at 14:03
  • @FredLarson, yes, that's the problem. I am trying to optimize a very lightweight set of operations inside of the loop (not shown in the question's code) and was wondering how fast could it possibly get. So the best I could hope for is the time it takes to run `N` increments. – R B Jul 29 '19 at 14:03
  • Compiler will probably recognize that `volatile int i` in the given scope cannot be volatile an so back to square one. – Paul Ogilvie Jul 29 '19 at 14:04
  • 2
    I highly doubt there is any point in this exercise. If you want to optimize some *specific* calculation, your "baseline" should be the unoptimized code. And you go from there. – Eugene Sh. Jul 29 '19 at 14:05
  • Does it have to be a loop counter? Which platform? If the value is in a register, then for x86 ALU, that would be a latency of 1, throughput of 0.5. The time therefore depends only on the clock speed of the processor. – Wyck Jul 29 '19 at 14:05
  • @EugeneSh., my problem is that I'm not sure if there's a point in working on optimizing the code further, so I'm wondering what's the time it takes to complete the look without my code. – R B Jul 29 '19 at 14:06
  • If you don't do anything else in the loop, you can increment a variable 10^9 times by doing `i += 1e9`. It's *very* fast. – Wyck Jul 29 '19 at 14:11
  • you can force the compiler to not optimize the loop, and measure that, but what is that information good for? Instead you should put your real code and measure that. A loop that does nothing should take zero time and thats what you are measuring currently, any other outcome would be misleading. Measuring performance should always include compiler optimizations – 463035818_is_not_an_ai Jul 29 '19 at 14:12
  • @formerlyknownas_463035818, my problem is that for N=10^9 my code takes about 2 seconds. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec?) – R B Jul 29 '19 at 14:19
  • what about stopping and starting the timer inside the loop and adding up the times. In this way thecompiler cannot optimize away the loop. Though not sure how meaningful results you can get in that way. – 463035818_is_not_an_ai Jul 29 '19 at 14:20
  • the "meat" is most likely in the code you removed from the loop. I have a hard time to imagine that your loop body in the real code is so tight that incrementing an integer adds anything measurable to the runtime – 463035818_is_not_an_ai Jul 29 '19 at 14:21
  • @formerlyknownas_463035818, that's exactly what I'm trying to figure out. Is my CPU capable of running 10^9 iterations per second of an empty loop? – R B Jul 29 '19 at 14:25
  • Imho this question is based on a misunderstanding. There is no direct mapping of the instructions you write in your code to instructions of the cpu. Instead, sloppy speaking, the compiler reads the code, understands what it is supposed to do and only then generates cpu instructions. Forcing the compiler to waste cycles for nothing will not tell you much about where the cycles go in your real application. – 463035818_is_not_an_ai Jul 29 '19 at 14:28
  • anyhow if you really want to do that experiment, you could write a loop that actually does something, measure it as baseline, then add a counter to that loop and increment it in each iteration, measure again and compare. – 463035818_is_not_an_ai Jul 29 '19 at 14:29
  • `++i` is the best way to increment your counter, Not sure what it will take to convince you though. – Wyck Jul 29 '19 at 14:57
  • Since you're on Windows + Visual Studio you might be using CLI. Running native is much faster CLI. Disregard this comment, if that's not your case. – nada Jul 29 '19 at 15:27
  • 1
    I recommend looking up the increment instruction for your processor, then finding out how many clock cycles the instruction takes. Also look up the speed of a clock cycle. This should give you a rough, but more accurate, method than writing a high level program. – Thomas Matthews Jul 29 '19 at 16:17
  • "Is my CPU capable of running 10^9 iterations per second of an empty loop?". Probably yes. I'd expect an inc + jmp loop to run at about 1 or 2 cycles per iteration (but I'm not an expert). However, the compiler might unroll your loop, which would make this question irrelevant. The "meat" in your optimization task might be to manually unroll the loop so you can manually remove any dependencies of one loop iteration on the results of the previous iteration. Also, even without unrolling, if there are no dependencies between loop iterations, the CPU can execute multiple iterations in parallel. – Andrew Bainbridge Jul 30 '19 at 09:21

3 Answers3

4

This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your question, you do not define a constraint that the compiler must do a loop at all. Because the loop has no effect, the optimizer will remove it.

Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right because it defines too many constraints that limits the compiler's freedom to implement optimal code. Volatile often forces the compiler to write the result back to memory each time the variable is modified.

eg, this code:

void foo(int N) {
    for (volatile int i = 0; i < N; ++i)
    {
    }
}

Becomes this assembly (on an x64 compiler I tried):

        mov     DWORD PTR [rsp-4], 0
        mov     eax, DWORD PTR [rsp-4]
        cmp     edi, eax
        jle     .L1
.L3:
        mov     eax, DWORD PTR [rsp-4] # Read i from mem
        add     eax, 1                 # i++
        mov     DWORD PTR [rsp-4], eax # Write i to mem
        mov     eax, DWORD PTR [rsp-4] # Read it back again before
                                       # evaluating the loop condition.
        cmp     eax, edi               # Is i < N?
        jl      .L3                    # Jump back to L3 if not.
.L1:

It sounds like your real question is more like how fast is:

L1:    add     eax, 1
       jmp     L1

Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.

I recommend playing with Godbolt to understand more about what the compiler is doing. eg https://godbolt.org/z/59XUSu

Gabriel Staples
  • 36,492
  • 15
  • 194
  • 265
Andrew Bainbridge
  • 4,651
  • 3
  • 35
  • 50
  • 1
    Why not use the [increment instruction](https://c9x.me/x86/html/file_module_x86_id_140.html)? – Thomas Matthews Jul 29 '19 at 16:14
  • 1
    @ThomasMatthews there [are reasons](https://stackoverflow.com/a/36510865/555045), likely none of them directly apply to OPs CPU it is common practice to use add/sub anyway – harold Jul 29 '19 at 16:27
1

You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile("") but MSVC inline assembly has always been different and is disabled completely for 64bit programs.

It is possible to use MASM to side-step that restriction:

.MODEL FLAT
.CODE

_testfun PROC
    sub ecx, 1
    jnz _testfun
    ret
_testfun ENDP

END

Import it into your code with extern "C" void testfun(unsigned N);.

harold
  • 61,398
  • 6
  • 86
  • 164
  • Subtraction may have different clock cycles than addition. OP's issue was with incrementing. – Thomas Matthews Jul 29 '19 at 16:12
  • @ThomasMatthews I don't agree that OP is specifically asking about `inc`, the point as I understand it is the loop overhead in general and the code above is a common implementation of a counted loop. An other pattern is an `add/cmp/jcc` trio, which has slightly different performance characteristics but this experiment is insufficient to show that (the difference could be shown by adding more stuff to the loop) – harold Jul 29 '19 at 16:30
0

Try volatile int i = 0 In your for loop. The volatile keyword tells the compiler this variable could change at any time, due to outside events or threads, and therefore it can't make the same assumptions about what the variable might be in the future.

Gabriel Staples
  • 36,492
  • 15
  • 194
  • 265
  • Compiler will probably recognize that `volatile int i` in the given scope cannot be volatile. – Paul Ogilvie Jul 29 '19 at 14:05
  • 2
    It does make the compiler "actually do the loop", but unfortunately it also makes the loop slower than the speed OP intends to measure – harold Jul 29 '19 at 14:09
  • Thanks for the suggestion. Unfortunately, it doesn't seem to help. It does allow me to time the empty loop, but the time it takes is longer than with my code (without `volatile`). – R B Jul 29 '19 at 14:12
  • This shouldn't have been voted down. It is a reasonable suggestion. – Andrew Bainbridge Jul 29 '19 at 14:19
  • @PaulOgilvie Can it do it? Anything can be `volatile`, if not for production then for debugging. – Eugene Sh. Jul 29 '19 at 14:22
  • @EugeneSh., `for (volatile int i` is a var in the loop scope. It/its address is not known outside the loop, hence there is nothing that can volatily update it, so `volatile` can be ignored, in my opinion (unless its adress is passed to some function?) – Paul Ogilvie Jul 29 '19 at 14:26
  • @PaulOgilvie The debugger can. Sometimes you want to have some loop controlled by a `volatile` variable to be able to break in debugger and modify it to change the control flow. The compiler should respect that. – Eugene Sh. Jul 29 '19 at 14:28
  • @EugeneSh. I feel that has nothing to do with the compiler. And I have never heard of a loop written "to be controled by a volatile variable". If it is optimized out, then you cannot change it in the debugger. And changing would have no use because the variable does not influence any thing (hence optimized out). – Paul Ogilvie Jul 29 '19 at 14:32
  • @PaulOgilvie All major compilers are careful to respect `volatile` access, even when they can prove it doesn't matter. [The results are *hilarious*](https://godbolt.org/z/wB2_RU). – Sneftel Jul 29 '19 at 14:33