How can I count how many clock cycles it takes for the rdtsc instruction to execute?

Question

I know that rdtsc returns its 64-bit result (an unsigned long long) in edx:eax, but how can I find out how many clock cycles a single rdtsc instruction takes to execute?

EDIT: Does something like this work?

.globl rdtsc
rdtsc:
rdtsc
movl %eax, %ecx
movl %edx, %ebx
rdtsc
subl %ecx, %eax
subl %ebx, %edx
ret

If this is a problem for you, then you aren't benchmarking your code properly. You need to run enough iterations so that the overhead of `rdtsc()` is negligible. — Mysticial, Nov 07 '12 at 01:56
The overhead of `rdtsc` has already been measured. See http://instlatx64.atw.hu/ — harold, Nov 07 '12 at 08:32

Olof Forshell · Answer 1 · 2013-02-08T10:47:08.233

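Your code looks essentially correct, although the high half should be subtracted with sbbl rather than subl so that a borrow from the low half is picked up. You should also run it several times and use the shortest value that comes up.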
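
As a rough illustration of "run it several times and use the shortest value", here is a minimal C sketch. It assumes GCC or Clang on x86, where `__rdtsc()` is provided by `<x86intrin.h>`; the function names `min_consecutive_delta` and `sample_tsc` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>          /* __rdtsc() on GCC and Clang (assumption) */
#define HAVE_RDTSC 1
#endif

/* Smallest difference between consecutive counter samples.
   Needs at least two samples. */
uint64_t min_consecutive_delta(const uint64_t *samples, size_t n)
{
    assert(n >= 2);
    uint64_t best = UINT64_MAX;
    for (size_t i = 1; i < n; i++) {
        uint64_t d = samples[i] - samples[i - 1];
        if (d < best)
            best = d;
    }
    return best;
}

#ifdef HAVE_RDTSC
/* Read the time-stamp counter n times back to back into out[]. */
void sample_tsc(uint64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = __rdtsc();
}
#endif
```

On an x86 build, calling `sample_tsc(buf, 1000)` and then `min_consecutive_delta(buf, 1000)` gives a lower bound for one rdtsc plus the store and loop bookkeeping between two reads, so treat the number as an estimate rather than an exact instruction latency.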

I think the question should be restated: what is the overhead of using rdtsc to count the clock cycles that elapse during a code sequence? The counting code is then essentially (32-bit example):

rdtsc
mov dword ptr [mem64],eax
mov dword ptr [mem64+4],edx

; the code sequence to clock would go here when you're clocking it

rdtsc
sub eax,dword ptr [mem64]
sbb edx,dword ptr [mem64+4]    ; sbb (not sub) so that the borrow from the low half is propagated

With nothing between the two rdtsc instructions, the result in edx:eax is the practical "rdtsc overhead" you incur when timing a code sequence.
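
To see why the high half needs sbb rather than a second sub, the following C sketch mimics the two instructions on 32-bit halves. The `tsc_pair` type and both function names are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* A 64-bit counter value split the way rdtsc returns it:
   lo corresponds to eax, hi to edx. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} tsc_pair;

/* end - start, computed the way "sub" followed by "sbb" does it. */
tsc_pair sub_with_borrow(tsc_pair end, tsc_pair start)
{
    tsc_pair r;
    uint32_t borrow = end.lo < start.lo;   /* carry flag left by sub */
    r.lo = end.lo - start.lo;              /* sub: wraps modulo 2^32 */
    r.hi = end.hi - start.hi - borrow;     /* sbb: also subtracts the borrow */
    return r;
}

/* end - start with two plain "sub" instructions: the borrow is lost. */
tsc_pair sub_without_borrow(tsc_pair end, tsc_pair start)
{
    tsc_pair r;
    r.lo = end.lo - start.lo;
    r.hi = end.hi - start.hi;              /* wrong when the low half wrapped */
    return r;
}
```

With a start value of 0x1FFFFFFF0 and an end value of 0x200000010 the true difference is 0x20. The first function returns hi = 0, lo = 0x20, while the second leaves hi = 1.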

Once you have subtracted the rdtsc overhead, you need to factor in pipelining and whether overlapping processing has completed. My own rule of thumb is that if the timed sequence runs in fewer than perhaps 30 cycles, there may be uncompleted pipelining effects that need to be taken into account. If the sequence requires more than 100 cycles, there may still be such issues, but they can be ignored.

So what about between 30 and 100? It's definitely gray.

score 1 · Answer 2 · edited May 23 '17 at 12:18

You could execute rdtsc repeatedly and look at the difference between consecutive return values. Of course, you need to bear in mind things like context switches, which will cause massive spikes.

See "rdtsc, too many cycles" for a discussion.
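
One possible way to keep such spikes out of the picture is to flag every difference that is far above the smallest one observed. The sketch below is only an illustration: the function name `count_spikes` and the idea of a fixed multiplier `factor` are assumptions, not something the linked discussion prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the deltas that are more than `factor` times the smallest delta.
   deltas[] holds differences between consecutive counter reads. */
size_t count_spikes(const uint64_t *deltas, size_t n, uint64_t factor)
{
    assert(n > 0 && factor > 0);

    /* The smallest delta is taken as the undisturbed baseline. */
    uint64_t min = deltas[0];
    for (size_t i = 1; i < n; i++)
        if (deltas[i] < min)
            min = deltas[i];

    /* Anything far above the baseline is treated as a spike. */
    size_t spikes = 0;
    for (size_t i = 0; i < n; i++)
        if (deltas[i] > min * factor)
            spikes++;
    return spikes;
}
```

For the deltas 30, 30, 5000, 32 and a factor of 4, the baseline is 30 and only the 5000-cycle value counts as a spike. A suitable factor depends on the machine and has to be found by experiment.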

answered Nov 07 '12 at 01:57

NPE


Does something like this work? EDIT: Sorry I just put it in the main post – user1769152 Nov 07 '12 at 02:18
I'd go with `sbbl %ebx, %edx` to pick up the carry/borrow (if any) from the first `subl`. – Frank Kotler Nov 07 '12 at 04:34
In practice the two commands will never take 2^32 or more cycles to finish. The difference can be easily calculated with eax only. That also avoids the bug of subtraction without carry/borrow. – Aki Suihkonen Nov 07 '12 at 12:36
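
The eax-only approach from the last comment can be sketched in C as follows. The function name `delta_low32` is hypothetical, and the result is only meaningful under the stated assumption that fewer than 2^32 cycles elapsed.

```c
#include <stdint.h>

/* Elapsed cycles computed from the low halves (eax) only.
   Unsigned 32-bit subtraction wraps modulo 2^32, so the result is
   correct even when the low half overflowed between the two reads,
   as long as the true difference is below 2^32. */
uint32_t delta_low32(uint64_t start, uint64_t end)
{
    return (uint32_t)((uint32_t)end - (uint32_t)start);
}
```

For example, a start of 0x1FFFFFFF0 and an end of 0x200000010 yield 0x20 even though the low half wrapped around in between.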
