10

I adopted the following timer code, which I found online, to measure SSE performance.

#ifndef __TIMER_H__
#define __TIMER_H__

#pragma warning (push)
#pragma warning (disable : 4035)    // disable no return value warning

__forceinline  unsigned int GetPentiumTimer()
{
    __asm
    {
        xor   eax,eax             // VC won't realize that eax is modified w/out this
                                  //   instruction to modify the val.
                                  //   Problem shows up in release mode builds
        _emit 0x0F                // Pentium high-freq counter to edx;eax
        _emit 0x31                // only care about low 32 bits in eax

        xor   edx,edx             // so VC gets that edx is modified
    }
}

#pragma warning (pop)

#endif

I ran the measurement on my Pentium D E2200 CPU and it works as expected: it shows that aligned SSE instructions are faster. But on my i3 CPU, the unaligned instructions come out faster in 70% of the tests.

Do you think this clock-tick measurement is unsuitable for an i3 CPU?

Bart
CppLearner
  • I'm pretty sure VC supports the `RDTSC` instruction in inline asm. Also why don't you care about the upper 32-bits, and you should be using `__declspec(naked)` or even better return a value in a more proper way. Besides I'd want to use [`QueryPerformanceCounter`](http://msdn.microsoft.com/en-us/library/windows/desktop/ms644904\(v=vs.85\).aspx) or similar functions instead (noting the problems with frequency scaling / multi-core processors etc.). – user786653 Nov 28 '11 at 18:31
  • RDTSC is *not* a serializing instruction, meaning it can/will be executed out of order. If you insist on using it directly, you usually want to use CPUID to force serialization (it's one of the few serializing instructions you can execute in user mode). – Jerry Coffin Nov 28 '11 at 18:32
  • I have QueryPerformanceCounter too. It isn't very reliable according to the results. For nxn matrices multiplication, n = 10000 or higher, time takes only 0.3 seconds? I don't think that's accurate at all (on console it takes more than 2 seconds to see the results), so I turn to the clock ticks. I am going to try RDTSC now. Thanks. – CppLearner Nov 28 '11 at 19:01
  • I'd also recommend `QueryPerformanceCounter` assuming this is windows platform. – AJG85 Nov 28 '11 at 19:04
  • If you want to use raw `rdtsc`, do it with the `__rdtsc()` intrinsic. [Get CPU cycle count?](https://stackoverflow.com/a/51907627) – Peter Cordes Aug 19 '18 at 10:03

4 Answers

4

QueryPerformanceCounter (on Windows at least) is definitely much better than inline assembly. I can't see any reason to prefer inline assembly over that function, especially since Visual Studio does not support inline assembly when compiling for x64.
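As a rough illustration of the QueryPerformanceCounter approach, here is a minimal sketch. The helper name `seconds_for` is hypothetical, and the `std::chrono::steady_clock` branch is an added assumption so the sketch also compiles on non-Windows systems; it is not part of the answer above.

```cpp
#include <chrono>
#ifdef _WIN32
#include <windows.h>   // QueryPerformanceCounter, QueryPerformanceFrequency
#endif

// Hypothetical helper: runs `work` once and returns the elapsed wall time in seconds.
template <typename F>
double seconds_for(F&& work) {
#ifdef _WIN32
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);    // counter ticks per second
    QueryPerformanceCounter(&start);
    work();
    QueryPerformanceCounter(&stop);
    // Subtract the raw tick counts first, then convert to seconds.
    return static_cast<double>(stop.QuadPart - start.QuadPart) /
           static_cast<double>(freq.QuadPart);
#else
    // Assumption: portable fallback, not mentioned in the original answer.
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
#endif
}
```

A call might look like `double s = seconds_for([&] { multiply(a, b, c); });`, where `multiply` stands in for whatever code is being timed.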

AshleysBrain
2

As others have noted, you should use QueryPerformanceCounter.

But if you really want to use assembly, the best option is probably the __rdtsc intrinsic.

If you don't want to use the intrinsic, then this would be the best approach:

unsigned __int64 __declspec(naked) GetPentiumTimer() {
    __asm {
        rdtsc
        ret
    }
}

As far as I know, Visual C++ refuses to inline any function that uses inline assembly anyway. Declaring the function __declspec(naked) tells the compiler not to generate a prologue or epilogue, so the value that rdtsc leaves in edx:eax is returned as-is.

But the intrinsic is still the best option: the compiler knows which registers are used and can inline the call properly.
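For comparison, a minimal sketch of the intrinsic version might look like the following. The names `read_tsc` and `elapsed_ticks` are hypothetical, and the header selection assumes an x86 target built with MSVC, GCC, or Clang.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc on MSVC
#else
#include <x86intrin.h>   // __rdtsc on GCC/Clang (assumes an x86 target)
#endif

// Returns the full 64-bit time-stamp counter instead of only the low 32 bits.
inline std::uint64_t read_tsc() {
    return __rdtsc();
}

// Hypothetical helper: ticks between two readings. Unsigned subtraction
// stays correct even if the counter wraps once between the readings.
inline std::uint64_t elapsed_ticks(std::uint64_t start, std::uint64_t stop) {
    return stop - start;
}
```

Because this is an intrinsic rather than an `__asm` block, the same source should build for both 32-bit and 64-bit targets.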

rubenvb
ConfusedSushi
  • No, MSVC can inline functions that use `__asm`, if you don't make them `naked`. But definitely use the `__rdtsc` intrinsic; it's portable across 32 / 64-bit, and to gcc/clang/ICC. [Get CPU cycle count?](//stackoverflow.com/a/51907627) – Peter Cordes Aug 18 '18 at 16:37
2

0F 31, which is the RDTSC instruction, may still be useful for measuring the performance of short pieces of code, even on i3 CPUs. If the effects of task switching and of the thread migrating to a different core do not bother you, it is OK to use RDTSC. In many cases you get more precise results by forcing serialization with CPUID.
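One way to sketch the serialization idea is shown below. Note the substitution: this sketch uses LFENCE (via `_mm_lfence`) instead of CPUID, since Intel documents LFENCE before RDTSC as a way to make sure earlier instructions have completed; the behaviour on other vendors' CPUs is not assumed here. The helper names `fenced_tsc` and `min_ticks` are hypothetical, and taking the minimum over several runs is an added assumption meant to reduce noise from task switches.

```cpp
#include <cstdint>
#include <emmintrin.h>   // _mm_lfence (SSE2)
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc on MSVC
#else
#include <x86intrin.h>   // __rdtsc on GCC/Clang
#endif

// Reads the TSC with a fence on each side, so the reading is not reordered
// with the code being measured. Assumption: LFENCE is used in place of CPUID.
inline std::uint64_t fenced_tsc() {
    _mm_lfence();
    std::uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

// Hypothetical helper: runs `work` `runs` times and keeps the smallest
// tick count, which is the run least disturbed by interrupts or task switches.
template <typename F>
std::uint64_t min_ticks(F&& work, int runs = 100) {
    std::uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; ++i) {
        std::uint64_t start = fenced_tsc();
        work();
        std::uint64_t stop = fenced_tsc();
        if (stop - start < best) best = stop - start;
    }
    return best;
}
```

The result includes the overhead of the fences and of RDTSC itself, so it is most useful for comparing two variants of the same code rather than as an absolute cycle count.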

As for your measurements, it is quite possible that misaligned SSE appears to run faster on the i3. Recent Intel processors (the Nehalem and Sandy Bridge architectures) handle misaligned memory operands very efficiently. Misaligned accesses will never actually outperform aligned ones, but if other factors influence performance in your tests, the aligned instructions may seem slower.
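To make the comparison concrete, the two kinds of load could be exercised with something like the sketch below. The functions `sum_aligned` and `sum_unaligned` are hypothetical stand-ins for the code under test in the question, which is not shown there; both assume `n` is a multiple of 4.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // SSE: _mm_load_ps, _mm_loadu_ps, _mm_add_ps

// Sums n floats using aligned loads; p must be 16-byte aligned.
inline float sum_aligned(const float* p, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));    // aligned load
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Same sum using unaligned loads; p may have any float alignment.
inline float sum_unaligned(const float* p, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));   // unaligned load
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Timing both functions on the same 16-byte-aligned buffer, and then `sum_unaligned` on a pointer offset by one float, helps separate the cost of the instruction from the cost of the data actually being misaligned.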

Edit:

See http://www.agner.org/optimize/#testp. It is a good example of RDTSC instruction usage.

Evgeny Kluev
1

QueryPerformanceCounter() is the easiest way to get a high-frequency timer on Windows. However, it has a bit of overhead (about ½μs), since it is a system call. That can be a problem if you are timing very fast events or need very high precision.

If you need better than 250 nanosecond precision, you can use the rdtsc intrinsic to get the hardware counter directly. It's about 10ns of latency on my i7.

Crashworks
  • `rdtsc` has no inputs, so its latency would be from issue to when its output registers are ready, I guess. Only meaningful after a branch miss or other front-end stall, and hard to measure. Perhaps you meant throughput? – Peter Cordes Aug 18 '18 at 16:35