For benchmarking purposes, I'm using `rdtsc` to measure how much pseudo-time I spend executing some chunks of code inside a critical loop. Since my thread can be migrated to another CPU at any moment, I would like to reduce the noise by discarding a sample whenever I find that I changed CPU between the start and the stop of the micro-measurement.

Is there an x86 instruction I could use to identify on which CPU/core I'm running? Something that would give me either a unique identifier, or a CPU# and a core#, etc.

Apparently, cpuid doesn't provide the information anymore, so I'm looking for an alternative.

Yoric
  • 2
    Do you use your own OS? Because both Windows and Linux have the API for doing what you need: `GetCurrentProcessorNumber` and `sched_getcpu` respectively. –  Jul 20 '15 at 14:02
  • 1
    You should just pin your thread to a single core. See also [this question](http://stackoverflow.com/questions/22310028/is-there-an-x86-instruction-to-tell-which-core-the-instruction-is-being-run-on). – Jester Jul 20 '15 at 14:10
  • If your CPU provided that information, it would do so through `cpuid`. But it doesn't, for security reasons. – fuz Jul 20 '15 at 14:24
  • 1
    This program employs Intel's MP Initialization Protocol to awaken any auxiliary processors that may be present and allows each processor to display its APIC Local-ID. [link](http://www.cs.usfca.edu/~cruse/cs630/mphello.s) – Dirk Wolfgang Glomp Jul 21 '15 at 08:54
  • @Jester I can't do that in practice, otherwise my users will hate me :) – Yoric Jul 21 '15 at 13:17
  • @knm241 Ah, I wasn't aware of these, thanks. Any idea how fast they are? Under high activity, my loop runs ~10k iterations per frame, so I'd hate to slow things down just to check that my benchmark numbers are correct. – Yoric Jul 21 '15 at 13:19
  • You could use `numactl ...` with the appropriate options to bind a process to a specific core when it is started. The exact options depend on the configuration you want; see the [man page](http://linux.die.net/man/8/numactl). – Matt Jul 21 '15 at 14:36

1 Answer

Not really. You don't really want a separate instruction, since your thread could be migrated to another core immediately after executing it and before you could do anything useful with its result (or between the RDTSC and the "what core am I on?" check). RDTSCP conveniently avoids this, because a single instruction both delivers the TSC and returns whatever value the OS stored in the TSC_AUX MSR. It does require OS support, though, and it waits for all earlier instructions to execute before reading the counter. It is not a fully serialising instruction, but it is still more heavyweight than RDTSC, which may affect fine-grained timing measurements.
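To make the RDTSCP idea concrete, here is a minimal sketch in C, assuming GCC or Clang on x86 (for `<cpuid.h>` and the `__rdtscp` intrinsic). The names `tsc_sample`, `have_rdtscp`, `sample_tsc` and `elapsed_same_cpu` are made up for illustration. What ends up in the auxiliary value is entirely up to the OS; as far as I know, Linux stores the CPU number in the low 12 bits, but treat that as an assumption and check your own platform.

```c
#include <stdbool.h>
#include <stdint.h>
#include <cpuid.h>      /* __get_cpuid (GCC/Clang) */
#include <x86intrin.h>  /* __rdtscp */

/* One timestamp plus the TSC_AUX value read by the same instruction. */
typedef struct {
    uint64_t tsc;
    uint32_t aux;  /* OS-defined contents; assumed to identify the CPU */
} tsc_sample;

/* CPUID leaf 0x80000001, EDX bit 27 reports RDTSCP support. */
static bool have_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;
    return ((edx >> 27) & 1u) != 0;
}

/* Read the TSC and TSC_AUX together; only call this if have_rdtscp(). */
static tsc_sample sample_tsc(void)
{
    tsc_sample s;
    unsigned int aux;
    s.tsc = __rdtscp(&aux);
    s.aux = aux;
    return s;
}

/* Store the elapsed ticks and return true only when both samples carry
 * the same TSC_AUX value; otherwise the measurement should be dropped. */
static bool elapsed_same_cpu(tsc_sample start, tsc_sample stop,
                             uint64_t *ticks)
{
    if (start.aux != stop.aux)
        return false;
    *ticks = stop.tsc - start.tsc;
    return true;
}
```

Take one sample before the measured block and one after, and keep the tick count only when `elapsed_same_cpu` returns true. Matching values do not prove that nothing happened in between (the thread could have moved away and come back), so this only catches measurements that start and end on different CPUs.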

As someone noted in the comments, if the precision of every measurement is important to you, you probably need to use an OS API to pin the thread to ensure it doesn't migrate. Alternatively, migration will be relatively rare, so if you can take enough measurements, then the occasional outliers that it causes will be obvious and easy to exclude.
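For the pinning option, a minimal Linux-only sketch could look like the following. It assumes glibc's `sched_setaffinity` and the `CPU_ZERO`/`CPU_SET` macros, which need `_GNU_SOURCE`; the helper name `pin_to_cpu` is hypothetical. Windows has its own affinity API, which is not shown here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure with errno set. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof set, &set);
}
```

Calling `pin_to_cpu(sched_getcpu())` once before the benchmark loop keeps the thread wherever it currently runs. The call can still fail if the process is restricted to a CPU set that excludes the requested core, so the return value is worth checking.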

ab.
  • I am quite aware of the possibility of migrating between a call to `RDTSC` and a hypothetical `RDWHEREAMI`. I don't want a strong guarantee that my final call to `RDTSC` is on the same CPU as my first call, but I would like to be able to measure how many times this isn't the case (which should be roughly covered by `RDWHEREAMI`), so that I can adjust my statistics (or at least be reassured). – Yoric Aug 29 '15 at 06:14
  • @Yoric, well I stand by my assertion that if you only care about performance statistics, and you are measuring a small-enough block of code to make RDTSC important, then migrations should be rare and obvious outliers, much like context switches. In any case, identifying the processor ID is OS-specific. If your OS sets up the TSC_AUX MSR appropriately, then RDTSCP can help. Otherwise, I know that `GetCurrentProcessorNumber()` is cheap on Windows (it avoids a syscall). I'm not sure about the Linux equivalent. – ab. Aug 30 '15 at 20:09
  • Actually, I'm measuring a pretty large block of code that just happens to run a few thousand to a few million times per second, and I started using `RDTSC` only because my previous attempts were clearly slowing down this critical loop. I _think_ that the code is not migrated often, but a cross-platform, no-syscall way to measure it would give me real-world confirmation. In the meantime, I use `GetCurrentProcessorNumber()` and `sched_getcpu` (and I haven't found anything equivalent for mach/MacOS X). – Yoric Aug 31 '15 at 21:24
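
For reference, the two OS calls mentioned in this thread can be wrapped roughly as follows to count how often a measurement starts and ends on different CPUs. This is only a sketch: `current_cpu`, `migration_stats` and `record_sample` are hypothetical names, and the fallback branch simply reports "unknown", since no equivalent for other platforms came up in the discussion.

```c
#if defined(__linux__) && !defined(_GNU_SOURCE)
#define _GNU_SOURCE   /* must precede the first system header for sched_getcpu */
#endif

#if defined(_WIN32)
#include <windows.h>
static int current_cpu(void) { return (int)GetCurrentProcessorNumber(); }
#elif defined(__linux__)
#include <sched.h>
static int current_cpu(void) { return sched_getcpu(); }  /* -1 on error */
#else
static int current_cpu(void) { return -1; }  /* unknown on this platform */
#endif

/* Counters for adjusting the statistics afterwards. */
typedef struct {
    unsigned long same_cpu;  /* start and stop on the same CPU */
    unsigned long migrated;  /* start and stop on different CPUs */
    unsigned long unknown;   /* the CPU could not be determined */
} migration_stats;

/* Classify one measurement from the CPU ids seen before and after it. */
static void record_sample(migration_stats *st, int cpu_before, int cpu_after)
{
    if (cpu_before < 0 || cpu_after < 0)
        st->unknown++;
    else if (cpu_before != cpu_after)
        st->migrated++;
    else
        st->same_cpu++;
}
```

Call `current_cpu()` right before the first `RDTSC` and right after the last one, then pass both values to `record_sample`. The ratio of `migrated` to the total gives a rough lower bound on how often migrations affect the numbers.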