8

In my application, threads need to pause for a very short time (hundreds of clock cycles). One way to pause is to call nanosleep, but I assume that requires a system call into the kernel. I want to pause without entering the kernel.

Note that I have enough cores to run my threads on, and I bind each thread to a separate core, so even an instruction that halts the core for a short while would be fine. I am using x86. I just want the thread to halt while pausing. I don't want a busy loop or a system call to the kernel. Is it possible to do this? What is the minimum time I can pause a thread?

MetallicPriest

6 Answers

7

_mm_pause in a busy-wait loop is the way to go.

Unfortunately the delay it provides can change with each processor family:

http://siyobik.info/main/reference/instruction/PAUSE

Example usage for GCC on Linux:

#include <xmmintrin.h>

int main (void) {
    _mm_pause();
    return 0;
}

Compile it (PAUSE was introduced with SSE2, not MMX; -march=native enables whatever the build machine supports):

gcc -o moo moo.c  -march=native

Also you can always just use inline assembler:

__asm volatile ("pause" ::: "memory");

Some Intel engineers posted figures that you might find useful for estimating the cost of pausing:

NOP instruction can be between 0.4-0.5 clocks and PAUSE instruction can consume 38-40 clocks.

http://software.intel.com/en-us/forums/showthread.php?t=48371
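Since the delay differs between processor families, one option is to measure it on the target machine rather than rely on published figures. The following is only a rough sketch: the function name avg_pause_ticks is made up for illustration, it assumes GCC or Clang on x86 (for x86intrin.h), and it reports TSC ticks, which count at a fixed reference rate and are not necessarily core clock cycles.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc and _mm_pause on GCC/Clang (assumed toolchain) */

/* Hypothetical helper: average cost of one PAUSE, in TSC ticks.
 * The result includes loop overhead, so treat it as an upper estimate. */
static double avg_pause_ticks(uint32_t n)
{
    if (n == 0) {
        return 0.0;
    }
    const uint64_t start = __rdtsc();
    for (uint32_t i = 0; i < n; ++i) {
        _mm_pause();  /* spin-wait hint; never enters the kernel */
    }
    const uint64_t elapsed = __rdtsc() - start;
    return (double)elapsed / (double)n;
}
```

Calling avg_pause_ticks(100000) once at startup gives a per-machine figure that can then be used to pick how many PAUSE iterations approximate the desired delay.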

Steve-o
  • How to use _mm_pause in gcc? Which header file is required? – MetallicPriest Sep 10 '11 at 16:05
  • Steve-o, thanks, but what is the purpose of ::: "memory" here? Is it memory fence or something? If so, wouldn't it increase cache coherence load? – MetallicPriest Sep 10 '11 at 16:47
  • @MetallicPriest it is to make sure the pause runs in the correct place. – Steve-o Sep 10 '11 at 16:58
  • `nop` doesn't stall out-of-order execution, or even need any back-end execution resources. `nop` throughput is 4 per clock on everything since Core2 (except Atom). https://agner.org/optimize/. Anyway, `nop` doesn't "take cycles" any more than `add` does, it consumes front-end bandwidth and instruction-cache space. **Anyway, `pause` on Sandybridge-family is about 5 cycles until Skylake, where it's about 100 cycles. Are your numbers from Pentium4 or something?** – Peter Cordes Aug 18 '18 at 11:59
1

Why don't you just spin-wait yourself? You can, in a loop, repeatedly call the rdtsc instruction to get the clock cycle count and then just stop if the difference exceeds 100 clock cycles.
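A minimal sketch of that rdtsc loop, assuming GCC or Clang on x86. The name spin_ticks is invented here, and note that on current CPUs the TSC counts at a constant reference rate, so "100 ticks" is only approximately 100 core clock cycles.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc on GCC/Clang (assumed toolchain) */

/* Hypothetical helper: busy-wait until at least `ticks` TSC ticks have passed.
 * Returns the number of ticks actually spent, which can be larger than
 * requested (for example if an interrupt arrives during the loop). */
static uint64_t spin_ticks(uint64_t ticks)
{
    const uint64_t start = __rdtsc();
    uint64_t now;
    do {
        now = __rdtsc();
    } while (now - start < ticks);  /* unsigned difference of the two readings */
    return now - start;
}
```

For example, spin_ticks(100) returns once at least 100 ticks have elapsed; the return value shows how far the wait overshot.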

I presume this is for a trading system, where this is a common technique.

Foo Bah
0

It depends on what you mean by pause. If by pause you want to stop the thread for a short period of time, only the OS can do this.

However, if by pause you want a very short delay, you can do this with a busy loop. The problem with using such a loop is you don't know how long it is really running for. You can estimate it, but an interrupt can make it longer.
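To see how much the duration of such a loop varies in practice, one could time the same loop many times and compare the shortest and longest runs. This is an illustrative sketch only: spin_stats and measure_spin_jitter are invented names, and it assumes GCC or Clang on x86 with a usable TSC.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc on GCC/Clang (assumed toolchain) */

/* Hypothetical result type: shortest and longest observed run, in TSC ticks. */
typedef struct {
    uint64_t min_ticks;
    uint64_t max_ticks;
} spin_stats;

/* Hypothetical helper: time `runs` repetitions of a busy loop of
 * `spin_iters` iterations and record the fastest and slowest. */
static spin_stats measure_spin_jitter(uint32_t spin_iters, uint32_t runs)
{
    spin_stats s = { UINT64_MAX, 0 };
    for (uint32_t r = 0; r < runs; ++r) {
        const uint64_t start = __rdtsc();
        /* volatile counter keeps the compiler from deleting the empty loop */
        for (volatile uint32_t i = 0; i < spin_iters; ++i) {
        }
        const uint64_t elapsed = __rdtsc() - start;
        if (elapsed < s.min_ticks) s.min_ticks = elapsed;
        if (elapsed > s.max_ticks) s.max_ticks = elapsed;
    }
    return s;
}
```

A large gap between min_ticks and max_ticks for the same spin_iters suggests that interrupts or other disturbances stretched some of the runs.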

Peter Lawrey
0

Generally speaking, for such a short delay, I would think a system call is not practical, because the overhead of system call + scheduling + context switch and back again is going to be way longer than your pause, as you seem to already be aware.

What you are left with is to spin (busy wait) to produce the delay. You can loop reading TSC values to know how long you have spun, for example (or the applicable cycle counter register on other processors).

Yes, spinning like this indeed wastes power, and if you are running on a CPU with multiple hardware threads, as multi core usually implies, you are also taking execution slots from the other threads needlessly, but unless you have a very very very low overhead system call and scheduler mechanism AND a high res timer, I'd say it's not possible.

gby
-1

No, it is not possible. Either call sleep or select (both go through the kernel), or have a loop that wastes time.

Ed Heal
-2

Your only hope of getting timing that precise is using timer_create and having the timer expiration delivered by a signal, I think. With realtime scheduling.

I'm not sure what it could possibly be useful for, though, since you cannot perform any I/O in such a small time window, and therefore it should not matter whether your code runs 1000 times with a 100ns gap between each run, or 1000 times all together with a 100µs gap at the end.

R.. GitHub STOP HELPING ICE