
I'm trying to make a sleep-like function for a simple kernel I wrote in C, which runs on a virtual machine. My approach is a loop where each iteration takes as close to 1 nanosecond as possible, so I decided to write it in assembly. My CPU is a Sandy Bridge and its clock rate is 2.7 GHz, so 1 clock cycle is 10/27 ns and 2.7 clock cycles are 1 ns. I looked up the throughput of the instructions I needed in the Sandy Bridge section of Agner Fog's instruction tables; here are the relevant entries (latency shown too, just in case):

| Instruction | Operands     | Latency | Reciprocal throughput |
|-------------|--------------|---------|-----------------------|
| NOP         |              |         | 0.25                  |
| ADD, SUB    | mem, reg/imm | 6       | 1                     |
| Cond. jump  | short/near   | 0       | 1-2                   |

Since jcc has a throughput of 1-2 clock cycles, I averaged it to 1.5. My function looks like this:

BITS 32

section .text
    global sleep_for

; void sleep_for(unsigned n)
sleep_for:
    nop                  ; 2.70 - 0.25 = +2.45
    sub DWORD [esp+4], 1 ; 2.45 - 1.00 = +1.45
    jnz short sleep_for  ; 1.45 - 1.50 = -0.05
    ret                  ; Each iteration = 2.75 (1.01851852ns)

The comments describe what I think should be happening, and the result seems to be close to 1ns at least. But when I try sleep_for(1000000000) (or sleep for 1s) in my C program, it ends up waiting for about 4.35s instead. Why is the function waiting for so much longer than it should?

Peter Cordes
mediocrevegetable1
  • _"Since jcc has a throughput of 1-2 clock cycles, I averaged it to 1.5"_ If it's 2 when the jump is taken, wouldn't your average be very close to 2? (except when the loop count is very small). – Michael Apr 22 '21 at 18:42
  • @Michael true, I hadn't thought about that, but even then it would take 3.25 clock cycles per run max, which is 1.2037037ns. Still doesn't seem like that would cause a delay of ~4.35 seconds. – mediocrevegetable1 Apr 22 '21 at 18:48
  • Are you sure the clock stays at 2.7 GHz throughout the test? Modern CPUs and OSes will vary the clock rate dynamically according to system load, temperature, and many other factors. Of course, the duration will also increase if your process is scheduled out. – Nate Eldredge Apr 22 '21 at 19:11
  • @NateEldredge I hadn't thought about that, I just read the system info, which said the clock rate was 2.7GHz. Any OS-related thing doesn't really come into play (I think) as I'm running this on a very basic kernel I've made (hardly a kernel yet, I just want to make some basic functions first). I print something to video memory before the sleep function begins and then print something again at the end. I *am* testing this on a virtual machine, not sure if that's really relevant. – mediocrevegetable1 Apr 22 '21 at 19:18
  • Perhaps the instruction `RDTSC` might be useful for you: https://www.felixcloutier.com/x86/rdtsc – vitsoft Apr 22 '21 at 19:37
  • @mediocrevegetable1: So then the duration will increase if the *hypervisor* is scheduled out by the host OS. The host OS, as well as the hardware itself, would also be responsible for adjusting the CPU clock speed, so you won't have much control over that. Anyway, probably best to get away from delay loops as a timing technique as quickly as you can. It's not really a good approach on anything but an embedded system. – Nate Eldredge Apr 22 '21 at 20:57
  • I do not understand your calculations. Also note that due to store forwarding, weird things may happen with the addition. Consider operating on registers only. – fuz Apr 22 '21 at 21:18
  • @NateEldredge I just looked that up, that's definitely an issue. As for alternatives, I looked some up and they involved setting up a timer event or use the `alarm` POSIX function in a system where it exists, so it seems like for my kernel I'm going to have to create some driver or another to make this work, meaning I might just have to leave this for now then... – mediocrevegetable1 Apr 23 '21 at 03:27
  • @fuz just tried storing it in `ecx` before entering the loop, it *did* run at about 4.25 seconds this time, so perhaps a bit of an improvement over using memory directly. I originally tried to avoid using registers because for some reason, the entry of **"ADD SUB(reg, reg/imm)"** had nothing in the "throughput" column, so I wasn't sure. – mediocrevegetable1 Apr 23 '21 at 03:36
  • @vitsoft are you saying that I could use the time-stamp counter as a way to stop when the time finishes? That could possibly be an alternative. – mediocrevegetable1 Apr 23 '21 at 03:41
  • The loop simply bottlenecks on store-forwarding + sub latency, not throughput of anything. As @fuz said, adding up or averaging throughput of different instructions makes no sense; those are either from back-end port limits or front-end width. If you're already running 4 NOPs / clock, you can't also fit in a macro-fused sub/JCC. Conversely, a NOP won't slow down a 2 uop loop that bottlenecks on the 1 iter/clock limit for tiny loops (or on latency), it will just take an extra slot in the front-end. – Peter Cordes Apr 23 '21 at 04:07
  • This loop should run at about 1 iteration per 6 clock cycles, bottlenecked on store-forwarding latency (~5c) + sub (1c). RDTSC counts reference cycles, not core clock cycles, so it's not useful to find what actual frequency you ran. 6e9 / 4.32 ~= 1.39GHz estimated CPU frequency if this loop ran in 4.32 seconds, @NateEldredge. (Unless it ran slower somehow, maybe from the stack being misaligned? I don't think code alignment could slow it down any more than the store/reload bottleneck though; 6 cycles is plenty to get this loop into the pipeline.) – Peter Cordes Apr 23 '21 at 04:17
  • BTW, for `sub mem, imm` *throughput* to be relevant, you'd need to be running it with enough different pointers to avoid a latency bottleneck. e.g. looping over an array, so multiple load/sub/store operations could all be in flight at once *on different data*. Also in that case, note that 6 cycles is more or less worst-case for SnB-family. It can be better if you don't reload right away: [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) – Peter Cordes Apr 23 '21 at 04:20
  • @PeterCordes I see. So I guess my whole thought process is flawed then :\ it seems that I can't be accurate enough with a loop then due to all of this, so I'm probably going to look for other options for a sleep function. Thanks for the help. – mediocrevegetable1 Apr 23 '21 at 04:47
  • Did you see the linked duplicates? [How to calculate time for an asm delay loop on x86 linux?](https://stackoverflow.com/q/49924102) has a delay loop that spins on an RDTSC deadline. Of course that's very wasteful of power for long sleeps; better to at least use `hlt` in a loop and only check for the TSC being close when you're interrupted. (Or if 10 millisecond precision for long-duration sleeps is ok, just use the coarse timing of a timer interrupt, assuming a 100Hz timer like Linux historically used.) Ideally do low-power sleeps with `mwait` instead of `hlt` to save much more power. – Peter Cordes Apr 23 '21 at 04:56
  • But yeah, 8086 / microcontroller style dead-reckoning delay loops with a calculated count are a bad idea for anything except the most simple cases (e.g. wait for *at least* a few nanoseconds, maybe longer, between I/O operations, where it's ok to wait 5x or 10x as long if the CPU happens to be at idle frequency, not max turbo, or if you handle an interrupt during the delay). So yes, you'll want a different strategy, even if you calibrate for CPU frequency. (Fun fact: Linux's bogomips *is* a delay-loop calibration factor still used for cases where the delay's so short it's not worth sleeping). – Peter Cordes Apr 23 '21 at 04:59
  • Of course, as well as timer interrupts, I think you can program the HPET to deliver an interrupt at a specific time, so you can still go into a deep sleep and get woken up at a certain time if nothing else has woken you earlier, even in a "tickless" kernel that avoids a regular timer interrupt. (You may want to google on some of those keywords, like "tickless" kernel, and read up on strategies to handle timing in mainstream OSes like current Linux vs. older Linux, and existing toy / example OSes that might just use a timer interrupt. Then pick one that sounds fun for you.) – Peter Cordes Apr 23 '21 at 05:05
  • @PeterCordes trying to process the `rdtsc` example you've written in the duplicate right now. I'll check out those terms too, sound helpful. – mediocrevegetable1 Apr 23 '21 at 05:16
  • @PeterCordes I just tried the code in the dupe target and (after a few changes to make it abide by the 32-bit calling convention instead and changing 3200000000 to 270000000 to match my clock rate) it worked! Thanks a lot for the help. – mediocrevegetable1 Apr 23 '21 at 07:25

0 Answers