
Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).

The main thread will be doing important high-IPC, CPU-bound work. The helper thread should do nothing other than periodically update a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are too slow to sleep/wake on a 10 nanosecond (100 MHz) period.

So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.

I guess the idea would be a long-latency instruction that doesn't use many resources, like `pause`, and that also has a fixed and known latency. That would let us calibrate the "sleep" period so that no clock read is even needed: if we want to update with period P, we just issue P/L of these instructions for a calibrated busy-sleep. Unfortunately, `pause` doesn't meet that latter criterion, as its latency varies a lot1.
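To make the calibration idea concrete, here's a minimal sketch. It uses `sqrtsd` in a serial dependency chain as a hypothetical fixed-latency, single-uop "sleep unit" (whether it really qualifies is exactly what's in question), and measures its average cost in TSC reference cycles so a period P can later be approximated by issuing roughly P/L units with no clock read at all:

```c
#include <stdint.h>
#include <immintrin.h>   // _mm_set_sd, _mm_sqrt_sd, _mm_cvtsd_f64
#include <x86intrin.h>   // __rdtsc

static volatile double g_sink;  // defeats dead-code elimination of the chain

// One "sleep unit": a single-uop, long-latency sqrtsd in a dependency chain.
static double sleep_unit(double x)
{
    __m128d v = _mm_set_sd(x + 2.0);  // keep the input well away from 0/denormals
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));
}

// Measure the average cycles per unit, so the helper could "sleep" for a
// period P by issuing roughly P/L units, with no per-period clock read.
static uint64_t calibrate_unit_cycles(void)
{
    const int N = 100000;
    double x = 2.0;                  // fixed point: sqrt(2 + 2) == 2
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < N; i++)
        x = sleep_unit(x);           // serial chain: each unit waits for the last
    uint64_t t1 = __rdtsc();
    g_sink = x;
    return (t1 - t0) / N;
}
```

Note the TSC ticks at the reference frequency, not the core frequency, so the calibrated value drifts with turbo/throttling; that's one source of the variability discussed below.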

A second option would be to use a long-latency instruction even if its latency is unknown, and after every instruction do a `rdtsc` or some other clock read (`clock_gettime`, etc.) to see how long we actually slept. That seems like it might slow down the main thread a lot, though.
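A sketch of that second option, using `pause` as the (variable-latency) nap and a `rdtsc` after each one to decide when a full period has elapsed. The variable name `shared_ts` and the fixed iteration count are illustrative only:

```c
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _mm_pause

// Hypothetical shared timestamp, padded to its own cache line so the
// reader's speculative loads only ever conflict on this line.
static volatile uint64_t shared_ts __attribute__((aligned(64)));

// Nap on a long-latency, low-resource instruction, then read the clock
// to see how long we actually slept; publish when a period has elapsed.
static void helper_loop(uint64_t period_cycles, uint64_t iterations)
{
    uint64_t next = __rdtsc() + period_cycles;
    while (iterations--) {
        _mm_pause();                  // the "sleep" unit; latency varies by uarch
        uint64_t now = __rdtsc();     // how long did we actually sleep?
        if (now >= next) {
            shared_ts = now;          // a plain aligned 64-bit store is atomic on x86-64
            next = now + period_cycles;
        }
    }
}
```

The concern in the text is exactly the `rdtsc` in every iteration: it adds uops that compete with the sibling hyperthread even when no publish happens.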

Any better options?


1 Also, `pause` has some specific semantics around preventing speculative memory accesses, which may or may not be beneficial in this sibling-thread scenario, since I'm not really in a spin-wait loop.

BeeOnRope
  • Seems like a lot of effort for little or no value. Your main thread needs to read the timestamp - why bother with a "helper thread"? Just read the CPU time or whatever you need in your main thread when/where you actually need it... It's not supposed to be slower than reading it from "a shared timestamp"... – Amit Aug 07 '17 at 22:47
  • @Amit - why don't _you_ tell me how long a single L1-resident read from the main thread takes versus say `clock_gettime`? Do you think the former is twice as fast? Five times? I guess I appreciate your apparent concern that I might waste my effort, but I'm OK with experimentation. – BeeOnRope Aug 07 '17 at 22:57
  • Square roots maybe? Single µop, huge latency, but would burn some power and affect square roots done by the main thread (does it do any?). And I wonder how intentionally mispredicted branches would act. – harold Aug 07 '17 at 23:36
  • @harold - thanks! I had scrolled through Agner's instruction timings doc and the "heavy" x87 FP instructions like `fdiv` and sqrt definitely stuck out as having long latency with only one uop (versus for example integer division which has a ton of uops). It wasn't clear to me if this blocks the port (p0 on Skylake) the whole time or only the FP unit (or some sub-part) as you suggest? I'm not going to be doing any FP on the main thread - but integer AVX is possible. Great suggestion though, I'll test it. – BeeOnRope Aug 07 '17 at 23:41
  • I'm reasonably sure that it's only the FU that is blocked (if it's not pipelined) and the port is only used for issue/writeback. So e.g. a bunch of `movd`s is only slowed down *slightly* by the occasional `sqrtsd`. – harold Aug 08 '17 at 00:04
  • Experimentation is fine, and just having fun playing around is great, but for practical reasons as well as portability and compatibility - it's something I'd avoid. To answer your question, I have no idea what the timings are, but I'm willing to bet (quite heavily) it's negligible compared to anything else you do in your CPU-intensive task. If you have a tight inner loop, wrap it with a slower loop (1 : 10000 for example) and do your lookup there. That's my 2 cents anyway... – Amit Aug 08 '17 at 04:50
  • I'm even willing to bet the costs of synchronization of the shared variable between the two threads will be greater than reading the value in the first place. – spectras Aug 09 '17 at 11:38
  • @spectras - it could be, but it is different than my understanding (when it comes to hyperthreaded siblings). It's important enough that I've asked a [separate question here](https://stackoverflow.com/q/45602699/149138) exactly covering that topic. – BeeOnRope Aug 10 '17 at 00:38
  • As it turns out, this idea [isn't too crazy](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/preprint.pdf) at all - although the approach there is somewhat inverted (the helper thread reads the counters and notes the IP of the primary thread, rather than my approach of having the helper thread write counters that the main thread reads). – BeeOnRope Aug 10 '17 at 22:35
  • Is the sleep period supposed to be constant-time or constant *clock cycles*? Because the max turbo frequency depends on a lot of things other than just what this core is doing; e.g. how many other cores are active, as well as thermal and power conditions. – Peter Cordes Aug 13 '17 at 04:10
  • Either is OK and both are useful in slightly different scenarios (i.e., measuring cycles vs real time), but ultimately one can be converted into the other, since the sleep doesn't necessarily have to relate to the underlying metric. For example, perhaps you want to measure real time, but the sleep happens to be a fixed number of cycles: when you wake from sleep, you just issue a `rdtsc` and write that as your "real time" value. The reverse case works similarly (with a `rdpmc` call, for example, to read cycles). @PeterCordes – BeeOnRope Aug 13 '17 at 04:13
  • However, in the case that the "sleep" unit happens to line up with the underlying metric you want to measure on the primary thread **and** it has a predictable latency, you can perhaps avoid the `rdtsc` or `rdpmc` or whatever call to read the underlying metric entirely and just calculate it directly. For example, if you know `fsqrt` always takes 33 cycles, and you want to measure `cycles`, you just increment the cycles variable by 33 (or 33 + N for the increment) in between each `fsqrt`. So that's a useful feature, but the _main_ purpose of the sleep is to be friendly to the other core. – BeeOnRope Aug 13 '17 at 04:14
  • It would be interesting to play with FP denormals for mul or div, or x87 slowdowns with NaN/Infinity. Those have *very* high latency, but IDK if the "FP assist" generated to handle that is a lot of microcode uops, or if it's done inside the execution unit. Probably microcode, which makes it not particularly useful for a low-HT-interference thread. A `movnti` store / reload would use up memory resources instead of ALU. That's definitely going to be variable latency so only usable with `rdtsc` (not calibration), but will sleep for ~500 cycles, so it's pretty light-weight. – Peter Cordes Aug 13 '17 at 04:15
  • `mfence` is 3 uops, 33c throughput on Haswell. Maybe even more friendly than spinning on `pause` on pre-Skylake. `rdrand` is 16 uops, one per ~460c throughput on Skylake. – Peter Cordes Aug 13 '17 at 04:18
  • Actually, with competition from the "main" thread, I don't think you could just calibrate a loop unless you're willing to accept some significant variability. You'd want to unroll it several times to reduce uops. `pause` is 4 uops for a 2 byte instruction, so it will easily fill up the uop cache lines for an aligned loop with much unrolling. That's maybe ok, as long as there's no penalty for one thread running from the legacy decoders while the other runs from the uop cache. I assume not. – Peter Cordes Aug 13 '17 at 04:25
  • @PeterCordes - significant variability is more or less OK, especially if it is uncorrelated with what the main thread is doing. Imagine it is used for low-overhead method timing: variability is mostly OK if uncorrelated. It seems like at least some types of instructions wouldn't suffer too much competition at least in some scenarios (e.g., floating point instructions used to sleep while the main code is running integer code)... but something like Skylake `pause` really seems close to ideal. Pre-Skylake it's less clear. – BeeOnRope Aug 13 '17 at 04:34

1 Answer


Some random musings on the subject.

So you want to have a timestamp updated at a 100 MHz sample rate; that means that on a 4 GHz CPU you have 40 cycles between each update.

The timer thread busily reads the real time clock (`rdtsc`), but can't use the serializing form with `cpuid`, as that takes ~100 cycles. Plain `rdtsc` has a latency of around 25 cycles (and a throughput of one per ~25 cycles); there is a slightly newer, slightly more strongly ordered variant (`rdtscp`) with slightly more latency (32 cycles).

  uint64_t last = __rdtsc();
  for (;;) {
      uint64_t time = __rdtsc();    // read time (~25 cycles)
      if (time - last < PERIOD)     // tmp = time - last (1 cycle)
          continue;                 // not a full sample period yet
      last += PERIOD;               // cycles between samples
      sample = time;                // publish the timestamp
  }

In a perfect world the branch predictor will guess right every time; in reality it will mispredict randomly, adding 5-14 cycles to the loop's ~26 cycles, due to variance in the `rdtsc` read time.

When the sample is written, the other thread will have its speculative loads from this cache line cancelled (remember to align the sample position to 64 bytes so no other data is affected), and the load of the sample timestamp starts over after a delay of ~5-14 cycles, depending on where the instructions come from: the loop buffer, the micro-op cache, or the I-cache.

So a minimum of 5-14 cycles out of each 40-cycle sample period will be lost, in addition to half the core's resources being used by the other thread.

On the other hand reading the real time clock in the main thread would cost ...

~1/4 cycle of throughput; the latency will most likely be hidden by other instructions. But then you can't vary the frequency, and the long latency of 25 cycles could be a problem unless some other long-latency instructions precede it.

Using an atomic exchange or CAS instruction (`xchg`, or `lock cmpxchg`) to publish the sample might partly solve the problem, as the loads then shouldn't cause a re-issue of the instructions, but instead result in a delay on all following reads and writes.
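In C11 atomics, that locked-publish idea could be sketched as below; `atomic_exchange` on an x86-64 `uint64_t` compiles to `xchg`, which is implicitly locked (the names `sample`, `publish`, and `read_sample` are illustrative):

```c
#include <stdint.h>
#include <stdatomic.h>

// The shared sample, padded to its own cache line.
static _Atomic uint64_t sample __attribute__((aligned(64)));

// Publish with an atomic exchange rather than a plain store; on x86-64
// this is an `xchg`, i.e. an implicitly locked read-modify-write.
static inline void publish(uint64_t ts)
{
    atomic_exchange_explicit(&sample, ts, memory_order_relaxed);
}

// The main thread's read side: an ordinary relaxed load.
static inline uint64_t read_sample(void)
{
    return atomic_load_explicit(&sample, memory_order_relaxed);
}
```

Whether the locked write actually behaves better for the sibling-thread reader than a plain store is the open question; a plain relaxed store (`atomic_store_explicit`) is the obvious baseline to compare against.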

Surt
  • Hi Surt. What are you referring to by "the old real time clock"? It would be very nice if there was any hardware clock with a latency of 6 cycles, or are you referring to something like `clock()` that is updated by the OS? About `rdtsc` - the `cpuid` probably isn't strictly necessary here: it's mostly useful when you are issuing that instruction "inline" with the code being measured, in order to prevent reordering into/out of the timed region, but if you are just issuing a chain of `rdtsc` on a helper thread I expect to get reasonable values (and such could be tested). – BeeOnRope Aug 17 '17 at 22:41
  • There are 2 versions of the `rdtsc` instruction. – Surt Aug 18 '17 at 05:48
  • Yes, `rdtsc` and the newer `rdtscp`. The latter includes more synchronization and also reads the value of an MSR at the same time (usually this has the CPU number). – BeeOnRope Aug 18 '17 at 13:59
  • ... but none of them take close to 6 cycles (at least back-to-back) on most modern x86 archs (a few exotic ones like some Xeon Phi models excluded). – BeeOnRope Aug 18 '17 at 17:39
  • I see my memory was a bit dated :) with 25 and 32 cycles respectively on an Intel Skylake (according to Agner Fog). – Surt Aug 18 '17 at 17:52