
I have a C program which creates two threads (apart from main), T1 and T2. T1 executes a function which issues an operation O1 and T2 executes a function which issues an operation O2.

#include <pthread.h>
#include <stdio.h>

int var;    /* shared by f1 and f2, so it cannot be a local in main */

void* f1(void *arg) {
    O1();
    var = 0;
    return NULL;
}
void* f2(void *arg) {
    O2();
    var = 1;
    return NULL;
}
int main(int argc, char **argv){
    pthread_t t1, t2;

    pthread_create(&t1, NULL, &f1, NULL);
    pthread_create(&t2, NULL, &f2, NULL);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("var = %d\n", var);

    return 0;
}

t1 and t2 each get assigned to different physical cores. The objective of this program is to check which operation was faster by inspecting the value of var after both threads have finished executing. This would require that O1() and O2() run at the exact same time (or with a very slight, tolerable difference on the order of a few cycles) in parallel on both cores. How can I go about ensuring this?

Edit: Based on Peter Cordes' suggestion, I've modified f1() and f2() to read the timestamp counter so that O1() and O2() start executing in sync.

void* f1() {
    t1 = rdtsc();
    while(t1 != 0){
        t1 = rdtsc();
    }   
    printf("t1 = %d\n", t1);
    O1();
    var = 0;
}
void* f2() {
    t2 = rdtsc();
    while(t2 != 0){
        t2 = rdtsc();
    }   
    printf("t2 = %d\n", t2);
    O2();
    var = 1;
}

However, t2 gets printed to the console much later than t1. I guess this suggests that rdtsc has wrapped around to 0 in f2(), so this does not result in synchronized execution of O1() and O2(). Thread barriers didn't offer the granularity of synchronization I require.

  • *The objective of this program is to check which operation was faster by inspecting the value of var after both the threads have finished executing.* - I hope O1 and O2 take much longer than the out-of-order exec window size, and the inter-core latency for an RFO (Read For Ownership) for a writer to get control of the cache line so it can write. Seems like it would be more reliable to record a timestamp with `rdtsc` after each piece of work, assuming your TSC is synced across cores, or that you record a start time for each one. – Peter Cordes Jul 23 '22 at 22:23
  • (I'm guessing you're on x86; if not, other ISAs might or might not have a high-precision timer you can read.) – Peter Cordes Jul 23 '22 at 22:31
  • @PeterCordes please check the edit – Sathvik Swaminathan Sep 09 '22 at 18:01
  • Your code waits for up to 2^64 clock ticks? Or is `t1` a narrow type that truncates the TSC to 16 or 32 bits? Anyway, `rdtsc` throughput is about one per 25 core clock cycles on modern Intel (https://uops.info), so you're very likely to wrap the low half without seeing the low half be exactly zero. And this can happen differently in different threads, so there's a large chance that your two threads run at different times. – Peter Cordes Sep 10 '22 at 01:24
  • If you spin on the TSC *wrapping* (`(uint32_t)tsc_current < (uint32_t)tsc_previous`), that might work, although it's still somewhat coarse. Just *recording* TSCs, not spin-waiting, and not using a store to determine the winner, would make more sense, like I said. – Peter Cordes Sep 10 '22 at 01:25
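
For reference, a minimal sketch (not from the question) of what the comments above suggest: spin until the low 32 bits of the TSC wrap around rather than waiting for an exact reading of zero, and simply record a serialized timestamp before and after each operation instead of racing two stores to var. It assumes an x86 machine with the TSC synchronized across cores and GCC/Clang for <x86intrin.h>; wait_for_tsc_wrap and tsc_now are illustrative helper names, and the f1-style fragment is meant to slot into the question's thread functions.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() */

/* Spin until the low 32 bits of the TSC wrap past zero, instead of waiting
   for a sample that is exactly zero (which a thread can easily miss). */
static void wait_for_tsc_wrap(void) {
    uint32_t prev = (uint32_t)__rdtsc();
    for (;;) {
        uint32_t cur = (uint32_t)__rdtsc();
        if (cur < prev)          /* low half wrapped */
            break;
        prev = cur;
    }
}

/* lfence; rdtsc so the sample doesn't reorder with the surrounding work. */
static uint64_t tsc_now(void) {
    _mm_lfence();
    return __rdtsc();
}

void* f1(void *arg) {
    wait_for_tsc_wrap();                       /* coarse start alignment */
    uint64_t start = tsc_now();
    O1();                                      /* the asker's operation */
    uint64_t end = tsc_now();
    printf("O1: start=%llu end=%llu\n",
           (unsigned long long)start, (unsigned long long)end);
    return NULL;
}

Comparing the recorded start/end values of the two threads afterwards avoids using the final value of var to decide the winner.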

2 Answers


f1 and f2 will certainly be called with a small delay in practice on most platforms, but the delay depends on the hardware, the operating system (OS) and especially its scheduler. Theoretically, it is not possible to guarantee that the two functions always start at the same time on all platforms. Indeed, the OS scheduler is free to schedule the threads on the same core, and even if you bind threads to cores, a thread can be interrupted at any time (e.g. by a higher-priority task). Furthermore, core clocks are not strongly synchronized on most modern processors.

That being said, a barrier is clearly sufficient in practice to make the functions run at approximately the same time (with a granularity close to a few microseconds on most systems, possibly even less). Pthreads provides such a feature (see pthread_barrier_init and pthread_barrier_wait for example). Note that a spin-wait might be needed for better precision (typically 1-10 ns, possibly slightly less depending on the hardware). AFAIK it is not possible to synchronize threads with a precision better than several dozen cycles on x86 processors. This is because modern processors run instructions in a parallel, out-of-order way through a quite long and complex pipeline, and any inter-core synchronization is particularly slow (typically because of the long physical path, the cache-coherence protocol, and fundamental physical limits).
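
As a rough illustration (not part of the original answer), here is a minimal sketch of the barrier-plus-spin-wait idea: a pthread barrier makes sure both workers are running, then they spin on a C11 atomic flag that the main thread flips once, so a single store releases both of them. The flag name go and the 3-party barrier are illustrative choices, not something from the question.

#define _POSIX_C_SOURCE 200112L   /* pthread_barrier_t with strict compilers */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_barrier_t ready;
static atomic_int go = 0;

static void *f1(void *arg) {
    pthread_barrier_wait(&ready);                  /* wait until everyone is running */
    while (!atomic_load_explicit(&go, memory_order_acquire))
        ;                                          /* spin-wait for the release flag */
    /* O1(); var = 0;  -- the asker's work would go here */
    return NULL;
}

static void *f2(void *arg) {
    pthread_barrier_wait(&ready);
    while (!atomic_load_explicit(&go, memory_order_acquire))
        ;
    /* O2(); var = 1; */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_barrier_init(&ready, NULL, 3);         /* two workers + main */
    pthread_create(&t1, NULL, f1, NULL);
    pthread_create(&t2, NULL, f2, NULL);

    pthread_barrier_wait(&ready);                  /* both workers are now spinning (or about to) */
    atomic_store_explicit(&go, 1, memory_order_release);  /* one store releases both threads */

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_barrier_destroy(&ready);
    return 0;
}

Because both spinning threads are released by the same store, the remaining skew is roughly the difference in inter-core propagation of that one cache line, which is about as tight as this approach can get.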

Jérôme Richard
  • *core clocks are not strongly synchronized on most modern processors* - They are on Intel *client* chips (mainstream laptop/desktop, not many-core used in Xeons). Not sure about AMD or Apple or other ARM. Any non-sleeping core runs at the same frequency on my i7-6700k for example. Not that this timing mechanism (of seeing which store commits from the store buffer to L1d cache second) seems like a good idea, but if you want two functions to be competing with each other (for memory bandwidth) while you time them both with `rdtsc`, yeah a barrier could make sense. – Peter Cordes Jul 23 '22 at 22:27
  • When you say a spin-*lock*, what do you mean? I'd have gone with having both threads spin-*wait* for the same variable to change value, in hopes that their share-requests would be satisfied from the same store. (Without a `pause` in the loop, so they're probably checking at the same time. Both will probably need to recover from a memory-order mis-speculation, but that's probably about the same on each.) – Peter Cordes Jul 23 '22 at 22:30
  • @PeterCordes Interesting, on my i5-9600KF I see cores running at different frequencies: 3700 MHz for most cores and 1-2 currently running at 800 MHz. AFAIK, the core frequency is a multiple of a base clock frequency (100 MHz on my machine) on Intel processors (at least recent client processors). This base clock is a constant but it can be changed in the BIOS. – Jérôme Richard Jul 23 '22 at 23:17
  • @PeterCordes For the spin-lock, this is exactly the kind of thing I also had in mind, though the name may not be great. "spin-wait" is indeed better (fixed). Both threads read an atomic flag and the main thread can just flip the state of the flag after having spawned the threads (this assumes at least 3 cores are available on the machine to work properly). I expect the threads not to be woken at exactly the same time because of differing bus latency (especially with a ring bus), but I do not know a lot about this. – Jérôme Richard Jul 23 '22 at 23:25
  • Yeah, I'd expect some difference in observation time due to ring-bus hops. IDK if the same message could get received by multiple cores or if they'd each do a separate read from L3 after write-back. If the TSC is synced across cores, you could even check on that by putting an `lfence; rdtsc` after the spin-wait barrier, and recording the results each thread got. – Peter Cordes Jul 23 '22 at 23:33
  • Re: clock sync across more cores: certainly possible that they changed something in CPUs with more cores than Skylake-client. But remember that clock frequency can change over very short intervals, so `grep MHz /proc/cpuinfo` can show different numbers. But if an infinite loop is running on any core, they're *all* 3900 or 4000 or 4200 MHz or whatever, depending on EPP settings and how many cores are busy. I think I remember seeing some kind of confirmation that client chips are (or used to be) designed without independent P-states for each core, so it's not just from my own observation. – Peter Cordes Jul 23 '22 at 23:36
  • @PeterCordes I tried running a sequential computation and a few cores run in turbo @ 4450-4600 MHz (different frequencies that are not a multiple of the 100 MHz base) while the other ones are all at 3700 MHz. More precisely, the one doing the computation is always in turbo and 1-2 others are constantly switched (never the same ones). Thus, it indeed shows that the nominal frequency is the same for all cores, but clocks in turbo are clearly not synchronized unless the frequency oscillates very quickly like PWM for screens. Interestingly, perf shows the other cores do nearly nothing. – Jérôme Richard Jul 24 '22 at 00:06
  • Ok, interesting, that sounds like evidence of independent clocks. (And come to think of it, the default EPP settings on my system never clock above 3900, even though nominal is 4000, and turbo is 4200. So maybe my past observations weren't looking at turbo.) Also, I think /proc/cpuinfo must show some kind of average, or measure on the fly every time you read it, maybe by timing some loop. On my near-idle i7-6700k w. EPP=performance, I just measured 4000 on the first 2 cores, 4160.897 on one, 145.204 on another, next 6 logical at 4000.000. Perhaps it caught it mid switch. – Peter Cordes Jul 24 '22 at 01:19
  • IIRC I have my BIOS set for a 1-2 core turbo limit of 4200, 3-4 (physical) core limit of 4000. The core measured at 4160.897 was the 3rd physical core on the 4c8t system, but the first core had probably gone to sleep again by the time it was running there... – Peter Cordes Jul 24 '22 at 01:21

The most accurate way of establishing whether O1() or O2() is faster would be to benchmark each of them separately. There are very accurate ways to measure elapsed execution time, and running O1() a number of times and then O2() a number of times, recording the start/stop times, will give an accurate average answer. The more runs are included in the average, the more accurate the result will be, and the more certain one can be of the standard deviation of the result.
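
A rough sketch of such a benchmark (not from the original answer), assuming O1() and O2() take no arguments and return nothing; the stubs, NRUNS and bench_ns are illustrative names:

#define _POSIX_C_SOURCE 199309L   /* clock_gettime */
#include <stdio.h>
#include <time.h>

/* Stand-ins for the asker's operations; replace with the real O1()/O2(). */
static void O1(void) { /* ... */ }
static void O2(void) { /* ... */ }

#define NRUNS 1000

static double bench_ns(void (*op)(void)) {
    struct timespec start, stop;
    long long total = 0;
    for (int i = 0; i < NRUNS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        op();
        clock_gettime(CLOCK_MONOTONIC, &stop);
        total += (stop.tv_sec - start.tv_sec) * 1000000000LL
               + (stop.tv_nsec - start.tv_nsec);
    }
    return (double)total / NRUNS;            /* mean nanoseconds per call */
}

int main(void) {
    printf("O1: %.1f ns/call\n", bench_ns(O1));
    printf("O2: %.1f ns/call\n", bench_ns(O2));
    return 0;
}

For short operations, warm-up effects and compiler optimization across the repeat loop matter, as the comment below points out.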

Relying on the OS to somehow start threads instantaneously will not be as good. There is no guarantee that the OS will even resume main() after the first thread is created; some OSes let the newly created thread run for a while instead of its creator, just to see whether it completes quickly (which some threads do).

bazza
  • Benchmarking short-ish functions is non-trivial, see [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for possible gotchas involving warm-up effects, page faults, and compiler optimization (you need it on, but you also need to stop it from optimizing across a repeat loop, hoisting or sinking work out of it.) – Peter Cordes Aug 02 '22 at 22:34