
See the code below: AsyncTask creates a peer thread (timer) that increments an atomic variable and sleeps for a while. The expected output is to print counter_ 10 times, with values ranging from 1 to 10, but the actual result is strange:

  • It seems that the actual result is random: sometimes counter_ is printed only once, sometimes it's not printed at all.
  • Further, I found that when I changed the thread sleep times (in both the peer thread and the main thread) to seconds or milliseconds, the program worked as expected.
#include <atomic>
#include <thread>
#include <iostream>

class AtomicTest {
 public:
  int AsyncTask() {
    std::thread timer([this](){
      while (not stop_.load(std::memory_order_acquire)) {
        counter_.fetch_add(1, std::memory_order_relaxed);
        std::cout << "counter = " << counter_ << std::endl;
        std::this_thread::sleep_for(std::chrono::microseconds(1)); // both milliseconds and seconds work well
      }
    });
    timer.detach();

    std::this_thread::sleep_for(std::chrono::microseconds(10));
    stop_.store(true, std::memory_order_release);
    return 0;
  }

 private:
  std::atomic<int> counter_{0};
  std::atomic<bool> stop_{false};
};

int main(void) {
  AtomicTest test;
  test.AsyncTask();
  return 0;
}

I know that thread switching also takes time; is it because the thread sleep time is too short?

My program's running environment:

  • Apple clang version 14.0.0 (clang-1400.0.29.202)
  • Target: arm64-apple-darwin22.2.0

1 Answer


Yes, it's easily plausible that stop_.store could run before the new thread has even been scheduled onto a CPU core, or soon after. So the new thread's first test reads the stop flag as true.

10 us is shorter than typical OS process-scheduling timeslices (often 1 or 10 ms), in case that's relevant, and it's only a couple of orders of magnitude higher than the inter-core latency for an atomic store to become visible.

The results you describe are exactly what I'd expect for a timing-dependent program like this, which is effectively written to detect which thread wins the race and by how much (with its slow << endl and sleep inside the writing thread).

I definitely wouldn't expect it to always print 10 times, and it would be rare for that to ever happen, since thread startup overhead is a significant fraction of the 1 us sleep interval inside the printing thread.
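For example, here's a minimal sketch (a rewrite under my own assumptions, not your original code) of one way to get exactly 10 prints deterministically: have the worker loop a fixed number of times and join it, instead of racing a detached thread against a 10 us sleep.

#include <atomic>
#include <iostream>
#include <thread>

int main() {
  std::atomic<int> counter{0};
  std::thread timer([&counter]() {
    for (int i = 0; i < 10; ++i) {   // fixed iteration count instead of a stop flag
      counter.fetch_add(1, std::memory_order_relaxed);
      std::cout << "counter = " << counter << std::endl;
    }
  });
  timer.join();                      // wait for the worker; no sleep-based race
  return 0;
}

With the race removed, neither thread needs to sleep at all.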


BTW, your question was originally titled "A question about incrementing atomic variables?". But counter_ is only ever accessed from one thread. It's probably in the same cache line as the stop flag, but without contention from the main thread it's basically trivial, a very fast operation.

It's irrelevant to what you're doing; it could be a local non-atomic int inside the thread's lambda and you'd see the same timing effects. The significant things here are cout << endl, which forces a flush of the stream (and thus a system call) even if you redirected output to a file, and the this_thread::sleep_for().
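A sketch of what I mean, assuming the same overall structure as your program: only the stop flag crosses threads, so the counter can be a plain local int and the timing behaviour is unchanged.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
  std::atomic<bool> stop{false};
  std::thread timer([&stop]() {
    int counter = 0;                 // local, non-atomic: no other thread ever reads it
    while (!stop.load(std::memory_order_acquire)) {
      std::cout << "counter = " << ++counter << std::endl;  // the flush / write syscall dominates
      std::this_thread::sleep_for(std::chrono::microseconds(1));
    }
  });
  std::this_thread::sleep_for(std::chrono::microseconds(10));
  stop.store(true, std::memory_order_release);
  timer.join();                      // joined here only so main doesn't exit mid-print
  return 0;
}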

If the write system call was to a terminal (not redirect to a file), it might even block while the terminal emulator drew on the screen, although for only a couple small writes there's probably a big enough buffer somewhere (probably inside the kernel) to absorb it.

An atomic increment probably takes a few nanoseconds, and being relaxed it's something AArch64 can handle very efficiently, overlapping much of that time with surrounding code. (Modern x86 can do an atomic increment about one per 20 clock cycles at best, and that includes a full memory barrier. I expect Apple M1 to handle it more cheaply when it doesn't need to be a barrier.)
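As a rough illustration of why (typical codegen, not measured numbers): a relaxed fetch_add in isolation usually compiles to a single ldadd on AArch64 targets with LSE (ARMv8.1-A and later, which includes Apple M1), while on x86-64 the same source becomes a lock add / lock xadd, which is also a full memory barrier.

#include <atomic>

void bump(std::atomic<int>& c) {
  // AArch64 with LSE: typically a single "ldadd" instruction (no barrier semantics).
  // x86-64: "lock add" (or "lock xadd" if the old value is needed), which is a full barrier.
  c.fetch_add(1, std::memory_order_relaxed);
}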

  • Thank you for solving my confusion. About typical OS process-scheduling timeslices (often 1 or 10 ms), do you have any source for your data? – maxentroy May 17 '23 at 07:29
  • @maxentroy: Common defaults for Linux are/were HZ=100 or HZ=1000 (timer interrupt frequency), thus scheduling decision interval = 10 ms or 1 ms timeslices, the inverse of frequency. (Scheduling decisions can also be made on system calls.). I think other OSes typically make similar choices, since it's a good tradeoff between responsiveness / latency vs. throughput considerations of not spending too much time running the scheduler, and not losing too much throughput to cache misses after context switches. [How to know linux scheduler time slice?](https://stackoverflow.com/q/16401294) – Peter Cordes May 17 '23 at 08:00
  • Modern Linux can use a "tickless" config without a fixed timer interrupt, but it still has to set a hardware timer for when to pre-empt user-space on the current core. Also, I'm over-simplifying the concept of a timeslice as just the timer-interrupt interval; schedulers can look at things like whether a process put itself to sleep before using up a full timeslice, and if so guess that it's interactive or whatever, and give it a priority boost when it is ready to wake up. – Peter Cordes May 17 '23 at 08:01