
I'm on a notebook with an Apple M1, which has 4 cores. I have a problem that can be solved in parallel by dynamically spawning millions of threads. Since pthread_create has a huge overhead, I'm avoiding it by instead leaving 3 threads in the background. These threads wait for tasks to arrive:

void *worker(void *arg) {
  u64 tid = (u64)arg;

  // Loops until the main thread signals work is complete
  while (!atomic_load(&done)) {

    // If there is a new task for this worker...
    if (atomic_load(&workers[tid].stat) == WORKER_NEW_TASK) {

      // Execute it
      execute_task(&workers[tid]);

    }

  }

  return 0;
}

These threads are spawned with pthread_create once:

pthread_create(&workers[tid].thread, NULL, &worker, (void*)tid)

Any time I need a new task to be done, instead of calling pthread_create again, I just select an idle worker and send the task to it:

workers[tid].task = ...                   // write the payload first,
workers[tid].stat = WORKER_NEW_TASK       // then flip the flag

The problem is: for some reason, leaving these 3 threads in the background makes my main thread 25% slower. Since my CPU has 4 cores, I expected these 3 threads not to affect the main thread at all.

Why are the background threads slowing down the main thread? Am I doing anything wrong? Is the while (!atomic_load(&done)) loop consuming a lot of CPU power?

MaiaVictor
  • You're assuming the scheduler will set processor affinity on your four threads to exclusively live on selected cores. The scheduler can/will push activity to whatever cores it wants, and core switches ain't cheap (nor free). And as far as doing anything wrong, unless you have a *very* good reason to spin-loop on atomic flags, you would probably be better off using a cvar+mutex pair managing a work crew servicing your queue. – WhozCraig Jan 01 '22 at 23:11
  • I expected downvotes on this question (since I really couldn't make it much better given my present knowledge), but I'm still glad I asked it. @WhozCraig thanks for the hint, I'll look into that direction. "unless you have a very good reason to spin-loop on atomic flags, you would probably be better off using a cvar+mutex " - That's not the kind of knowledge you can easily find on the internet, really takes expertise. Isn't that what SO is for? – MaiaVictor Jan 01 '22 at 23:19
  • Multi-threaded programming is indeed an "expertise" thing, to be sure, especially in C. A student text on POSIX threads will have a good reference on how cvar+mutex work crew systems can be implemented. "Programming with POSIX Threads" by Butenhof was the text I learned it from 22 years ago (and I still have that text on my shelf in the library). I'm sure there are plenty of online examples as well; I just prefer flipping pages rather than clicking mice buttons. – WhozCraig Jan 01 '22 at 23:27
  • @WhozCraig would you like to help me on this task? I can pay for your assistance in answering these kinds of questions that I might have. I don't need any major dedication, just taking a time to point me in the right direction when these questions come up would be very useful, and I can pay whatever your hourly rate is. – MaiaVictor Jan 01 '22 at 23:36
  • Well, Ohio State is currently getting creamed by Utah, so I'm a little pre-occupied right now. I'll see if I can find some examples online. It is, after all, New Year's Day. If you really wanted to just get what you have semi-running you could always try jostling the [core affinity](https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html) when you configure your threads. As I said, personally the concept of spin-waiting just rubs me the wrong way, but there are times when it is needed (lockless paradigms, for example). – WhozCraig Jan 01 '22 at 23:42
  • @WhozCraig okay, I expected you'd be busy. Thanks anyway! All I need is a lightweight way to spawn a thread to do a task, but with `pthread_create` being so heavy and with my spinlock-based pool being conceptually wrong, I'm out of ideas. I'll be reading these references over the next hours, I hope the right solution clicks eventually. – MaiaVictor Jan 01 '22 at 23:45
  • @MaiaVictor Have your threads wait for work to do in a blocking fashion instead of burning up CPU cycles. You can use pthread condition variables to achieve that. – nos Jan 02 '22 at 01:35

1 Answer


The issue is that I had a thread like this:

typedef struct {
  atomic_int has_work;
} Worker;

Worker workers[MAX_THREADS];

void *worker(void *arg) {
  u64 tid = (u64)arg;
  while (1) {
    if (atomic_load(&workers[tid].has_work)) {
      // do stuff
    }
  }
}

(...)

Notice how I used an atomic flag, has_work, to "wake up" the worker. As pointed out by @WhozCraig in his comment, it is usually not a good idea to spin-loop on atomic flags. If I understand correctly, the spinning thread keeps a core busy at full speed just re-checking the flag, which overwhelms the CPU. His suggestion was to use a mutex with condition variables, as described in "Programming with POSIX Threads" by Butenhof, section 3.3. This Stack Overflow answer has a snippet showing the common usage of that pattern. The resulting code should look like this:

typedef struct {
  int             has_work;
  pthread_mutex_t has_work_mutex;
  pthread_cond_t  has_work_signal;
} Worker;

Worker workers[MAX_THREADS];

void *worker(void *arg) {
  u64 tid = (u64)arg;
  while (1) {

    pthread_mutex_lock(&workers[tid].has_work_mutex);

    while (!workers[tid].has_work) {
      pthread_cond_wait(&workers[tid].has_work_signal, &workers[tid].has_work_mutex);
    }

    // do stuff

    workers[tid].has_work = 0; // clear the flag, so the next iteration waits again

    pthread_mutex_unlock(&workers[tid].has_work_mutex);
  }
  return 0;
}

Notice how has_work was changed from an atomic int to a plain int guarded by a mutex. A "condition variable" is then associated with that mutex, which allows us to use pthread_cond_wait to put the thread to sleep until another thread signals that has_work might have become true. This has the same effect as the version using atomics, but the waiting thread leaves the core idle, so it is far less CPU-hungry and should perform better.

MaiaVictor
  • Nicely done. This is right down the road of what I was describing. You can make this even more robust by using a single condition variable + mutex pair and a queue, and have all the worker threads monitor that cvar for signals, then, under the protection of the mutex, check the queue and pop off a work item if present. If no work is present (it can happen), just return to waiting. Regardless, what you have works, and you'll find it considerably more CPU friendly. – WhozCraig Jan 02 '22 at 09:02