
I am using pthreads with gcc. This simple code example takes the number of threads "N" as a user-supplied input. It splits a long array into N roughly equal-sized subblocks, and each subblock is written into by its own thread.

The dummy processing in this example just sleeps for a fixed amount of time at each array index and then writes a number into that array location.

Here's the code:

/******************************************************************************
* FILE: threaded_subblock_processing.c
* DESCRIPTION:
* We have a bunch of parallel processing to do and store the results in a
* large array. Let's try to use threads to speed it up.
******************************************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>

#define BIG_ARR_LEN 10000

typedef struct thread_data{
  int start_idx;
  int end_idx;
  int id;
} thread_data_t;

int big_result_array[BIG_ARR_LEN] = {0};

void* process_sub_block(void *td)
{
   struct thread_data *current_thread_data = (struct thread_data*)td;
   printf("[%d] Hello World! It's me, thread #%d!\n", current_thread_data->id, current_thread_data->id);
   printf("[%d] I'm supposed to work on indexes %d through %d.\n", current_thread_data->id, 
       current_thread_data->start_idx, 
       current_thread_data->end_idx-1);

   for(int i=current_thread_data->start_idx; i<current_thread_data->end_idx; i++)
   {
       int retval = usleep(10 * 1000 * 1000 / BIG_ARR_LEN); /* 1000 us per element */
       if(retval)
       {
         printf("sleep failed\n");
       }

       big_result_array[i] = i;
   }

   printf("[%d] Thread #%d done, over and out!\n", current_thread_data->id, current_thread_data->id);
   pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
   if (argc!=2)
   {
     printf("usage: ./a.out number_of_threads\n");
     return(1);
   }

   int NUM_THREADS = atoi(argv[1]);

   if (NUM_THREADS<1)
   {
     printf("usage: ./a.out number_of_threads (where number_of_threads is at least 1)\n");
     return(1);
   }

   pthread_t *threads = malloc(sizeof(pthread_t)*NUM_THREADS);
   thread_data_t *thread_data_array = malloc(sizeof(thread_data_t)*NUM_THREADS);

   int block_size = BIG_ARR_LEN/NUM_THREADS;
   for(int i=0; i<NUM_THREADS-1; i++)
   {
     thread_data_array[i].start_idx = i*block_size;
     thread_data_array[i].end_idx = (i+1)*block_size;
     thread_data_array[i].id = i;
   }
   thread_data_array[NUM_THREADS-1].start_idx = (NUM_THREADS-1)*block_size;
   thread_data_array[NUM_THREADS-1].end_idx = BIG_ARR_LEN;
   thread_data_array[NUM_THREADS-1].id = NUM_THREADS-1;

   int ret_code;
   long t;
   for(t=0;t<NUM_THREADS;t++){
     printf("[main] Creating thread %ld\n", t);
     ret_code = pthread_create(&threads[t], NULL, process_sub_block, (void *)&thread_data_array[t]);
     if (ret_code){
       printf("[main] ERROR; return code from pthread_create() is %d\n", ret_code);
       exit(-1);
       }
     }

   printf("[main] Joining threads to wait for them.\n");
   void* status;
   for(int i=0; i<NUM_THREADS; i++)
   {
     pthread_join(threads[i], &status);
   }

   pthread_exit(NULL);
}

and I compile it with

gcc -pthread threaded_subblock_processing.c

and then I call it from command line like so:

$ time ./a.out 4

I see a speed up when I increase the number of threads. With 1 thread the process takes just a little over 10 seconds. This makes sense because I sleep for 1000 usec per array element, and there are 10,000 array elements. Next when I go to 2 threads, it goes down to a little over 5 seconds, and so on.

What I don't understand is that I get a speed-up even after my number of threads exceeds the number of cores on my computer! I have 4 cores, so I was expecting no speed-up for >4 threads. But, surprisingly, when I run

$ time ./a.out 100

I get a 100x speedup and the processing completes in ~0.1 seconds! How is this possible?

Atul Ingle
    A thread does not need to use a CPU when it is sleeping. The number of CPU cores in your machine does not limit the number of threads that can sleep at the same time. – Solomon Slow Jul 06 '17 at 20:45
  • Ah! So I'll have to add some "real" work (like a huge loop) instead of my sleep command. Sleep doesn't block the CPU! – Atul Ingle Jul 06 '17 at 20:48
  • @AtulIngle Even then you *might* get faster results to up to 8, if you have a hyperthreading CPU, like an i7. – Lasse Meyer Jul 06 '17 at 20:51
  • @jameslarge would you like to convert your comment to an answer? After removing the sleep command and adding a long for loop instead I now see the expected result of no speed up when number of threads > number of cores. – Atul Ingle Jul 06 '17 at 20:57
  • This question seems eerily familiar.... – ThingyWotsit Jul 06 '17 at 22:03

1 Answer


Some general background

A program's progress can be slowed by many things, but, in general, you can sort slow spots (otherwise known as hot spots) into two categories:

  • CPU Bound: In this case, the processor is doing some heavy number crunching (like trigonometric functions). If all the CPU's cores are engaged in such tasks, other processes must wait.

  • Memory bound: In this case, the processor is waiting for information to be retrieved from the hard disk or RAM. Since these are typically orders of magnitude slower than the processor, from the CPU's perspective this takes forever.

But you can also imagine other situations in which a process must wait, such as for a network response.

In many of these memory-/network-bound situations, it is possible to put a thread "on hold" while the memory crawls towards the CPU and do other useful work in the meantime. If this is done well then a multi-threaded program can well out-perform its single-threaded equivalent. Node.js makes use of such asynchronous programming techniques to achieve good performance.

Here's a handy depiction of various latencies: Latency numbers every programmer should know

Your question

Now, getting back to your question: you have multiple threads going, but they are performing neither CPU-intensive nor memory-intensive work: there's not much there to take up time. In fact, the sleep function is essentially telling the operating system that no work is being done. In this case, the OS can do work in other threads while your threads sleep. So, naturally, the apparent performance increases significantly.

Note that for low-latency applications, such as MPI, busy waiting is sometimes used instead of a sleep function. In this case, the program goes into a tight loop and repeatedly checks a condition. Externally, the effect looks similar, but sleep uses no CPU while the busy wait uses ~100% of the CPU.

Richard
    Off topic, might be interesting to some: MPI does provide asynchronous I/O functions, which allow computation while data is being transferred. Alas, most HPC programs (especially distributed simulators) do not use these, and instead use synchronous I/O exclusively, and recommend users to invest in fast network infrastructure (typically InfiniBand). Which is wasteful. But, most programmers seem to find async I/O too complicated to use. Hopefully this will change in the future, so we get more value (computation) for our money (hardware, electricity, and cooling needed). – Nominal Animal Jul 06 '17 at 23:19
  • @NominalAnimal: An easy compromise is to set MPI to use polling rather than busy waiting: this does not require any code modification. – Richard Jul 06 '17 at 23:30
  • Oh, that's not the issue at all. It boils down to programmers not using algorithms that allow both computation and data transfer to occur at the same time. With asynchronous transfers, and sufficient computation between transfer initiation and when the results are needed, it is possible to get better throughput with multiple threads than with a single thread; mainly, when the per-thread working set fits in core-local cache, or can be kept more localized in memory than a larger dataset. For distributed computing, it allows linear time behaviour for *many* cores. – Nominal Animal Jul 07 '17 at 00:49
  • (By better throughput, I mean that N threads can do *more* than N times the work of a single thread, per unit of real time elapsed.) – Nominal Animal Jul 07 '17 at 00:58