
I wrote the following very simple pthread code to test how it scales. I am running it on a machine with 8 logical processors, and I never create more than 8 threads (to avoid context switching). As the number of threads increases, each thread has less work to do. It is also evident from the code that the threads share no data structures that could become a bottleneck. Still, performance degrades as I increase the number of threads. Can somebody tell me what I am doing wrong here?

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int NUM_THREADS = 3;
unsigned long int COUNTER = 10000000000000;
unsigned long int LOOP_INDEX;

void* addNum(void *data)
{
    unsigned long int sum = 0;
    for(unsigned long int i = 0; i < LOOP_INDEX; i++) {
            sum += 100;
    }
    return NULL;
}

int main(int argc, char** argv)
{
    NUM_THREADS = atoi(argv[1]);
    pthread_t *threads = (pthread_t*)malloc(sizeof(pthread_t) * NUM_THREADS);
    int rc;

    clock_t start, diff;

    LOOP_INDEX = COUNTER/NUM_THREADS;        
    start = clock();

    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_create((threads + t), NULL, addNum, NULL);
        if (rc) {
             printf("ERROR; return code from pthread_create() is %d", rc);
             exit(-1);
        }
    }

    void *status;
    for (int t = 0; t < NUM_THREADS; t++) {
            rc = pthread_join(threads[t], &status);
    }

    diff = clock() - start;
    int sec = diff / CLOCKS_PER_SEC;
    printf("%d",sec);
}

Note: All the answers I found online said that the overhead of creating the threads outweighs the work they do. To test this, I commented out everything in the `addNum()` function. After that, no matter how many threads I create, the program reports 0 seconds. So I don't think thread-creation overhead is the cause.

archita
  • You'd probably get a more accurate measurement if you measured wall clock time; clock() measures CPU time. Also increase your resolution: 1 second is a very long time to a computer. – nos Aug 07 '15 at 15:49
  • Your `addNum()` function is a no-op, since it only changes local state, and doesn't return the result of its computation. A conforming compiler is free to optimize it away entirely, so your program runs for the time needed to create and join its threads. – EOF Aug 07 '15 at 16:29
  • @nos: I also timed the code with gettimeofday(), which measures wall clock time. As you suggested, I also changed the time unit from sec to usec. I still see no improvement. Is this a wrongly constructed example? – archita Aug 07 '15 at 17:50
  • @EOF: If addNum() is a no-op, how can I change it to see the scaling? – archita Aug 07 '15 at 17:54
  • Honestly, `addNum()` is so trivial I can think of several ways the compiler can optimize it, from just returning immediately, to just multiplying `LOOP_INDEX * 100` to vectorizing the calculation. You're going to have to find a function the compiler doesn't recognize, *and* return a result. To see if you've been successful, have the compiler generate assembly for you (`-S` in gcc). – EOF Aug 07 '15 at 18:07
  • Adding to *@nos*'s comment: related, if not a duplicate: http://stackoverflow.com/q/2962785/694576 – alk Aug 08 '15 at 08:49

1 Answer


clock() counts CPU time used, across all threads. So all that's telling you is that you're using a little bit more total CPU time, which is exactly what you would expect.

It's the total wall clock elapsed time that should go down if your parallelisation is effective. Measure that with clock_gettime(), specifying the CLOCK_MONOTONIC clock, instead of clock().

caf