Array sum with 4 threads is slower than 1 thread

Question

I'm writing a program to sum N numbers in an array. To do this, the program may split its execution in a number of threads, each of them summing N/numThreads positions of the array.

However, the fastest execution time I've got so far was using 1 thread. Using 4 threads may take as much as twice the time! I've thought a lot about this problem (including the possibility of false sharing), but I just can't think of a solution. I really can't see where the false sharing is occurring, if at all.

I'm running Linux Mint and compiling using GCC by typing gcc spinlock.c -o spinlock -lpthread -lm My code is below. Please feel free to compile it and play around with it. You may change the value of N (number of elements in array) by changing the Define in the program and you may pass any number of threads as arguments for the program:

#define _GNU_SOURCE

#include <pthread.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <ctype.h>
#include <sys/types.h>
#include <inttypes.h>
#include <sched.h>
#include <math.h>
#include <time.h>

#define N 10000000

long sumTotal = 0; /* Global sum variable */
int lock = 0;    /* Lock for sum variable access */

struct threadInfo {            /* Argument passed to the function the thread will execute */
    pthread_t     threadId;    /* Returned ID by pthread_create() */
    int           threadNum;   /* Thread number (from 1 to threadNum) */
    unsigned long sectionSize; /* Section size to operate */
    int8_t*       section;     /* Pointer to start of the array to be summed */
};

void acquire(int* _lock) {
    while (__sync_lock_test_and_set(_lock, 1));
}

void release(int* _lock) {
    *_lock = 0;
}

void sumNum(void* arg){
    struct threadInfo *tinfo = arg;
    long i, sum = 0;

    for (i=0; i < tinfo->sectionSize; i++) {
        sum += tinfo->section[i];
    }


    acquire(&lock);
    sumTotal += sum;
    release(&lock);
}

int main(int argc, char** argv)
{
    int8_t* numbers;
    unsigned long sectionSize, currNum;
    unsigned int currThr, numThreads;
    int err, num_cores, i, core_interval, t1, t2;
    double arithmetics;
    pthread_attr_t attr;
    struct threadInfo *tinfo;

    /* Invalid number of arguments: */
    if (argc != 2){
        fprintf(stderr, "Usage: %s [threads]\n", argv[0]);
        exit(-1);
    }

    /* Converts argv to int: (K) */
    numThreads = (int) strtol(argv[1], (char**) NULL, 10);
    sectionSize = N/numThreads;

    /* Randoms the seed: */
    srand(time(NULL));

    /* Fills array with numbers: */
    numbers = (int8_t*) malloc(N);
    for (currNum=0; currNum<N; currNum++){
        numbers[currNum] = (int8_t) (((rand())%199)-99);
    }

    /* Initializes struct containing default attributes of threads: */
    err = pthread_attr_init(&attr);
    if (err != 0){
        fprintf(stderr, "Failed.\n");
        exit(-2);
    }

    /* Allocates memory for pthread_create arguments: */
    tinfo = calloc(numThreads, sizeof(struct threadInfo));
    if (tinfo == NULL){
        fprintf(stderr, "Failed.\n");
        exit(-3);
    }

    /* Thread affinity operations: */
    num_cores = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t cpuset[num_cores];
    for (i=0; i<num_cores; i++){
        CPU_ZERO(&(cpuset[i]));
        CPU_SET(i+1, &(cpuset[i]));
    }
    core_interval = N/num_cores;

    t1 = clock();

    /* Initialize threads: */
    for (currThr=0; currThr<numThreads; currThr++){
        tinfo[currThr].threadNum = currThr + 1;
        tinfo[currThr].section = &(*(numbers+currThr*sectionSize));
        /* If this is the last thread, takes the remainder of the array:     */
        if (currThr+1 == numThreads) {
            tinfo[currThr].sectionSize = N - (currThr*sectionSize);
        /* Other threads: */
        } else {
            tinfo[currThr].sectionSize = sectionSize;
        }

        /* Calculates to which core a thread should go: */
        arithmetics = (double) (currThr+1)*num_cores/numThreads;
        i = ceil(arithmetics);
        tinfo[currThr].threadId = pthread_self();
        /* This line, if uncommented, sets a group of threads to a specific cpu (core): */
        //pthread_setaffinity_np(tinfo[currThr].threadId, sizeof(cpu_set_t), &cpuset[i-1]);

        /* Creates the thread: */
        err = pthread_create(&tinfo[currThr].threadId,
                             &attr,
                             (void*) &sumNum,
                             &tinfo[currThr]);
        if (err != 0){
            fprintf(stderr, "Failed to create thread.\n");
            exit(-4);
        }
    }

    void* ret = NULL;

    /* Joins threads: */
    for (currThr=0; currThr<numThreads; currThr++) {
        pthread_join(tinfo[currThr].threadId, &ret);
    }

    t2 = clock();

    printf("Total sum: %ld. Total time: %f\n", sumTotal, (t2-t1*1.0)/CLOCKS_PER_SEC);
    return 0;
}

Thank you for your time, any help is appreciated!

Does pthread put threads on different cores? Is there any reason to think it would? — Lee Daniel Crocker, Apr 12 '17 at 18:21
@LeeDanielCrocker pthread knows nothing about cores. It delegates that level of thread management to the OS. — ThingyWotsit, Apr 12 '17 at 18:22
I measured the time it took for a thread to start and finish computation, before using the lock. It's really fast for the first thread and gets increasingly bigger for the other threads. You can try it yourself, if possible! I don't believe the lock is the problem. — Denolth, Apr 12 '17 at 18:26
All good! I had an error in my code when I copy and pasted. I corrected it now (just syntax) — Denolth, Apr 12 '17 at 18:29
Get rid of that thread affinity gunge - it's just muddying the waters. — ThingyWotsit, Apr 12 '17 at 18:30
Your locking isn't thread-safe at all. I suggest you use the synchronization primitives in C++11 (`std::mutex` or for lower level stuff `std::atomic`). — Cameron, Apr 12 '17 at 18:33
Nevertheless, I'm somewhat baffled by the overall lack of speedup. I can't see any of the usual cockups... — ThingyWotsit, Apr 12 '17 at 18:42
Another thing: Don't compare performance of debug builds (the default optimization level is `-O0`). — Cameron, Apr 12 '17 at 18:45
How many cores you got anyway? I'm still baffled - if you have 2 or more, I would expect a very noticeable improvement. — ThingyWotsit, Apr 12 '17 at 18:45
I tried running on a i3 and i5, no speedups. The use of test and set is part of an assignment, which is why I wrote it. Cool part: if you uncomment the line that sends each thread to a fixed core, it runs around 8% faster. I assume its the use of the cache. — Denolth, Apr 12 '17 at 18:51
@Denolth I havn't seen any numbers anywhere that demostrate any significant advantage by using that affinity stuff. — ThingyWotsit, Apr 12 '17 at 19:06
I tried multiple configurations of N, from 10 to 10^9. The single-threaded program is still faster. — Denolth, Apr 12 '17 at 19:28
My question was marked as duplicated and indeed I was measuring time in a wrong manner! The program is now around 3x faster when using 4 or more threads (since 2 cores are actually logical cores). Thanks! — Denolth, Apr 12 '17 at 20:09

Array sum with 4 threads is slower than 1 thread

0 Answers0