Mergesort pThread implementation taking same time as single-threaded

Question

(I have tried to simplify this as much as I could to find out where I'm going wrong.)

The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges). I create 2 threads, one sorting the first half and the other sorting the second half, each by calling merge_sort() with the respective parameters.

On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the run without threads it stays at 50%, yet the run times are very close.

This is the (relevant) code:

//These are the 2 sorting functions; each thread will call merge_sort(..). Is it a problem that both threads call the same (normal) function?

void merge (int *v, int start, int middle, int end) {
    //dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
    //copies the original values into the 2 halves
    //then sorts them back into the v array
}

void merge_sort (int *v, int start, int end) {
    //recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
    //calls merge(start, middle, end) 
}

//here I'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code, to make the bug easier to find)

void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;
    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}

//in main I simply create 2 threads calling the above function

int main (int argc, char* argv[])
{
    //some stuff

    //getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();

    //ignore crt_depth for this example (in the full code I'm recursively creating new threads and I need this to know when to stop)
    //a and b are the (inclusive) bounds of the range the created thread has to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;

    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr<<"error\n";
            return err;
        }
    }

    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr<<"error\n";
            return err;
        }
    }

    //now I merge the 2 sorted halves
    merge(v, 0, n/2, n-1);

    //calculate end time
    t_sfarsit = clock();

    cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
    delete [] v;
}

Output (on 1 million values):

Sort time (s): 1.294

Output with direct calling of merge_sort, no threads:

Sort time (s): 1.388

Output (on 10 million values):

Sort time (s): 12.75

Output with direct calling of merge_sort, no threads:

Sort time (s): 13.838

Solution:

I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.

I've used the inplace_merge(..) function instead of my own, and the program's run times are now as they should be.

Here's my initial merge function. (I'm not sure it is really the initial version, since I've probably modified it a few times. The array indices might also be wrong right now, because I went back and forth between [a,b] and [a,b); this was just the last commented-out version.)

void merge (int *v, int a, int m, int c) { //merges the sorted runs v[a,m] and v[m+1,c] into v[a,c]

    //create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    //copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];

    //merge them back together in sorted order
    int is=0, id=0;
    for (int i=0; i<=c-a; i++)  {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
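    //the original had "delete st, dr;", which frees only st (comma operator) and uses the non-array form of delete
    delete[] st;
    delete[] dr;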
}

All this was replaced with:

inplace_merge(v+a, v+m, v+c);
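For readers comparing the one-liner with the function it replaced: std::inplace_merge takes half-open iterator ranges [first, middle) and [middle, last), whereas the merge() above was written for inclusive bounds. If the surrounding code already uses [a,b) ranges, the call above is right as written. A minimal sketch of the mapping for inclusive bounds follows; the wrapper name merge_inclusive is hypothetical and not part of the original program.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical wrapper: merges the sorted runs v[a..m] and v[m+1..c]
// (both bounds inclusive) into one sorted run v[a..c].
void merge_inclusive(int *v, int a, int m, int c) {
    assert(a <= m && m <= c);
    // std::inplace_merge expects [first, middle) and [middle, last),
    // so the inclusive bounds m and c are each shifted up by one.
    std::inplace_merge(v + a, v + m + 1, v + c + 1);
}
```

With this wrapper, the final step in main would read merge_inclusive(v, 0, n/2, n-1), matching the ranges handed to the two threads.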

Edit: some timings on my 3 GHz dual-core CPU:

1 million values: 1 thread: 7.236 s, 2 threads: 4.622 s, 4 threads: 4.692 s

10 million values: 1 thread: 82.034 s, 2 threads: 46.189 s, 4 threads: 47.36 s

Your `merge` is still sequential. What proportion of time is spent in the `merge_sort` and `merge` stages? — Adam, Jun 10 '14 at 10:21
Also, just to make sure you're not re-inventing the wheel: http://www.cplusplus.com/reference/algorithm/merge/ — Adam, Jun 10 '14 at 10:25
You're not really saving much, and paying for what you do save with thread management. Further, your merge could be considerably simpler (and in fact using [`std::inplace_merge`](http://en.cppreference.com/w/cpp/algorithm/inplace_merge) would dramatically simplify this). And why are you even launching two threads at all? You can easily launch *one*, then use the *current* thread as the "other". — WhozCraig, Jun 10 '14 at 10:32
@Adam the final merge call takes 0.171 seconds, I don't think there's an easy way to check how much time it stays in those functions. I know the merge is sequential but I'm thinking that using 2 cores instead of just one should speed up a lot. — dany123, Jun 10 '14 at 10:49
@WhozCraig I have implemented this program using MPI too and I'm trying to keep the same structure to make a comparison between them. The full version of the program can use different numbers of threads. The basic idea is this: IF not reached N threads: | I create new thread to sort left half of current range | I create new thread to sort right half of current range | Wait for both of them to finish | Merge the results. Now, if I want 4 threads, they will be created recursively from those 2 "create new thread" commands. — dany123, Jun 10 '14 at 10:57
I used these exact merge(..) and merge_sort(..) functions on the MPI implementation and they sped up the runtime by almost 2x so I don't think those are the problem, I think there's something specific to pThreads I'm doing wrong because I don't have a lot of experience with them. — dany123, Jun 10 '14 at 11:02
@dany123 I understand how the division of work is laid out. My point was dedicating the *current* thread to doing nothing but *waiting* after launching two other threads is a waste. Rather than launch+launch+wait you could just as easily launch+work+wait and reduce the startup penalties by 50%. That was my point. — WhozCraig, Jun 10 '14 at 11:08
This was what I was referring to: [See it live](http://coliru.stacked-crooked.com/a/bbf70209e5f44b69). Dunno if it helps, but best of luck. — WhozCraig, Jun 10 '14 at 11:15
@WhozCraig Sorry, I understand now what you mean. I was hoping that creating a thread that just waits isn't a big deal, as (I think) it makes the code more readable. I did try creating a 2nd thread for the 2nd half, sorting the 1st half in the current thread, and waiting for the 2nd thread to end, but sadly I get the exact same time. — dany123, Jun 10 '14 at 11:28
One possible problem is that [copying data using "for" loops is something _really_ slow](http://stackoverflow.com/questions/4729046/memcpy-vs-for-loop-whats-the-proper-way-to-copy-an-array-from-a-pointer). You may try using either [`memcpy`](http://www.cplusplus.com/reference/cstring/memcpy/) or [`std::copy`](http://www.cplusplus.com/reference/algorithm/copy/) — Bruno Ferreira, Jun 10 '14 at 19:23
@BrunoFerreira you're right, it was probably the sum of all the extra operations my code does compared to the one, good thing to remember for next time. — dany123, Jun 10 '14 at 19:31
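The launch+work+wait pattern WhozCraig describes in the comments can be sketched roughly as follows. This is a minimal illustration, not the code behind his link: the names range_arg, sort_range and sort_two_way are hypothetical, std::sort stands in for the poster's merge_sort, and the ranges use inclusive bounds as in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <pthread.h>

// Hypothetical argument block: sort v[a..b], both bounds inclusive.
struct range_arg {
    int *v;
    int a;
    int b;
};

// Thread entry point; std::sort stands in for the poster's merge_sort.
static void *sort_range(void *arg) {
    range_arg *r = static_cast<range_arg *>(arg);
    std::sort(r->v + r->a, r->v + r->b + 1);
    return nullptr;
}

// Sorts v[0..n-1] using the calling thread plus one extra thread.
// Returns false if the extra thread could not be created or joined.
bool sort_two_way(int *v, int n) {
    assert(v != nullptr && n >= 0);
    if (n < 2) return true;
    int m = (n - 1) / 2;                  // left half is v[0..m]
    range_arg right = {v, m + 1, n - 1};
    pthread_t worker;
    if (pthread_create(&worker, nullptr, sort_range, &right) != 0)
        return false;
    range_arg left = {v, 0, m};
    sort_range(&left);                    // current thread sorts the left half
    if (pthread_join(worker, nullptr) != 0)
        return false;
    // Both halves are sorted; merge them (half-open iterator ranges).
    std::inplace_merge(v, v + m + 1, v + n);
    return true;
}
```

This only removes one thread creation per split. As the poster reports above, it did not change the measured time in this case, so it is a tidiness gain rather than the fix.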

score 0 · Answer 1 · edited May 23 '17 at 10:26

Note: since the OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for the sake of those who might find the information useful.

clock() is the wrong interface for measuring time on Linux: it measures the CPU time used by the program (see http://linux.die.net/man/3/clock), which for a multi-threaded program is the sum of the CPU time of all threads. You need to measure elapsed (wall-clock) time instead. See more details in this SO question: "C: using clock() to measure time in multi-threaded programs", which also tells what API can be used instead of clock().

In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.

Update: In Microsoft's implementation of the C run-time library, clock() returns wall-clock time, so it is OK to use for your purpose. It's unclear, though, whether you use Microsoft's toolchain or something else, like Cygwin or MinGW.
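One way to sidestep the platform difference is to record both clocks around the same piece of work. The sketch below is an assumption-laden illustration, not something from the original program: the struct timing and the helper time_it are hypothetical names, and std::chrono::steady_clock (C++11) is used for elapsed time.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical result type: both measurements for one run of the work.
struct timing {
    double wall_seconds;  // elapsed time from std::chrono::steady_clock
    double cpu_seconds;   // value derived from std::clock()
};

// Hypothetical helper: runs work() once and reports both clocks.
template <typename F>
timing time_it(F work) {
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();
    work();
    auto w1 = std::chrono::steady_clock::now();
    std::clock_t c1 = std::clock();
    timing t;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    t.cpu_seconds = double(c1 - c0) / CLOCKS_PER_SEC;
    return t;
}
```

Where clock() counts CPU time across all threads, a well-parallelized two-thread sort would be expected to show cpu_seconds staying roughly flat while wall_seconds drops. Where clock() already reports wall-clock time, the two numbers should be close.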

Timing it with timeGetTime() instead gives the same number. I also tried timing it myself and it doesn't look like this is the problem; I think my code somehow doesn't run the threads concurrently. — dany123, Jun 10 '14 at 16:14
Oh, you run on Windows? I got misled by the use of pthreads and assumed Linux. — Alexey Kukanov, Jun 10 '14 at 16:51

score 0 · Accepted Answer · answered Jun 10 '14 at 18:51

There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
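To make the allocation point concrete, here is a minimal sketch of a merge sort that allocates one scratch buffer per top-level call instead of two arrays per merge. The function names are hypothetical and inclusive bounds are assumed, as in the question. Whether allocator contention was really the bottleneck in the original program is not established here; the sketch only shows how to take the per-merge new/delete out of the picture.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Merges sorted v[a..m] and v[m+1..c] (inclusive) through caller-provided
// scratch space, so no new/delete happens per merge.
static void merge_with_scratch(int *v, int *tmp, int a, int m, int c) {
    std::merge(v + a, v + m + 1, v + m + 1, v + c + 1, tmp + a);
    std::copy(tmp + a, tmp + c + 1, v + a);
}

static void merge_sort_rec(int *v, int *tmp, int a, int c) {
    if (a >= c) return;
    int m = a + (c - a) / 2;
    merge_sort_rec(v, tmp, a, m);
    merge_sort_rec(v, tmp, m + 1, c);
    merge_with_scratch(v, tmp, a, m, c);
}

// Sorts v[a..c]; the only heap allocation is one buffer per call.
void merge_sort_prealloc(int *v, int a, int c) {
    assert(a >= 0);
    if (a >= c) return;
    std::vector<int> tmp(c - a + 1);
    // Shift the base pointer so the recursion works on indices 0..c-a
    // and tmp needs only as many elements as the range being sorted.
    merge_sort_rec(v + a, tmp.data(), 0, c - a);
}
```

If each thread calls merge_sort_prealloc on its own range, each one allocates once and then works only on its own slice of v and its own buffer.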

Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.

I was just about to post this, I've modified my program to use inplace_merge instead of my merge implementation and now the numbers look as they should. — dany123, Jun 10 '14 at 18:58
