
I had a piece of code that looked like this:

for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {

        mark[crawler[k]]++;
        r = rand() % node_info[crawler[k]].num_of_nodes;
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
    }
}

I changed it so that the load can be split among multiple threads. Now it looks like this:

for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {            
        pthread_mutex_lock( &mutex1 );
        mark[crawler[k]]++;
        pthread_mutex_unlock( &mutex1 );

        pthread_mutex_lock( &mutex1 );
        r = rand() % node_info[crawler[k]].num_of_nodes;
        pthread_mutex_unlock( &mutex1 );

        pthread_mutex_lock( &mutex1 );
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
        pthread_mutex_unlock( &mutex1 );
    }
}

I need the mutexes to protect shared variables. It turns out that my parallel code is slower. But why? Is it because of the mutexes?

Could this possibly have something to do with the cache line size?

perceptron
  • Adding mutexes in and of itself doesn't make your code parallel. You need to create multiple threads and assign them work to do. The overhead of creating threads, and of mutex locking/unlocking to enforce mutual exclusion, is significant. Think about partitioning the work in such a way that threads can work independently for as long as possible, without having to synchronize. – amdn Feb 04 '13 at 02:12
  • 1
    I have spawned multiple threads using pthread_create();. I did not assume that having mutexes will parallelise the code. But these threads have to cross boundaries and work on the same set of variables. – perceptron Feb 04 '13 at 02:15
  • 1
    Which loop do you parallelize, for `i`(outer loop) or `k`(inner loop) or both? You'd better describe how parallelize your code with pthread. (I can't say for certain but `mutex1` lock/unlocking might be actually unnecessary...) – yohjp Feb 04 '13 at 02:38
  • The entire code shown above is inside a function. I have created multiple threads that will execute the function in parallel. I am using the mutexes because the elements mark, DataBlock and node_info are common to all the threads. The outer for loop cannot be parallelised because each iteration is essentially a crawl through a graph, and the inner loop is the actual crawl. The entire crawl process is what is split across the threads. – perceptron Feb 04 '13 at 02:42
  • In terms of what do the different threads access different data? The indices used are the same, and you say the variables in use (and protected via the mutex) are the same for all threads, so how is the workload assigned to the different threads? – alk Feb 04 '13 at 07:52
  • The DataBlock is a pointer which is passed to the thread. Each thread gets a different pointer. Let's say I have an array; I send pointers to different portions of the array to each thread. @alk – perceptron Feb 04 '13 at 16:32
  • Ah OK, I see. So might 'false sharing' (http://en.wikipedia.org/wiki/False_sharing) be an issue here? – alk Feb 04 '13 at 18:27
  • 2
    `rand()` is bad as it modifies a shared global state (source of sharing). Better use the `drand48` generator with private state, i.e. the `erand48(3)` library function. – Hristo Iliev Feb 05 '13 at 15:49
  • Oh cool. I will try the drand48 function and see if the performance improves. – perceptron Feb 05 '13 at 22:25
  • If I understand, you created `N` threads to run the loops with the same number of iterations? So the data are processed `N` times over? – Yann Droneaud May 24 '13 at 14:43
  • BTW, if the number of nodes is not a power of two, there will be a bias, e.g. some values of `r` will happen more often... that's the danger of modulo. http://stackoverflow.com/questions/14614787/mathematics-behind-modulo-behavor – Yann Droneaud May 24 '13 at 14:45

1 Answer


You are not parallelizing anything but the loop heads. Everything between lock and unlock is forced to execute sequentially, and since lock/unlock are (potentially) expensive operations, the code gets slower.

To fix this, you should at least separate expensive computations (without mutex protection) from access to shared data areas (with mutexes). Then try to move the mutexes out of the inner loop.

You could use atomic increment instructions (platform-dependent) instead of a plain '++', which is generally cheaper than mutexes. But beware of doing this often, from different threads in parallel, on data that shares a single cache line (see 'false sharing').

AFAICS, you could rewrite the algorithm as indicated below without needing mutexes or atomic increments at all. getFirstK() is NumOfNodes/NumOfThreads*t if NumOfNodes is an integral multiple of NumOfThreads.

for(t=0;t<NumberOfThreads;t++)
{
    kbegin = getFirstK(NumOfNodes, NumOfThreads, t);
    kend   = getFirstK(NumOfNodes, NumOfThreads, t+1);

    // start the following in a separate thread with kbegin and kend 
    // copied to thread local vars kbegin_ and kend_

    int k, i, r;
    unsigned state = kend_; // really bad seed

    for(k=kbegin_;k<kend_;k++)
    {
        for(i=0;i<NumberOfSteps;i++)
        {
            mark[crawler[k]]++;
            r = rand_r(&state) % node_info[crawler[k]].num_of_nodes;
            crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
        }
    }
}
// wait for threads/jobs to complete

This way of generating random numbers may lead to bad random distributions; see this question for details.

Jacob