
I want to parallelize nested loops (I have four cores) in C using pthreads. Inside the loops I'm simply assigning a single value to every element of a two-dimensional array.

When I tried to parallelize it with four threads, it actually slowed my program down by a factor of 3. I guess the threads are somehow blocking each other.

This is the loop to be parallelized.

for ( i = 0; i < 1000; i++ )
{
    for ( j = 0; j < 1000; j++ )
    {
        x[i][j] = 5.432;
    }
}

I tried to parallelize it like this.

void* assignFirstPart(void *val) {
    for ( i = 1; i < 500; i++ )
    {
        for ( j = 1; j < 500; j++ )
        {
            w[i][j] = 5.432;
        }
    }
}

void* assignSecondPart(void *val) {
    for ( ia = 500; ia < 1000; ia++ )
    {
        for ( ja = 500; ja < 1000; ja++ )
        {
            w[ia][ja] = 5.432;
        }
    }
}

void* assignThirdPart(void *val) {
    for ( ib = 1; ib < 1000; ib++ )
    {
        for ( jb = 500; jb < 1000; jb++ )
        {
            w[ib][jb] = 5.432;
        }
    }
}

void* assignFourthPart(void *val) {
    for ( ic = 500; ic < 1000; ic++ )
    {
        for ( jc = 500; jc < 1000; jc++ )
        {
            w[ic][jc] = 5.432;
        }
    }
}

success = pthread_create( &thread5, NULL, &assignFirstPart, NULL );
if( success != 0 ) {
    printf("Couldn't create thread 1\n");
    return EXIT_FAILURE;
}

success = pthread_create( &thread6, NULL, &assignSecondPart, NULL );
if( success != 0 ) {
    printf("Couldn't create thread 2\n");
    return EXIT_FAILURE;
}

success = pthread_create( &thread7, NULL, &assignThirdPart, NULL );
if( success != 0 ) {
    printf("Couldn't create thread 3\n");
    return EXIT_FAILURE;
}

success = pthread_create( &thread8, NULL, &assignFourthPart, NULL );
if( success != 0 ) {
    printf("Couldn't create thread 4\n");
    return EXIT_FAILURE;
}

pthread_join( thread5, NULL );
pthread_join( thread6, NULL );
pthread_join( thread7, NULL );
pthread_join( thread8, NULL );

As I said, parallelizing it like this slows my program down massively, so I'm probably doing something completely wrong. I'm grateful for any advice.

M_eight
This kind of code is memory bound, and it is very unlikely that you can gain anything with multiple threads. Thread creation and synchronisation are expensive, which explains the time increase. BTW, you could have a single function and pass the starting index as a parameter. If you also want to pass the array or the constant, pass a pointer to an ad-hoc struct. – Alain Merigot Jun 17 '19 at 11:48
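A minimal sketch of what that comment suggests. The names RowRange and assignRows are made up for illustration, and w is assumed to be the question's global 1000x1000 array; each thread gets a contiguous block of whole rows, so the regions cannot overlap.

#include <pthread.h>

/* Assumed to mirror the question's global array. */
double w[1000][1000];

/* Hypothetical struct carrying the row range for one thread. */
typedef struct {
    int start_row;
    int end_row;
} RowRange;

/* One worker function for all threads; the range comes in via the argument. */
void* assignRows(void *arg) {
    RowRange *r = arg;
    for (int i = r->start_row; i < r->end_row; i++) {
        for (int j = 0; j < 1000; j++) {
            w[i][j] = 5.432;
        }
    }
    return NULL;
}

/* Inside main: four threads, each filling 250 consecutive rows. */
pthread_t threads[4];
RowRange ranges[4];
for (int t = 0; t < 4; t++) {
    ranges[t].start_row = t * 250;
    ranges[t].end_row   = (t + 1) * 250;
    pthread_create(&threads[t], NULL, assignRows, &ranges[t]);
}
for (int t = 0; t < 4; t++) {
    pthread_join(threads[t], NULL);
}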

2 Answers


assignThirdPart overlaps with the indices of the two previous callbacks. Your loop conditions make little sense; you should be splitting the 1000 iterations of the outer-most loop evenly across the four threads, like:

for ( i = 0; i < 250; i++ )    // thread 1
...
for ( i = 250; i < 500; i++ )  // thread 2
...
for ( i = 500; i < 750; i++ )  // thread 3
...
for ( i = 750; i < 1000; i++ ) // thread 4
...

Also, i = 1 is not equivalent to i = 0: row 0 (and column 0) never gets written.

That being said, this doesn't necessarily improve performance. Just copying data with no calculations makes the data cache the bottleneck on most computers. If you split the work across four threads, you might disturb the CPU's ability to use its caches optimally, and that is highly system-specific.

What you do when you meddle with the inner iterator during parallelisation is segment the whole area to be copied: instead of walking through it linearly, you have one thread copy a bit here and another a bit there, which ruins caching completely. Please read Why does the order of the loops affect performance when iterating over a 2D array?
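For illustration (not part of the original code), compare the two traversal orders over the same array x:

/* Row-major traversal: consecutive writes land in consecutive memory
   locations, so the cache and hardware prefetcher work well. */
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        x[i][j] = 5.432;

/* Column-major traversal: each write is 1000 doubles away from the
   previous one, so nearly every access leaves the cache line it just used. */
for (int j = 0; j < 1000; j++)
    for (int i = 0; i < 1000; i++)
        x[i][j] = 5.432;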

And then of course there is thread creation overhead, which should also be taken into account when benchmarking.

Even if all of this is done properly, it is not necessarily faster with four threads. Multi-threading isn't some magical "always best performance" powder that you can sprinkle over arbitrary code to speed it up. Chewing through 1000 aligned chunks of data is something that a high-end CPU does very effectively single-threaded.

Lundin

It looks like you're using global variables.

If that is the case, they come with a massive overhead when used from multiple threads and will slow it down a lot.
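For example (a sketch only, not from the answer), the shared global loop counters could be replaced with locals declared inside each thread function, here shown for one worker covering rows 0 through 499 and assuming w is the question's global array:

void* assignFirstPart(void *val) {
    /* i and j are local, so each thread has its own private counters. */
    for (int i = 0; i < 500; i++) {
        for (int j = 0; j < 1000; j++) {
            w[i][j] = 5.432;
        }
    }
    return NULL;
}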

Aaron