I am facing some difficulties using OpenMP. I am a beginner and I do not know what I am doing wrong. This is a project for one of my University courses, so I am not looking for a full solution, rather for a hint or an explanation.

The project is to calculate the Hamming distance between every pair of strings drawn from two different sets (let's say setA and setB). Each set may contain 100, 1000, or 10000 strings, all of the same length.

My problem is that, although I have decreased the execution time of the parallel program, it still takes longer than the serial algorithm.

I attach my code below to show what I have done so far.

Serial C code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc,char **argv)
{

//initialize sets' number and string's length
int m=atoi(argv[1]),n=atoi(argv[2]),I=atoi(argv[3]);
int i=0,j=0,l=0,TotalHammingDistance=0,count;

//creation of 2-dimensional matrices for setA and setB
char **setA = malloc(m * sizeof(char *)); // Allocate row pointers
for(i = 0; i < m; i++)
    setA[i] = malloc((I+1) * sizeof(char));  // Allocate each row separately

char **setB = malloc(n * sizeof(char *)); // Allocate row pointers
for(i = 0; i < n; i++)
    setB[i] = malloc((I+1) * sizeof(char));  // Allocate each row separately

// initialize matrices with random alphanumeric strings
for (i=0;i<m;i++){
    for(j=0;j<I;j++){
        setA[i][j]="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"[rand() % 62];
    }
    setA[i][I]='\0';
}

for (i=0;i<n;i++){
    for(j=0;j<I;j++){
        setB[i][j]="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"[rand() % 62];
    }
    setB[i][I]='\0';
}

//creation of m*n matrix to store all hamming distances and initialize it
int **HamDist = malloc(m * sizeof(int *)); // Allocate row pointers
for(i = 0; i < m; i++)
  HamDist[i] = malloc(n * sizeof(int));

for(i=0;i<m;i++){
    for(j=0;j<n;j++){
        HamDist[i][j]=0;
    }
}

clock_t start=clock();
//Calculate hamming distance for all combinations of the strings
for (i=0;i<m;i++){
    for(j=0;j<n;j++){
        count=0;
        for(l=0;l<I;l++) {   // compare only the I characters, not the terminator
            if (setA[i][l] != setB[j][l])
                count++;
        }
        HamDist[i][j]=count;
        TotalHammingDistance+=HamDist[i][j];
    }
}
clock_t end =clock();
double hamm_time=(double)(end-start)/CLOCKS_PER_SEC;

printf("\n|Total Hamming execution time= %f",hamm_time);
printf("\n\n*|The Total Hamming Distance is: %d\n",TotalHammingDistance );
} 

OpenMP C code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <omp.h>

int main(int argc,char **argv)
{
//initialize sets' number and string's length
    int m=atoi(argv[1]),n=atoi(argv[2]),I=atoi(argv[3]);
    int i=0,j=0,TotalHammingDistance=0, tid,nthreads,chunk;

    //creation of 2-dimensional matrices for setA and setB
    char **setA = malloc(m * sizeof(char *)); // Allocate row pointers
    for(i = 0; i < m; i++)
      setA[i] = malloc((I+1) * sizeof(char));  // Allocate each row separately

    char **setB = malloc(n * sizeof(char *)); // Allocate row pointers
    for(i = 0; i < n; i++)
      setB[i] = malloc((I+1) * sizeof(char));  // Allocate each row separately

    // initialize matrices with random alphanumeric strings
    for (i=0;i<m;i++){
        for(j=0;j<I;j++){
            setA[i][j]="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"[rand() % 62];
        }
        setA[i][I]='\0';
    }

    for (i=0;i<n;i++){
        for(j=0;j<I;j++){
            setB[i][j]="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"[rand() % 62];
        }
        setB[i][I]='\0';
    }

    //creation of m*n matrix to store all hamming distances and initialize it
    uint16_t **HamDist = malloc(m * sizeof(uint16_t *)); // Allocate row pointers
    for(i = 0; i < m; i++)
      HamDist[i] = malloc(n * sizeof(uint16_t));

    for(i=0;i<m;i++){
        for(j=0;j<n;j++){
            HamDist[i][j]=0;
        }
    }

    printf("\n HamDist set \n" );
    int count=0;
    clock_t start=clock();

    omp_set_num_threads(2);
    #pragma omp parallel shared(setA, setB,HamDist ) 
    {
        int k,p,l,count=0;
        #pragma omp for schedule(dynamic, 10000)        
        for (k=0;k<m;k++){
             for(p=0;p<n;p++){
                count=0;
                for(l=0;l<I;l++){   // compare only the I characters, not the terminator
                    if (setA[k][l] != setB[p][l]){
                        count++;
                    }
                }
                HamDist[k][p]=count;
            }
        }
    }

    clock_t end =clock();
    double per_time=(double)(end-start)/CLOCKS_PER_SEC;
    printf("\n|Total time for two sets= %f",per_time);

    // sum up all the distances
    for (i=0;i<m;i++){
          for(j=0;j<n;j++){
              TotalHammingDistance+=HamDist[i][j];
          }
    }

printf("\n|Total execution time= %f",per_time);
printf("\n\n*|The Total Hamming Distance is: %d\n",TotalHammingDistance );
}

The execution time that I get is around 42.011104 s for the OpenMP program and about 32.876482 s for the serial algorithm (m = n = 10000 and I = 100, where m and n are the number of strings in each set and I is the string length).

I strongly believe that the parallel program should take less execution time. Any ideas?

Thanks in advance!

  • How many processors/cores does the computer you're testing this on have? How long was the string that you got the benchmarking data from? If the string is too short, the overhead of creating threads will outweigh the advantage of distributing the workload. – ack Mar 15 '18 at 23:29
  • Measure how much time you're spending in all those `malloc()` calls. And turn on *all* your compiler warnings. You're also likely spending a huge amount of time in `setB[i][j]="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"[rand() % 62];` With `m=n=10000` that's 100,000,000 calls to `rand()`. – Andrew Henle Mar 15 '18 at 23:33
  • @AlexQuilliam My computer has 4 cores. I calculated the execution times for 2, 4, and 6 threads. With 2 threads I had the results described in the post. The strings are 100 chars long. As for the workload (chunk size), we found that the best result so far is with 1000. – Sotiris Dimitras Mar 15 '18 at 23:36
  • @AndrewHenle I do not measure the time for initializing the tables. I measure the time only for the hamming distance calculation. – Sotiris Dimitras Mar 15 '18 at 23:38
  • You might want to set `omp_get_num_threads()` to `omp_get_max_threads()` rather than 2. – Davislor Mar 15 '18 at 23:53
  • @Davislor If I increase the number of threads, the execution time also increases! – Sotiris Dimitras Mar 15 '18 at 23:57
  • You could try to `COLLAPSE` the loop and perhaps use `STATIC` scheduling? – Davislor Mar 16 '18 at 00:05
  • Since it looks like the overhead of thread synchronization is costing you more than parallelism gains you, look for ways to reduce it, maybe increasing the chunk size? – Davislor Mar 16 '18 at 00:12
  • But, the big win would probably be turning `count++` into a reduction operation. You don’t want to synchronize increment operations on a `shared` variable; you want each thread to keep its own count and add them together at the end, with `reduction(+:count)`. – Davislor Mar 16 '18 at 00:16
  • @Davislor I used **static** scheduling, but the execution time also increased. Moreover, the variable count is a private variable. It is being created (int count) inside parallel region. So, if I am not mistaken, I can not use it with reduction. – Sotiris Dimitras Mar 16 '18 at 00:32
  • Possible duplicate of [OpenMP time and clock() calculates two different results](https://stackoverflow.com/questions/10673732/openmp-time-and-clock-calculates-two-different-results) – Zulan Mar 16 '18 at 07:52
  • If your parallelized loop has 10000 iterations and you set the chunk size for OpenMP `schedule` to 10000, then all the iterations are effectively performed by a single thread only. Are you aware of this fact? – Daniel Langr Mar 16 '18 at 09:21
  • @DanielLangr I was not until yesterday! However, I read the documentation carefully once again and figured it out!! Thank you for making it clear to me! – Sotiris Dimitras Mar 17 '18 at 20:52

1 Answer


Measuring multiprocessor performance is a bit more complicated, but we can get a good approximation of "does it work or not?" with time(1). If I run your code as it is (with GCC gcc-4.8.real (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5, invoked with gcc -W -Wall -Wextra -O3 -fopenmp openmptest.c -o openmptest) I get

$ time ./openmptest 10000 10000 100

 HamDist set 

|Total time for two sets= 9.620011
|Total execution time= 9.620011

*|The Total Hamming Distance is: 1248788142

real    0m9.815s
user    0m9.700s
sys 0m0.116s

Here both real and user are roughly the same value, and also roughly the same as for the serial version. If I remove schedule(dynamic, 10000) completely and let OpenMP decide for itself, I get

$ time ./openmptest 10000 10000 100
 HamDist set 

|Total time for two sets= 9.187761
|Total execution time= 9.187761

*|The Total Hamming Distance is: 1248788142

real    0m4.819s
user    0m9.265s
sys 0m0.112s

That is 5/9 instead of 9/9. If I change omp_set_num_threads(2) to 4 instead (I have four CPUs here), I get

$ time ./openmptest 10000 10000 100
 HamDist set 

|Total time for two sets= 11.438243
|Total execution time= 11.438243

*|The Total Hamming Distance is: 1248788142

real    0m3.080s
user    0m11.540s
sys 0m0.104s

That is 3/11 < 5/9 < 9/9. So it works as expected if you let OpenMP handle it by itself. Removing omp_set_num_threads() entirely made no difference compared to the last run.

You have a very simple program where OpenMP's defaults work quite well. Fine-tuning OpenMP is a science in and of itself, but @Davislor's comment about using a reduction, for example, seems like a good place to start.
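For concreteness, the reduction suggested in the comments could look like the sketch below (collapse(2) and schedule(static) are the other tweaks proposed in the comments, not part of the original code; the function name total_hamming is made up for illustration). Each thread accumulates its own partial total and OpenMP combines them at the end, so no shared counter is contended:

```c
#include <stdint.h>
#include <omp.h>

/* Computes all pairwise Hamming distances between m strings in setA
   and n strings in setB (each of length I), stores them in HamDist,
   and returns the grand total via a reduction. */
long total_hamming(char **setA, char **setB, uint16_t **HamDist,
                   int m, int n, int I)
{
    long total = 0;
    int k, p;
    #pragma omp parallel for collapse(2) reduction(+:total) schedule(static)
    for (k = 0; k < m; k++) {
        for (p = 0; p < n; p++) {
            int count = 0, l;
            for (l = 0; l < I; l++)
                if (setA[k][l] != setB[p][l])
                    count++;
            HamDist[k][p] = (uint16_t)count;
            total += count;
        }
    }
    return total;
}
```

Note that count stays private per (k, p) pair; the reduction applies to the running total, which was the shared variable in the original code.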

BTW: You also get a lot of warnings; one of them is about shadowing count, which you declared twice, once before the loop and once inside it. You should get rid of all the warnings. More often than not, a very significant piece of information is hidden among those dozens of warnings.

deamentiaemundi
  • To tell the truth, while compiling it (gcc -fopenmp -o openmptest openmptest.c) I had no warnings, apart from the variable count that was declared twice. Despite that, after correcting all the warnings and doing what you suggested, the time is significantly better! Thank you for your help! – Sotiris Dimitras Mar 16 '18 at 09:48
  • @SotirisDimitras why did you have no warn...oh my fault, sorry. I used clang's "paranoid" mode `clang -Weverything ...` to generate the warnings because its static analyzer is better than GCC's and forgot to add the obligatory `-W -Wall -Wextra` to the GCC invocation line, so: my fault, sorry. – deamentiaemundi Mar 16 '18 at 14:58