Why does the time increase when using more threads in OpenMP?

Question

I am trying to understand OpenMP, so I wrote a simple program that compares the computation times of a parallelized matrix-vector multiplication. It runs with different matrix sizes (1024, 2048, 8192), different numbers of threads (1, 2, 4, 8), and different scheduling strategies (static, dynamic, guided). I ran the program on a machine with two cores and four threads.

The times are:
Time for 1 threads with 1024 entries and scheduling 0: 26720 ticks
Time for 1 threads with 8192 entries and scheduling 0: 1486755 ticks
Time for 2 threads with 1024 entries and scheduling 0: 159161 ticks
Time for 2 threads with 8192 entries and scheduling 0: 22254787 ticks

But that does not make sense: the number of CPU ticks increases by a factor of around 5 to 15 when going from one thread to two. The times are a little better again for 4 and 8 threads.

The code is

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif

void matrix(unsigned int n)
{
  // The big arrays we want on the heap
  float *matrix = (float *)malloc(sizeof(float) * n * n);
  float *vector = (float *)malloc(sizeof(float) * n);
  float *result = (float *)malloc(sizeof(float) * n);

#pragma omp parallel for
  // initialize matrix
  for (int row = 0; row < n; row++)
  {
    for (int column = 0; column < n; column++)
    {
      *(matrix + (row * n) + column) = rand();
    }
  }

// initialize vectors
#pragma omp parallel for
  for (int row = 0; row < n; row++)
  {
    *(vector + row) = rand();
    *(result + row) = 0;
  }

// multiply
#pragma omp parallel for
  for (int row = 0; row < n; row++)
  {
    for (int column = 0; column < n; column++)
    {
      float resultat = *(matrix + (row * n) + column) * *(vector + column);
      *(result + row) += resultat;
    }
  }
}

int main()
{
  time_t t_t;

  // Initialize the random number generator
  srand((unsigned)time(&t_t));

  unsigned int threads[] = {1, 2, 4, 8};
  unsigned int amounts[] = {1024, 2048, 8192};
  omp_sched_t schedules[] = {omp_sched_static,
                             omp_sched_dynamic,
                             omp_sched_guided};

  size_t size_threads = sizeof(threads) / sizeof(threads[0]);
  size_t size_amounts = sizeof(amounts) / sizeof(amounts[0]);
  size_t size_schedules = sizeof(schedules) / sizeof(schedules[0]);

  // Vary the number of threads
  for (int t = 0; t < size_threads; t++)
  {
    omp_set_num_threads(threads[t]);
    for (int a = 0; a < size_amounts; a++)
    {
      for (int s = 0; s < size_schedules; s++)
      {
        omp_set_schedule(schedules[s], 0);
        clock_t start_t = clock();
        matrix(amounts[a]);
        clock_t end_t = clock();
        printf("Time for %u threads with %u entries and scheduling %d: %ld ticks\n\a", threads[t], amounts[a], s, (long)(end_t - start_t));
      }
    }
  }

  return 0;
}

Is there a mistake in my code, or is there another explanation for this behavior?

Edit: I also tried the gettimeofday() function, like this:

  struct timeval start_time;
  struct timeval end_time;

  ...
  gettimeofday(&start_time, NULL);
  matrix(amounts[a]);
  gettimeofday(&end_time, NULL);
  ...

  printf("Time for %d threads with %d entries and scheduling %d: %f s\n\a", threads[t], amounts[a], s, (double)(end_time.tv_sec - start_time.tv_sec) + (double)(end_time.tv_usec - start_time.tv_usec)/1000000);

with basically the same results:

Time for 1 threads with 1024 entries and scheduling 0: 0.024589 s
Time for 1 threads with 8192 entries and scheduling 0: 1.393275 s
Time for 2 threads with 1024 entries and scheduling 0: 0.117452 s
Time for 2 threads with 8192 entries and scheduling 0: 25.067069 s

There is a mistake in your code: neither `matrix`, `vector`, nor `result` is freed at the end of your `matrix()` function. A machine with 2 cores suggests an older machine with little RAM, and an 8192x8192 square matrix of floats takes 256 MB. You might be hitting swap space during your benchmark, which would show up as awful performance. Also, I prefer to use [`clock_gettime()`](https://man7.org/linux/man-pages/man2/clock_gettime.2.html) with the `CLOCK_MONOTONIC` option nowadays. – Daniel Dearlove, Oct 25 '21 at 14:15

score 2 · Answer 1 · answered Oct 23 '21 at 13:14

I think this is related to the following question: OpenMP time and clock() give two different results

The thing is, clock() probably returns the total CPU time consumed by the process, summed over all of its threads, so the value grows as more threads are active. It says little about the real (wall-clock) time that elapsed. I suggest using the gettimeofday() function to measure real time and comparing that result with clock().
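To make the difference between CPU time and wall-clock time visible, here is a minimal sketch. It assumes a POSIX system and uses `clock_gettime()` with `CLOCK_MONOTONIC` for wall time, as suggested in a comment on the question. The helper names (`wall_seconds`, `cpu_seconds`, `busy_work`, `report_timings`) and the workload are made up for illustration and are not part of the original program.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Wall-clock seconds read from a monotonic clock (POSIX). */
static double wall_seconds(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Processor time consumed by the process so far, as reported by clock().
   On typical Linux systems this is summed over all threads. */
static double cpu_seconds(void)
{
  return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical workload: sums 1/(i+1) for i in [0, n).
   Runs in parallel only when compiled with OpenMP enabled (e.g. -fopenmp). */
static double busy_work(long n)
{
  double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < n; i++)
  {
    sum += 1.0 / (double)(i + 1);
  }
  return sum;
}

/* Times busy_work() with both clocks and prints the two results. */
static void report_timings(long n)
{
  double wall_start = wall_seconds();
  double cpu_start = cpu_seconds();
  double sum = busy_work(n);
  double wall_end = wall_seconds();
  double cpu_end = cpu_seconds();
  printf("sum=%f wall=%.6f s cpu=%.6f s\n",
         sum, wall_end - wall_start, cpu_end - cpu_start);
}
```

Calling `report_timings(100000000L)` from `main()` with several threads should typically show a CPU time larger than the wall time, which is the effect described above. If the wall time also grows with the thread count, as in the question's gettimeofday() numbers, the cause lies elsewhere.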

questioner


Okay I see, but the results stay basically the same - Time for 1 threads with 8192 entries and scheduling 0: 1.393275 s - Time for 2 threads with 8192 entries and scheduling 0: 25.067069 s – Jakob Graf Oct 23 '21 at 13:34
As others have outlined already, the general problem is that your code is likely to be memory bound (because you only sum up values, which is a fast operation on modern CPUs). You should also be aware of a possible NUMA architecture, which can cause your code to stall. It is good advice to bind your threads to a specific HW core/thread, let each thread allocate its own block of memory, and use a "first touch policy" to take advantage of the thread binding. – questioner Oct 23 '21 at 13:38
Take this as a reference: https://moodle.rrze.uni-erlangen.de/pluginfile.php/12994/mod_resource/content/1/08_ccNUMA.pdf, I found it quite interesting. – questioner Oct 23 '21 at 13:45
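The "first touch" idea from the comment above can be sketched as follows. This is an illustration under assumptions, not the asker's code: the function names `alloc_first_touch` and `matvec_static` are hypothetical, and it relies on the operating system placing each memory page near the thread that first writes to it, which is common on Linux NUMA systems but not guaranteed everywhere.

```c
#include <stdlib.h>

/* Allocates an n x n matrix and initializes it in parallel.
   With schedule(static), each thread writes (first touches) the same
   block of rows that it will later read in matvec_static(). */
float *alloc_first_touch(size_t n)
{
  float *m = malloc(sizeof(float) * n * n);
  if (m == NULL)
  {
    return NULL;
  }
#pragma omp parallel for schedule(static)
  for (long row = 0; row < (long)n; row++)
  {
    for (size_t col = 0; col < n; col++)
    {
      m[(size_t)row * n + col] = 1.0f; /* placeholder value */
    }
  }
  return m;
}

/* Matrix-vector product using the same static row partitioning,
   so each thread mostly reads pages it touched first. */
void matvec_static(const float *m, const float *v, float *out, size_t n)
{
#pragma omp parallel for schedule(static)
  for (long row = 0; row < (long)n; row++)
  {
    float acc = 0.0f;
    for (size_t col = 0; col < n; col++)
    {
      acc += m[(size_t)row * n + col] * v[col];
    }
    out[row] = acc;
  }
}
```

Thread binding itself can be requested without code changes through the standard OpenMP environment variables, for example `OMP_PROC_BIND=close` together with `OMP_PLACES=cores`. Whether this helps on a small two-core machine is something to measure rather than assume.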

score 1 · Answer 2 · answered Oct 23 '21 at 13:18

Another problem is that OpenMP has overhead for setting up the threads and distributing the work among them. You need a reasonable amount of work per parallel region; otherwise the overhead is bigger than the gain from parallelization.
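One way to act on this is the OpenMP `if` clause, which runs the loop serially when the problem is too small to pay for the threading overhead. The sketch below is only an illustration: the function name `matvec_if` and the cutoff `PARALLEL_MIN_ROWS` are hypothetical, and a sensible threshold has to be found by measuring on the target machine.

```c
#include <stddef.h>

/* Hypothetical cutoff below which the loop runs on a single thread.
   The value is a placeholder; tune it by measurement. */
#define PARALLEL_MIN_ROWS 2048

/* Matrix-vector product that only goes parallel for large inputs. */
void matvec_if(const float *m, const float *v, float *out, size_t n)
{
#pragma omp parallel for if (n >= PARALLEL_MIN_ROWS)
  for (long row = 0; row < (long)n; row++)
  {
    float acc = 0.0f;
    for (size_t col = 0; col < n; col++)
    {
      acc += m[(size_t)row * n + col] * v[col];
    }
    out[row] = acc;
  }
}
```

Overhead of this kind is usually small in absolute terms, so it would explain a slowdown at 1024 entries more readily than the jump from about 1.4 s to 25 s at 8192 entries.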

Laci


score 0 · Answer 3 · answered Oct 23 '21 at 13:26

In addition to what was said, memory contention probably also plays a role, and how much depends on the amount of L1 and L2 cache available.

Tarik

