
I'm testing the performance speedup of some algorithms using OpenMP, and one of them is not scaling. Am I doing something wrong?

PC Details:

  • Memory: 7.7 GiB
  • Processor: Intel® Core™ i7-4770 CPU @ 3.40GHz × 8
  • OS: Ubuntu 15.04 64-bit
  • gcc: gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

Code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(int argc, char **argv) {
  int test_size, i;
  double *vector, mean, stddeviation, start_time, duration;

  if (argc != 2) {
    printf("Usage: %s <test_size>\n", argv[0]);
    return 1;
  }

  srand((int) omp_get_wtime());

  test_size = atoi(argv[1]);
  printf("Test Size: %d\n", test_size);

  vector = malloc(test_size * sizeof(double));
  if (vector == NULL) {
    fprintf(stderr, "Failed to allocate memory\n");
    return 1;
  }
  for (i = 0; i < test_size; i++) {
    vector[i] = rand();
  }

  start_time = omp_get_wtime();
  mean = 0;
  stddeviation = 0;
#pragma omp parallel default(shared) private(i)
  {
#pragma omp for reduction(+:mean)
    for (i = 0; i < test_size; i++) {
      mean += vector[i];
    }
#pragma omp single
    mean /= test_size;

#pragma omp for reduction(+:stddeviation)
    for (i = 0; i < test_size; i++) {
      stddeviation += (vector[i] - mean)*(vector[i] - mean);
    }
  }
  stddeviation = sqrt(stddeviation / test_size);
  duration = omp_get_wtime() - start_time;

  printf("Std. Deviation = %lf\n", stddeviation);
  printf("Duration: %fms\n", duration*1000);

  free(vector);
  return 0;
}

Compilation lines

gcc -c -o main.o main.c -fopenmp -lm -O3
gcc -o dp main.o -fopenmp -lm -O3

Results

$ OMP_NUM_THREADS=1 ./dp 100000000
166.224199ms

$ OMP_NUM_THREADS=2 ./dp 100000000
157.924034ms

$ OMP_NUM_THREADS=4 ./dp 100000000
159.056189ms
DiogoDoreto
  • yeah, I thought about this, then I rewrote this code in Go and got 167ms, 84ms and 31ms... Don't you think C code should at least equal Go timings? – DiogoDoreto Jun 03 '15 at 18:04
  • Well, openMP may not be the right tool to parallelize this. – EOF Jun 03 '15 at 18:44
  • All looks good to me, but try printing the number of threads just to be sure openmp is really enabled. – Les Jun 03 '15 at 18:57
  • @Les yes, it is. I printed `omp_get_num_threads()` inside the parallel block and it showed the right number of threads. – DiogoDoreto Jun 03 '15 at 19:08
  • 1
    The benchmark is making a single pass over a lot of data and doing almost no work at all. It's probably completely memory bound. – Mysticial Jun 03 '15 at 19:39
  • 1
    @Mysticial, I agree it's memory bandwidth bound but that does not mean it should not see some significant benefit from using multiple threads https://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays. So I'm a bit surprised the OP sees essentially no scaling. But I don't have the time right now to think about this. – Z boson Jun 04 '15 at 07:32
  • 1
    I cannot reproduce your results. Compiling your code with gcc 4.8 (same as you) on Ubuntu 15.04 (same as you) gives me, depending on the CPU, a speedup of between 1.13 and 1.83 for 2 threads - which is approximately what you can expect for a memory-bound program like yours (experiments like loop unrolling and replacing your floating-point operations by pure memory access confirmed that it actually is memory-bound). Your speed-up of 1.05 seems far too low, especially considering that I got the highest speed-up on the system that is most similar to yours. – mastov Jun 06 '15 at 12:38
  • Could you try specifying the schedule explicitly for the for loops? I would recommend schedule(static, 16384). I wasn't able to reproduce your results with your code as is, however if I specify the schedule to be dynamic with a small chunk size, I am able to get no speedup. Perhaps your platform's default which is used when the schedule is left unspecified is not static. – user2548418 Jun 09 '15 at 00:43
  • @user2548418 with `schedule(static, 16384)` times got worse, but it's scaling: 1008ms, 535ms and 324ms – DiogoDoreto Jun 10 '15 at 18:56
  • @DiogoDoreto perhaps in that case you could try other scheduling primitives and chunk sizes? https://software.intel.com/en-us/articles/openmp-loop-scheduling – user2548418 Jun 11 '15 at 16:45
  • @user2548418 I've tried many combinations... the times were larger than or equal to those from your former suggestion. – DiogoDoreto Jun 11 '15 at 18:58
  • 1
    Have you verified that you're getting the number of threads you request? Are you doing exactly `OMP_NUM_THREADS=1 ./dp 100000000`? or do you do this is separate lines? If the second case you need `export OMP_NUM_THREADS=1 ./dp 100000000`. – Z boson Jun 12 '15 at 10:53
  • @Zboson yes, I've verified. The command is in one line as shown in the results section, and I've even printed `omp_get_num_threads()` to make sure. – DiogoDoreto Jun 12 '15 at 14:19
  • @DiogoDoreto Are you running this on a Virtual Machine? – Alan Jun 12 '15 at 14:58
  • @Alan no, I'm not running on a VM – DiogoDoreto Jun 12 '15 at 15:08
  • Here you can test your code, edit it, compile, and run and see for yourself http://coliru.stacked-crooked.com/a/3993175059df6430 that it scales. – Z boson Jun 13 '15 at 16:02
  • Why don't you run some benchmark tests on your system and then see how it compares to what you expect with your system? I suggest [STREAM](http://www.cs.virginia.edu/stream/ref.html). You can set the number of threads, and if the bandwidth does not improve with STREAM, that may indicate your system has a hardware problem. What kind of memory does your system have and how is it installed? – Z boson Jun 13 '15 at 16:10

1 Answer


I cannot reproduce your results with Ubuntu 14.04.2 LTS, gcc 4.8, and a 2.3 GHz Intel Core i7. Here are the results that I get:

$ OMP_NUM_THREADS=1 ./so30627170 100000000
Test Size: 100000000
Std. Deviation = 619920018.463329
Duration: 206.301721ms
$ OMP_NUM_THREADS=2 ./so30627170 100000000
Test Size: 100000000
Std. Deviation = 619901821.463117
Duration: 110.381279ms
$ OMP_NUM_THREADS=4 ./so30627170 100000000
Test Size: 100000000
Std. Deviation = 619883614.594906
Duration: 78.241708ms

Because the output listed in the "Results" section of your question does not match the output of the code as listed, you may be running an old version of your code.

I thought about possibly using x86 intrinsics within the parallel for loops, but examining the assembly output, gcc already uses SIMD instructions in this case. Without `-march` options, I was seeing gcc use SSE2 instructions. Compiling with `-march=native` or `-mavx`, gcc would use AVX instructions.

EDIT: Running the Go version of your program, I get:

$ ./tcc-go-desvio-padrao -w 1 -n 15 -t 100000000
2015/06/07 08:26:43 Workers: 1
2015/06/07 08:26:43 Tests: [100000000]
2015/06/07 08:26:43 # of executions of each test: 15
2015/06/07 08:26:43 Time to allocate memory: 584.477µs
2015/06/07 08:26:43 ===========================================
2015/06/07 08:26:43 Current test size: 100000000
2015/06/07 08:27:05 Time to fill the array: 1.322556083s
2015/06/07 08:27:05 Time to calculate: 194.10728ms
$ ./tcc-go-desvio-padrao -w 2 -n 15 -t 100000000
2015/06/07 08:27:10 Workers: 2
2015/06/07 08:27:10 Tests: [100000000]
2015/06/07 08:27:10 # of executions of each test: 15
2015/06/07 08:27:10 Time to allocate memory: 565.273µs
2015/06/07 08:27:10 ===========================================
2015/06/07 08:27:10 Current test size: 100000000
2015/06/07 08:27:22 Time to fill the array: 677.755324ms
2015/06/07 08:27:22 Time to calculate: 113.095753ms
$ ./tcc-go-desvio-padrao -w 4 -n 15 -t 100000000
2015/06/07 08:27:28 Workers: 4
2015/06/07 08:27:28 Tests: [100000000]
2015/06/07 08:27:28 # of executions of each test: 15
2015/06/07 08:27:28 Time to allocate memory: 576.568µs
2015/06/07 08:27:28 ===========================================
2015/06/07 08:27:28 Current test size: 100000000
2015/06/07 08:27:34 Time to fill the array: 353.646193ms
2015/06/07 08:27:34 Time to calculate: 79.86221ms

The timings appear about the same as the OpenMP version.

Daniel Trebbien
  • Thanks for your answer. In my results section I just removed unnecessary content to get straight to the point. I've tested the code on my university's machine, which is very similar to mine, and got the same results. I'm thinking that, since my CPU is already very fast, the overhead of parallelizing may overcome the benefits. Does that make sense? – DiogoDoreto Jun 06 '15 at 20:15
  • @DiogoDoreto: I suppose that could be the case. I am wondering, though, why the program exhibits performance improvements when re-written in Go. Would you mind posting the Go version of your program? Also, what is the speed of your memory? I am using two 8 GiB DDR3 memory banks @ 1600 MHz. – Daniel Trebbien Jun 06 '15 at 20:39
  • 1
    I have 1 of 8192MB @ 1600MHz. You can see the relevant part of Go code here: https://bitbucket.org/DiogoDoreto/tcc-go-desvio-padr-o/src/192b2a83bc928ff6bd402cbc6742f04baeb692a5/parallel.go?at=master – DiogoDoreto Jun 06 '15 at 21:25
  • 1
    Hi @DiogoDoreto. I have updated my answer with timings from running your Go version. I get approximately the same timings as the OpenMP version. – Daniel Trebbien Jun 07 '15 at 12:37
  • Thanks for trying to help and for running the code on your side. But what I am expecting as an answer is an explanation of why this is happening on my machine, as I was able to reproduce this behavior on other systems. – DiogoDoreto Jun 07 '15 at 18:34