
I want to effectively parallelize the following sum in C:

  #pragma omp parallel for num_threads(nth)
  for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];

where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups (which occur in random order), and pout is a pointer to a double array out of size ng, initialized with zeros, which receives the result of summing x over the grouping defined by g.

The code above works, but the performance is not optimal, so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays and the number of threads nth are available to me and fixed beforehand. I cannot access the arrays directly; only the pointers are passed to the function which does the parallel sum.
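
For reference, here is a minimal sketch of that surrounding function as described (the name group_sums is made up; only the pointers and the fixed sizes are passed in):

/* Sketch only: pout points to ng zero-initialized doubles, px to l doubles,
   pg to l group indices in the range [0, ng). */
void group_sums(double *pout, const double *px, const int *pg,
                int l, int ng, int nth)
{
  #pragma omp parallel for num_threads(nth)
  for (int i = 0; i < l; ++i)
    pout[pg[i]] += px[i];   /* several threads may update the same group */
}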

Sebastian
  • either iterate over the groups or apply a reduction over the whole `pout` array if the OpenMP implementation in your compiler supports it – tstanisl Jan 07 '22 at 17:09
  • The kind of indirect addressing you show is about as *cache-unfriendly* as iterative code gets; then you throw multiple threads at it so they can all contend over the shared cache. I wouldn't be surprised to learn that single-threaded execution is as good as it gets. – High Performance Mark Jan 07 '22 at 17:09
  • By "The code above works", you mean the sequential implementation without the OpenMP pragma isn't it? Otherwise, it would be surprising. How big are `l` and `ng` in practice? How many threads do you expect to approximately use? The best resulting algorithm will change a lot based on these values. – Jérôme Richard Jan 07 '22 at 17:24
  • Thanks for the comments. The code is an interior component of some data manipulation software; arrays can be up to 100 million elements or more. The inputs are as they are, and rewriting the whole thing to execute one group at a time would be tedious and not faster. I am a beginner in C programming, @HighPerformanceMark, so if you can show me how to write this loop in a better way I'd highly appreciate it. As far as my benchmarks go, using pointers is not much slower than direct array indexing. Parallel execution gives about a 2x speedup with 4 threads, so it is not optimal. The software runs on PCs. – Sebastian Jan 07 '22 at 18:06

2 Answers

  1. Your code has a data race (at pout[pg[i]] += ...); you should fix it first, then worry about its performance.

  2. If ng is not too big and you use OpenMP 4.5+, the most efficient solution is a reduction over an array section: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng]) (a full sketch is given after this list).

  3. If ng is too big, most probably the best idea is to use a serial version of the program on PCs. Note that your code becomes correct by adding #pragma omp atomic before pout[pg[i]] += ..., but its performance is questionable.
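
A minimal sketch of the reduction variant from point 2, assuming a wrapper like the one in the question (the function name is made up; OpenMP 4.5+ is required for reductions over array sections):

/* Each thread gets a private, zero-initialized copy of pout[0..ng-1];
   OpenMP adds the private copies back into pout when the loop ends. */
void group_sums_reduction(double *pout, const double *px, const int *pg,
                          int l, int ng, int nth)
{
  #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
  for (int i = 0; i < l; ++i)
    pout[pg[i]] += px[i];
}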

Laci
  • Thanks a lot also! I will also check this out. – Sebastian Jan 08 '22 at 09:52
  • So I tried this with 100 million random normal variates and 1 million random groups. Serial execution takes 1.3 sec, execution with two threads about 1 sec, execution with 4 threads about 0.8 sec. The result is more numerically accurate than what I have above (indeed there were data race issues), but small numerical differences remain (mean relative difference 1.4382e-06). Overall I think OpenMP alone does not solve the issue sufficiently well, so I'll go with serial code for the current software release and look into the more elaborate parallel programming answers later. – Sebastian Jan 08 '22 at 14:02
  • @Sebastian Small numerical differences are to be expected with parallelization because floating-point addition is not associative and the order of operations matters. That doesn't mean the parallelized version is worse; it may even be more accurate. If it bothers you, you could use Kahan summation for higher accuracy, but that will cost a lot of runtime (a minimal sketch is given after this comment thread). See https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html for the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" document – Homer512 Jan 08 '22 at 14:48
  • This only works if `pout` is statically allocated, right? Not if it's `malloc`ed or so? – Victor Eijkhout Jan 08 '22 at 15:27
  • Thanks @Homer512 for the clarification; then the result is reasonable. @VictorEijkhout `pout` is filled using `memset(pout, 0.0, sizeof(double) * ng);` – Sebastian Jan 08 '22 at 15:38
  • @VictorEijkhout Reduction of an array works if the original array is dynamically allocated, but as far as I know the local `pout` copy of each thread will be allocated on the stack. – Laci Jan 09 '22 at 11:49
  • @VictorEijkhout If you check the [assembly code](https://godbolt.org/z/TbsTEfosb), you can see that gcc does this. – Laci Jan 09 '22 at 12:59
  • I just tried it and your solution works. For large core counts, however, mine is 10-20 percent faster because it parallelizes the summing of the partial results. Yours is way easier to code though. – Victor Eijkhout Jan 10 '22 at 02:57
  • @VictorEijkhout Indeed, the way gcc does it is very strange. It does the summing in the parallel region, but uses locks (`GOMP_atomic_start` and `GOMP_atomic_end` calls) instead of hardware-level atomic operations. I hope it will be improved in future versions. Have you tried other compilers as well? – Laci Jan 10 '22 at 07:06
  • I usually only test with the Intel compiler which is rather better than GCC where OpenMP is concerned. Your solution segfaults with gcc9. Will research. – Victor Eijkhout Jan 10 '22 at 12:57
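
For reference, a minimal sketch of the compensated (Kahan) grouped summation mentioned by Homer512 above (sequential; it keeps one compensation term per group, and the function name is made up):

#include <stdlib.h>

void group_sums_kahan(double *pout, const double *px, const int *pg,
                      int l, int ng)
{
  double *c = calloc(ng, sizeof *c);   /* one compensation term per group */
  if (c == NULL) return;               /* allocation failed */
  for (int i = 0; i < l; ++i) {
    int g = pg[i];
    double y = px[i] - c[g];           /* corrected next term */
    double t = pout[g] + y;            /* low-order bits of y may be lost here */
    c[g] = (t - pout[g]) - y;          /* recover what was lost */
    pout[g] = t;
  }
  free(c);
}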

From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control this with critical sections or locks will probably only slow the code down.

Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.

Here is the crucial bit of code (a sketch of the surrounding setup follows it):

#pragma omp parallel shared(sum,threadsum)
  {
    // each thread owns the slice threadsum[myfirst .. myfirst+ngroups-1]
    int thread = omp_get_thread_num(),
      myfirst = thread*ngroups;
    #pragma omp for
    for ( int i=0; i<biglen; i++ )
      // no write conflicts: each thread updates only its own slice
      // (counting here; for the question this would be += px[i])
      threadsum[ myfirst+indexes[i] ] += 1;
    // combine the per-thread partial sums, parallelized over groups
    #pragma omp for
    for ( int igrp=0; igrp<ngroups; igrp++ )
      for ( int t=0; t<nthreads; t++ )
        sum[igrp] += threadsum[ t*ngroups+igrp ];
  }
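
A possible surrounding setup, under the assumption that threadsum holds nthreads*ngroups doubles and that both output arrays start out zeroed (the wrapper name grouped_count and the use of calloc are assumptions, not part of the original code):

#include <stdlib.h>
#include <omp.h>

void grouped_count(double *sum, const int *indexes,
                   int biglen, int ngroups, int nthreads)
{
  /* sum is assumed to point to ngroups zero-initialized doubles */
  double *threadsum = calloc((size_t)nthreads * ngroups, sizeof *threadsum);
  if (threadsum == NULL) return;       /* allocation failed */
  omp_set_num_threads(nthreads);

  /* ... the parallel region shown above goes here ... */

  free(threadsum);
}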

Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.

Final word: I also coded @Laci's solution of just using a reduction. Testing on output with 1M groups: for 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2, because the reduction solution repeatedly adds the whole array while I sum it just once, and then in parallel. For smaller numbers of groups the reduction is probably preferable overall.

Victor Eijkhout
  • Thanks! I understand what you are saying and it sounds like a good solution to me. Given that I am new to parallel programming, would it be possible for you to give me a basic template of what that could look like? – Sebastian Jan 07 '22 at 18:24
  • Thanks a lot! I will experiment around with this! – Sebastian Jan 08 '22 at 09:52
  • You stated that you could have 100M elements in the input. What is a typical value of the number of groups? – Victor Eijkhout Jan 08 '22 at 12:39
  • I wonder if the poor performance with N=50 is related to store-forwarding penalties. Basically, a 3-5 cycle penalty when reading values that were written very recently. This naturally has a higher chance of occurring with small N, though I would have expected it to happen with much smaller N. What distribution do your input values have? It could also relate to cache-line bouncing at the border of the per-thread arrays. That should go away when N is a multiple of 8. – Homer512 Jan 08 '22 at 12:55
  • For the N-multiple-of-8 trick, you also need to align the whole array. Try `posix_memalign` with 64 bytes of alignment for the allocation if you have a Unix system (a sketch is given at the end of this thread) – Homer512 Jan 08 '22 at 12:59
  • @Homer512 I knew that and I should have tried it. That was a big help. With a combination of aligned arrays and groups-multiple-of-8 the behavior gets a lot better. With 64 groups: 2 threads take the same time as one, where it used to be 3x slower; the best I get is a 10x speedup with 48 threads. With 6400 groups almost linear scaling to 8-16 threads, then leveling off; 25x with 48 threads. – Victor Eijkhout Jan 08 '22 at 13:36
  • @VictorEijkhout For small N you could try using multiple redundant sums, e.g. 4 sums per value, and cycle through them. That may avoid the load-store forwarding. It would be interesting to see if you have a big performance difference between all-zero indices (worst case) and randomly distributed ones (average case). – Homer512 Jan 08 '22 at 14:03
  • Regarding the number of groups, it can vary widely, but in typical statistical use cases (e.g. summing bilateral trade data from the year-country-partner-product level to the year-country-partner level) this would be a reduction of 80 million observations to 1 million groups. – Sebastian Jan 08 '22 at 14:17
  • Reducing to 1 million is interesting. Then my worries about conflicts in the small case evaporate. On the other hand, @Homer512's approach is at least more memory-efficient: I need a copy of the output per thread, he doesn't. – Victor Eijkhout Jan 08 '22 at 14:29
  • @VictorEijkhout FYI I've deleted my answer as more benchmarks showed that by the point where it is faster than yours (more groups than elements), the sequential case is faster than both our approaches. – Homer512 Jan 08 '22 at 17:15
  • @Homer512 I thought your approach was interesting. And benchmarking depends on the machine. I have 56 core two-socket nodes with tons of bandwidth. – Victor Eijkhout Jan 08 '22 at 18:53
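
For reference, a minimal sketch of the alignment and padding suggestion above (the helper name and the rounding of the group count to a multiple of 8 are assumptions):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Allocate the per-thread partial sums 64-byte aligned and pad each thread's
   slice to a multiple of 8 doubles (one cache line), so slices belonging to
   different threads never share a cache line. */
double *alloc_threadsum(int nthreads, int ngroups, int *padded_ngroups)
{
  int padded = (ngroups + 7) & ~7;      /* round up to a multiple of 8 */
  double *threadsum = NULL;
  if (posix_memalign((void **)&threadsum, 64,
                     (size_t)nthreads * padded * sizeof *threadsum) != 0)
    return NULL;                        /* allocation failed */
  for (size_t i = 0; i < (size_t)nthreads * padded; ++i)
    threadsum[i] = 0.0;                 /* posix_memalign does not zero memory */
  *padded_ngroups = padded;             /* index as thread*padded + group */
  return threadsum;
}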