Openmp simd (increment vector)

Question

I have tried to apply #pragma omp simd to the following code (loops) but it does not seem to work (no speed improvement). I also tried #pragma omp simd linear but all my attempts resulted in a seg fault.

https://github.com/Rdatatable/data.table/blob/master/src/fsort.c#L209 https://github.com/Rdatatable/data.table/blob/master/src/fsort.c#L184

Is it even possible to increment a vector with simd? Example:

#include <stdio.h>
#include <stdlib.h>

int main() {
  int len = 1000;
  int tmp[len];
  for(int i=0; i<len; ++i) {
    tmp[i]=rand()%100;
  }
  int *thisCounts = (int *) calloc(len, sizeof(int));
  for (int j=0; j<len; ++j) {
    thisCounts[tmp[j]]++;
  }
  for (int j=0; j<len; ++j) {
    printf("%d, ",thisCounts[j]);
  }
  free(thisCounts);
  return 0;
}

FYI, line 209 is the one that takes most time and I am trying to improve.

Thank you

Please [edit] your question to make it self-contained (reduce the code you want to optimize to a [mre]). — chtz, Jun 06 '21 at 15:59
This is essentially a histogramming operation, which doesn’t readily vectorise with SIMD. — Paul R, Jun 06 '21 at 16:34
Some related questions: https://stackoverflow.com/questions/12985949/methods-to-vectorise-histogram-in-simd, https://stackoverflow.com/questions/30970060/optimizing-simd-histogram-calculation, https://stackoverflow.com/questions/38501529/how-to-optimize-histogram-statistics-with-neon-intrinsics — chtz, Jun 06 '21 at 23:48
Also semi related: [Micro Optimization of a 4-bucket histogram of a large array or list](https://stackoverflow.com/q/61122144) does cover the general case (of more buckets than a SIMD vector has elements, which is what made that version efficient). And [How to speed up this histogram of LUT lookups?](https://stackoverflow.com/q/39266476) has some AVX2 / AVX-512 links at the bottom. — Peter Cordes, Jun 07 '21 at 02:53

score 2 · Answer 1 · answered Jun 06 '21 at 19:25

2

It depends of the target hardware architecture. Many processor architectures does not have SIMD instruction performing such kind of indirect accesses. On mainstream x86-64 processors, there is a scatter/gather instruction to perform such a computation. However, they are not efficiently implemented and thus not significantly faster than using non-SIMD instructions. Moreover, using them is difficult here since there is possibly some increment conflicts (if tmp[j1] == tmp[j2] with j1 != j2. The AVX-512 SIMD instruction set contains interesting instructions for that but it is only available on few recent processors. The same apply for ARM with SVE/SVE2 which is very new and not yet available on the vast majority of ARM processors.

Thus, put it shortly, there is very slight chance your processor can possibly do that using SIMD instructions, but it does not means it is not possible on all architecture. Note also that using #pragma omp simd is likely not correct here because of possible conflicts. Note also that the speed of this operation is likely dependent of the input data on a lot of modern processors (random data do not behave like most real-world possible inputs).

answered Jun 06 '21 at 19:25

Jérôme Richard

41,678
6
29
59

2

If the histogram doesn't have a huge number of buckets, and long runs involving the same bucket many times are possible or likely, it can make sense to unroll with multiple arrays of counts, to break up the store/reload serial dependency if `counts[33]` has to get incremented 50 times over a run of 64 inputs, for example. (Reducing at the end down to one array is just a simple array sum, and SIMD can do that.) Example: [Methods to vectorise histogram in SIMD?](https://stackoverflow.com/q/12985949) – Peter Cordes Jun 07 '21 at 02:43
You can of course parallelize across threads as long as each one has its own count array (or set of count arrays). – Peter Cordes Jun 07 '21 at 02:43
1

Hmm, it might be possible to skip conflict detection by having each SIMD vector element use a different array: for a vector of inputs, add a vector like `[0, n, 2n, 3n, ...]` to make each SIMD element use a different region of one large array. Only useful with SIMD scatters (AVX-512 / SVE), or possibly with SIMD gather / manual scatter but then you'd have to redo the store addressing in scalar code. – Peter Cordes Jun 07 '21 at 02:50
3

Yeah, replicating the counts for each histo bucket should work; updated my answer on [How to speed up this histogram of LUT lookups?](https://stackoverflow.com/a/39271518) with C with AVX-512 intrinsics. https://godbolt.org/z/EoT6Taz6z – Peter Cordes Jun 07 '21 at 04:19
Good point. That being said, from my experience, unrolling from relatively small LUT gives almost the best results on relatively recent x86 (Intel/AMD) processors. The issue with using multiple LUTs is that they may not fit all in the cache resulting in a slower computation due to the higher latency. Thus, I experienced mitigated results with replicating LUTs 8x time, 16x or more in practice (4x is often good or 2x for huge LUTs). – Jérôme Richard Jun 07 '21 at 09:04
1

Yeah, 16x (the number of int32 elements in a `__m512i`) is excessive for anything bigger than the 32 bins in the linked Q&A. For this case with 100 bins (from the `rand()%100;`, despite allocating a count array based on the length of the input, not its value-range), probably better to use `__m256i` (8x elements) if using AVX-512 gather/scatter, even though that sacrifices some per-element throughput for `vpgatherdd` / `vpscatterdd` especially on Ice Lake. Or maybe still do conflict detection, but only between matching pairs of elements. (e.g. vextract high half / vpcmpeqd) – Peter Cordes Jun 07 '21 at 09:24

Openmp simd (increment vector)

1 Answers1