0

I have an array pointer int * counters in which there are some counters of all letters of the alphabet. There is a char * array with random letters and I have to count all the occurences of all letters in the array using AVX,AVX2 vectors. I am using __256i of 8-bit vectors, so 32 characters at a time and I want to update the counters based on the 32 characters I have at each iteration in the vector. The problem is that we cannot know which letters there are at the iteration, so the only way I could think of is update the counters serial based on the characters, that is:

for (int j = 0; j < 32; j++) {
   counters[array[j]-'a']++;
}

So, if it is possible, how could I change this for loop with SIMD executions so that I update more counters at a time and not updating them serial? Is there any instruction like gather that instead of getting data from memory, storing it in different locations in it?

I am sorry if what I am asking isn't that clear, so please ask to elaborate

Thank you in advance.

  • Do you want to focus on the "scatter" aspect, or histogramming in general? – harold Feb 05 '20 at 07:46
  • Begin by using normal compiler optimizations, and then look at the generated assembly code to see how good it might be already. Perhaps the compiler optimizations are good enough? If not then profile and benchmark, and see if this is truly a bottleneck, and if it's in the top-three. Generally, everything but the top two or three bottlenecks could be left for the compiler optimizer to deal with. – Some programmer dude Feb 05 '20 at 07:47
  • 1
    AVX2 has gather, but no scatter until AVX512. And conflicts (multiple elements with the same index) are hard to handle efficiently, but you'll have a lot with letters of the alphabet. It's not worth it, and not even worth trying without AVX512 on a CPU with efficient scatters. – Peter Cordes Feb 05 '20 at 08:02
  • @Someprogrammerdude Well, actually I tried to unroll the loop myself instead of using a for loop and it runs a little faster. But what I would want to do is using avx for that so that I could parallelize. I just can't think of anything for tha updating of the counters – Demetris Shimitras Feb 05 '20 at 08:09
  • If `array` contains only values between `'a'` and `'a'+31` you could load 4 bytes and extend them to 64bits using `VPMOVZXBQ`, then subtract `'a'` and multiply by two and shift a vector of 64bit `1` by that amount (using `VPSLLVQ`), and add three of those. That sum can be split into even and odd counts by anding with `0x33` (before and after shifting by two) and you can add 5 of those results. Then do the same with 15 of those (shifting by four/anding with `0x0f`) and so on. I did not check the details (register usage, latency, etc -- it might actually be better to start with 32bit registers). – chtz Feb 08 '20 at 14:23

0 Answers0