0

I have a question if this is possible using SIMD instructions at all or if there could be some kind of Hybrid solution SIMD/Normal C# code solution to what I am trying to do or if it is not possible and will sacrifice to much speed of the SIMD instructions for it to be interesting.

My code counts how many 0,1,2,3 that exists in: inputArray (Micro Optimization of a 4-bucket histogram of a large array or list). Result is shown in: countArray
The correct answer to this is: 3,9,1,3.

My question now is this:
As seen we also have passed along: indexArray to the function. If we take 0 from inputArray as an example where we have 3 occurrences. We can see in the indexArray that those occurrences are at indexes: 9,43,5

I wonder if it is possible to add/collect this information in for example 4 buckets as well for each of the possible element values (0,1,2,3) in inputArray?


In the below code part we can see where the indexes for example: val = 1 is found: 0,9,10

I don't know if it is possible to catch those indexes somehow that could map/get the indexes from the indexArray?

The main goal to for example: ´0´ is to get an array in the end like this:
int[] indexes0 = new int[3]; //But we dont know that it will be 3 beforehand indexes0[0] = 9; indexes0[1] = 43; indexes0[2] = 5;

[1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0]

// accumulate match counts into 4 vectors of 8-bit counters, one for each val
var v = Avx.LoadVector128(&fixedInput[0]);
for (byte val = 0; val < 4; val++)
{
   //[1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0] -CompareEqual to: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
   var match = Avx.CompareEqual(v, Vector128.Create(val)); 
   counts[val] = Avx.Subtract(counts[val], match);   // count -= 0 or -1
}

(Please note: I have a Piledriver CPU that can't run AVX2)

Complete Code (simplified for an array that's only one vector-width long):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
private void button3_Click(object sender, EventArgs e)
{
    //Create 16 indexes with numbers between: 0-3. The goal is to count how many of those occurences we have for the numbers: 0-3
    byte[] v1 = new byte[16]; int[] v2 = new int[16];

    v1[0] = 0; v2[0] = 9;
    v1[1] = 1; v2[1] = 12;
    v1[2] = 2; v2[2] = 55;
    v1[3] = 3; v2[3] = 23;
    v1[4] = 1; v2[4] = 568;
    v1[5] = 1; v2[5] = 4;
    v1[6] = 1; v2[6] = 6;
    v1[7] = 1; v2[7] = 1008;
    v1[8] = 1; v2[8] = 188800;
    v1[9] = 0; v2[9] = 43;
    v1[10] = 0; v2[10] = 5;
    v1[11] = 3; v2[11] = 2;
    v1[12] = 1; v2[12] = 1;
    v1[13] = 1; v2[13] = 58;
    v1[14] = 1; v2[14] = 8;
    v1[15] = 3; v2[15] = 15;
    /*---------------*/

    ReadOnlySpan<byte> inputArray = v1;
    ReadOnlySpan<int> indexArray = v2;

    //Call function
    countElements(inputArray, indexArray);
}

// simplified for inputArray.Count == 16 exactly, no looping or cleanup
// shows counts, doesn't find positions at all
private unsafe static void countElements(ReadOnlySpan<byte> inputArray, ReadOnlySpan<int> indexArray)
{
    int[,] indexes0123 = new int[,] { }; //Store arrays for indexes found for the numbers: 0,1,2,3 in the "indexArray"


    //Below starts the SIMD calculations!
    int[] countArray = new int[4];
    Span<Vector128<byte>> counts = stackalloc Vector128<byte>[4];
    Span<Vector128<UInt64>> sum64 = stackalloc Vector128<UInt64>[4];
    unsafe
    {
        fixed (byte* fixedInput = inputArray)
        {
            var v = Avx.LoadVector128(&fixedInput[0]);
            for (byte val = 0; val < 4; val++)
            {
                //[1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0] -CompareEqual to: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
                var match = Avx.CompareEqual(v, Vector128.Create(val)); 
                counts[val] = Avx.Subtract(counts[val], match); 
            }

            //Here SumAbsoluteDifferences
            for (int i3 = 0; i3 < 4; i3++)
            {
                sum64[i3] = Sse2.Add(sum64[i3], Sse2.SumAbsoluteDifferences(counts[i3], Vector128<byte>.Zero).AsUInt64()); //sum64: <2,0,0,0,3,0,0,0>
            }

            //UnpackHigh and get the lower element from the Vector128<UInt64>
            for (int i3 = 0; i3 < 4; i3++)
            {
                Vector128<UInt64> upper = Sse2.UnpackHigh(sum64[i3], sum64[i3]).AsUInt64(); //3
                countArray[i3] += Sse2.Add(sum64[i3], upper).ToScalar();
            }


            //We now know how many 0,1,2,3 we have of each: (But we also need to collect which indexes from "indexlist" each of these 4 buckets has)
            //3,9,1,3
            MessageBox.Show(countArray[0] + "," + countArray[1] + "," + countArray[2] + "," + countArray[3]); 
        }
    }
}
Andreas
  • 1,121
  • 4
  • 17
  • 34
  • If you left-pack a vector of `[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]` according to the compare-mask in `match`, that would do it. Of course for this to work with medium to large arrays, the indices will have to be 16 bit or 32-bit, if not 64-bit, so you'd have to unpack to widen the elements before adding to a vector of `[base, base, base, base]` that you increment with `Sse2.Add()` with a vector of `[4,4,4,4]`. – Peter Cordes Apr 16 '20 at 18:47
  • See [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) for more about left-packing, but you'll want a 4-bit LUT of `pshufb` masks instead of calculating one on the fly. (And you don't have `pdep` and it's slow on AMD Zen even if you did have a newer CPU.) From an `Sse2.CompareEqual` mask, you get a bitmap with `pmovmskb` and use `>>` and `& 0xf` to get 4-bit chunks of it for four 4-element vectors of `[base +0, base + 1, base + 2, base + 3]`. – Peter Cordes Apr 16 '20 at 18:50
  • You don't actually have to *unpack* your byte elements, just use that mask to operate on a separate vector of index counters that you keep in sync by incrementing with Sse2.Add. – Peter Cordes Apr 16 '20 at 18:51
  • I am not exactly sure that I follow how you mean now. I edited/added one info in my post. Just to be sure I am more clear what I try to do. The goal to for example: `0` is to know what indexes that were found and get an array in the end like this: `indexes0[0] = 9; indexes0[1] = 43; indexes0[2] = 5;`. If I understand this is the mask I have: `[1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0]` I can see indexes: `0,9,10` there. From here I am not sure what to do exactly as I want the: `9,43,5` values? If I would keep anything in sync it would be an `Int32` vector with only 4 elements but I use a `byte/16`? – Andreas Apr 16 '20 at 19:26
  • 1
    Yes, you'd need to left-pack one vector of 4x int32 for every 4 bits of the 16-bit mask from your 16x int8 vector compare result. This sucks a lot, and will *greatly* slow down your code. It might not be faster than just doing this scalar, especially without AVX2. For each input vector of 16 values, counting 3 buckets needs 3x `pcmpeqb`, 3x `psubb` (and some occasional outer-loop work). This would raise that to 4x each of those, an additional 4x `pmovmskb`, 4 * 4 vector loads, 4 * 4 `pshufb` (SSSE3, won't work on your K10), 4 * 4 unaligned stores, and much scalar shift / mask / popcnt work. – Peter Cordes Apr 16 '20 at 19:36
  • Yes you are right this is what I am afraid of also. Perheps I will slow down the code so much by doing this and I should try to reorganise my base code completely and see if I can avoid this. But I will still be able to use this code as it is for 90% of the calculations that I do. – Andreas Apr 16 '20 at 19:41
  • I changed it to `int` now that was a mistake. I don't know how many code snippets I have now with this code so I think I pasted the wrong one in the question. We don't want doubles. – Andreas Apr 16 '20 at 19:42

0 Answers0