0

I am trying to do something with SIMD calculations. I have come quite far in my problem where I then get stuck and wonder how this could be done.

I think the easiest way is to describe this step by step what I have done:

I use Vector128<byte> which then handles 16 bytes at a time

  1. I have created a 2 dimensional array(array2D) with 9 columns and 16 rows per column. I have put the numbers in a sequence of: 0 and 2. This means that for example Row: 0 has only 0s. Row: 1 has only 2s etc.

  2. Now I Avx.LoadVector128 for each column/dimension which gives: 9 Vector128<byte> which I put in: dimensionLIST

  3. Now the task is to count how many of the numbers: 0 and 2 that could be found on EACH ROW. (We have 16 rows). This information is in the end stored in: counts[0]

  4. Looking at the result of counts[0] in the MessageBox. Below is shown: MessageBox.Show(counts[0]);

(represents 16 rows)
[0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9]

9, 2s were found on every other row.

Now the goal is to count how many "9" that were found in:
[0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9] which is 8.

So somehow we want the the integer 8 as Scalar somehow here?

    public unsafe static void SIMDfunction()
    {
        //Create dummy values
        byte[,] array2D = new byte[9, 16]; byte num = 0;
        for (int i = 0; i < 9; i++)
        {
            for (int i2 = 0; i2 < 16; i2++)
            {
                array2D[i, i2] = num;
                if (num == 0) { num = 2; } else { num = 0; }
            }
        }


        /*----------------------------------------------------------------------------------------*/
        unsafe
        {
            //Below starts SIMD calculations!
            fixed (byte* ptr = array2D)
            {
                //Add all 9 dimensions as Vector128
                List<Vector128<byte>> dimensionLIST = new List<Vector128<byte>>();
                for (int i = 0; i < 9; i++)
                {
                    byte* featuredimension = &*((byte*)(ptr + i * 16)); //This gives the first dimension with start: 0
                    dimensionLIST.Add(Avx.LoadVector128(&featuredimension[0])); //add "featuredimension" as a vector of the 16 next numbers: [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]
                }


                //Now count how many of: 0,1,2,3 are found in total in all "dimensionLIST" together?
                Span<Vector128<byte>> counts = stackalloc Vector128<byte>[1];
                Span<Vector128<UInt64>> sum64 = stackalloc Vector128<UInt64>[1];
                byte nr2 = 2; byte nr3 = 9; 
                for (int i = 0; i < dimensionLIST.Count; i++) //Each column
                {
                    //Compare: dimensionLIST[i] with Vector128 val to find out how many matches of 2 in this loop
                    //[0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2], [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]
                    var match = Avx.CompareEqual(dimensionLIST[i], Vector128.Create(nr2)); //Create Vector128 for numbers: 2
                    counts[0] = Avx.Subtract(counts[0], match);
                }
                //STEP1: Show result on how many 2s are found == 9 occurences of "2"!
                var result = Avx.CompareEqual(Vector128.Create(nr3), counts[0]); //counts[0]: [0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9] (In total 9 2s are found on those indexes)


                //result:[0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255] Puts - 1 where integer == 9
                MessageBox.Show(result.ToString());

                //Now the goal is to count how many "9" that were found in: [0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9] which is 8.
                //So somehow we want the the integer 8 as Scalar somehow here?
            }
        }
    }
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Andreas
  • 1,121
  • 4
  • 17
  • 34
  • Possibly related: [Micro Optimization of a 4-bucket histogram of a large array or list](https://stackoverflow.com/q/61122144) (with C++ intrinsics) - you're already doing the compare/sub, but you can infer `counts[3]` from `total - counts[0..2]` so your inner loop is smaller. – Peter Cordes Jul 18 '20 at 19:54
  • @Peter, yes that is true. It is related to the link there which I remember well you helped me with. I am not sure but I think the problem I have now is a little bit different? Where I try to find the largest number in `counts` where I in the end needs to create this new `Vector128` where I also have to put the number `0,1,2 or 3`? – Andreas Jul 18 '20 at 20:00
  • I think I need to wait with he bytes to words perheps if it is possible to see if it is possible to do anything for the main problem? – Andreas Jul 18 '20 at 20:02
  • Oh I'd forgotten that was also your question. :P You can still infer `counts[3]` from `set1(total/16) - counts[0..2]`. Assuming you only do this for the part of the array that's a multiple of the vector width, you can calculate exactly how many elements you've looked at for each vector slot as `total >> 4`. – Peter Cordes Jul 18 '20 at 20:12
  • [Find min/max value from a \_\_m128i](https://stackoverflow.com/q/40985572) shows how to find the horizontal min of bytes with `phminposuw`. [x86 Assembly - 2 largest values out of 4 given numbers](https://stackoverflow.com/q/40229202) shows how to use it for signed max, by subtracting from `0x8000` to map the signed range to unsigned in reverse order. Unsigned to unsigned could just subtract from 0, I think. You'd then need some post-processing to decide which of two maxes is the right one (2 word vectors from 2 halves of a byte vector). – Peter Cordes Jul 18 '20 at 20:20
  • yes :P. I am not sure if I understand. If I am thinking right. I think where I am stuck is that `counts` has if I take my example 4 different `Vector128` with 16 indexes each. If we look at the first index in each of those 4 Vectors. There is `10,0,0,0` where 10 means 10 zeros were found. It could also be: `2,5,3,2` which means that 5 represents 5 1s were found which then has the highest number of the 4 Vectors. – Andreas Jul 18 '20 at 20:21
  • The `set1(total/16) - counts[0..2]` I'm talking about is a SIMD subtraction, with `Avx.Subtract`. If you had 10 vectors of inputs, then total/16 = 10. This is just a faster way to get `counts[3]`, it doesn't avoid needing the horizontal-max post-processing step that you want to implement somehow. – Peter Cordes Jul 18 '20 at 20:24
  • Oh, I think I misunderstood the post-processing you want. You need a vertical max for each SIMD element, but instead of the value you want an integer that represents which vector it came from. "if most 0s are found on the row" was a confusing way to describe that, because you aren't looking for `0`, you're looking for `counts[0][i]` being > `counts[1..3][i]` – Peter Cordes Jul 18 '20 at 20:30
  • Yes that is right as you said last. I describe it a bit strange thats true :) But that is right. I am looking for the integer that represent which vector it came from that has the largest value. It was not possbile to write: `counts[0][1]` in the compiler. But yes how would one do this. As now I can only think of looping traditionally through all elements but wonder if there is any SIMD instruction that can vertically compare all 4 Vectors in `counts`. Sorry for my confusion. This is tricky for me :) – Andreas Jul 18 '20 at 20:40
  • So your 2D array is column-major, so a SIMD vector gets 16 different rows from the same column? Normally C uses row-major layout, where each SIMD element would be from a different column. I assume C# is similar. It would be better to state that up front in your question so it's easier to follow your description. Also, 9, 2s can be written less clumsily as "nine 2s" or if you have multiple things and/or larger numbers `9x 2` (where the "x" indicates a repeat count). – Peter Cordes Jul 19 '20 at 23:12
  • Point 3 is clumsily worded. It's not clear if `counts[0]` is supposed to count matches for either `0` or `2`, or if it literally counts `0` and also encodes the count for `2` via `total-counts[0]`. – Peter Cordes Jul 19 '20 at 23:19

2 Answers2

2

Answer to edited question: counting matches horizontally for one vector

This seems oversimplified, like in practice you won't know that 9 is the value you're looking for. But since you hard-coded that in your source, maybe that's what you want.

You're on the right track with pcmpeqb to find exact matches for that element you're looking for. Then you just need to horizontally sum those matches:

// This is C, sorry I don't know C#
#include <immintrin.h>

// input: a vector like {0,9,0,9,...} and a value like 9 or 0
int count_matches(__m128i v, int val)
{
    __m128i matches = _mm_cmpeq_epi8(v, _mm_set1_epi8(val));  // 0 or -1
    matches = _mm_sub_epi8(_mm_setzero_si128(), matches);     // 0 or +1
    __m128i hsums = _mm_sad_epu8(matches, _mm_setzero_si128());  // psadbw against 0 = hsum of each qword half = count ones
    __m128i high = _mm_unpackhi_epi64(hsums, hsums);      // punpckhqdq to extract high half
    hsums = _mm_add_epi32(hsums, high);                // paddd or paddw would be fine
    unsigned sum = _mm_cvtsi128_si32(hsums);           // movd to extract the low 32-bit scalar

    return sum;
}

(Godbolt - fun fact: clang10 "optimizes" the sub to pand with set1_epi8(1) from memory, even with -march=skylake where vpsubb has the same performance as vpand. Silly compiler.)

i.e. just horizontal sum the number of "true" elements from the pcmpeqb result.

Negating with psubb (or doing pand with set1(1)) before psadbw is more efficient than anything I could come up with for turning sum*255 into sum, if we had just horizontally added the 0 or 255 elements.

-(int8_t)sum or (int8_t)-sum compiles to movsx eax, al / neg eax which is 2 instructions (assuming we need the result as a 32-bit int), worse than vpsubb against a zeroed vector that already exists. Without AVX it might be better though, or if you're bottlenecked on back-end SIMD execution ports instead of the front-end.

sum/255 would obviously be bad, compilers don't have enough info to optimize that, that's why my answer doesn't use it. Another option is (sum + 16) >> 8, which happens to give the right answer for all i*255 values from 0*255 through 16*255. But shifts on Intel CPUs run on port 0 or 6, not any ALU port, so that's probably worse than neg/movsx. neg and movsx can run on any ALU port, so are most flexible at avoiding / not contending with back-end ALU pressure from surrounding code.

vpsubb runs on any of p0, p1, p5 on Intel Skylake and later, but less flexible on earlier CPUs. Without AVX, it might need a movdqa register copy, or redoing an xor-zeroing to make a new zero vector for psadbw.


Answer to original / title question, finding which of 4 vectors the max come from in each vertical SIMD element

After you have counts[0..3], if you have spare bits at the top of your counts, left shift by 2 and OR with a 0..3 tag number (to record where it came from) so you can use pmaxub to select the largest count, bringing the tag along with it.

The SIMD MAX operation will act on the whole (counts[i] << 2) | i, so the count part is the most-significant part of the integer value, but the tag part comes with it as part of the integer MAX operation.

The "tag" will act as a tie-breaker, biasing towards the higher i (i.e. toward 3 in your case). If you need equal counts to be treated as vector 0, then XOR with 3 or count down from 3 or something so the tag bits have integer values in the tie-break order you want.

// kind of pseudo-code, I don't know C# so this is more like C with intrinsics
for (int i=0 ; i<4 ; i++){
    counts[i] <<= 2;   // emulate non-existent psllb somehow; 2x paddb or psllw / pand
    counts[i] |= set1(i);  // low 2 bits = tag
}
__m128i m0 = _mm_max_epu8(counts[0], counts[1]);
__m128i m1 = _mm_max_epu8(counts[2], counts[3]);
__m128i max = _mm_max_epu8(m0, m1);
max = _mm_and_si128(max, _mm_set1_epi8(3));  // discard high 6 bits, keep low 2 = tag

Or if you don't have room to left shift without losing significant bits, unpack with set1(i) and use _mm_max_epu16 (pmaxuw), with a tree of 2x max -> 1x max separately for the high/low halves. So each integer is (count<<8) | i.

Then you have to re-pack down to just the low byte (tag), probably requiring you to mask away the value byte before _mm_packs_epi16 (packsswb). There's no true inverse to punpcklbw / punpckhbw; pack instructions do signed saturation not truncation.

However, the final mask + pack step is just 2x PAND with set1_epi16(0x00FF) on the inputs to feed one packsswb, not too complicated.


You can speed up calculating counts in the first place with Micro Optimization of a 4-bucket histogram of a large array or list - infer counts[3] from set1(total/16) - counts[0..2] (3 SIMD subtractions at the end of the loop, saving a compare/sub every iteration).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • That was interesting. I am trying to follow along. I looked up `pmaxub` and in C# it is: `Avx.Max(Vector128, Vector128)` So if I understand it can only compare 2 vectors at a time. I could actually redesign the core code to use only 2 Vectors in `counts` for 0,2 and 1,3 as it is different logic in the real code and also only have samples that are unequal like: 3,5,7 etc columns to prevent the tie-break with equal counts. But if I compare: `Avx.Max(,)` I will then only get the max value from the 2 vectors. – Andreas Jul 18 '20 at 20:56
  • Then I still need to understand which vector that gave the Max value? and then put a 0 or 2 for the 3rd Vector128 that I will create, which is the new vector somehow? – Andreas Jul 18 '20 at 20:56
  • I am guessing a little. If only comparing 2 vectors wich unequal number of columns. I wonder if it would be possible to use: `Avx.CompareGreaterThan(vector1,vector2)` and then on that result vector somehow exchange with another SIMD instruction all greater values to the integer `2` for example which represents the vector where a greater value was found. – Andreas Jul 18 '20 at 21:11
  • @Andreas: You keep track of where the value came from by putting that as an integer into the least-significant 2 bits (or low byte), before using vertical max. You can then combine as many vectors as you want, and at the end just discard the count part and keep the tag. Think through the code in my answer and you should see how it works for 4 count vectors. I'm not just hand-waving here, this should exactly work as written. (And maybe even compile with GNU C native vectors) – Peter Cordes Jul 18 '20 at 21:18
  • Yes okay, I will try to experiment with this. It will take some time for me I think. I need to try and see how those instructions plays out and see if I can come up with any code. – Andreas Jul 18 '20 at 21:22
  • Yes I will try to understand your example. I think when I rethinking the problem is that each row can either have the vectors 0,2 or 1,3 and this is known beforehand. That means that I could divide all work in 2 sections and only look for 0,2 or 1,3 and not 0,1,2,3 everywhere as there will be double iterations. Sorry that is my fault in my example! So perheps only using 2 vectors at a time will then be more efficient by dividing the work in 2 sections. – Andreas Jul 18 '20 at 21:27
  • I must ask what the `tag` is. That seems important. Can a vector have a `tag` attached to it somehow so it is possible to know which vector that had the greatest count? – Andreas Jul 18 '20 at 21:30
  • @Andreas: I made that more explicit in my answer. For `counts[1]`, it's the 2-bit `01` at the bottom after you shift and OR. Using a few bits of a value for a special meaning is typically called a "tag", like a tagged-pointer. In this case it doesn't disturb the result of plain SIMD MAX operations, which is the point of the answer. – Peter Cordes Jul 18 '20 at 21:32
  • Yes I think I understand how you mean. If I tag numbers then I know which one has changed. I will try to see what code I can come up with knowing all this now and see if I can do it. – Andreas Jul 18 '20 at 21:38
  • I have struggled with some code but I think its better to not post it. I have changed the problem sligthly as this should be more right for the solution I am trying to do. I have changed my original post with code which describes the problem I try to solve now. So I am still counting how many `2s` that could be found on each row. Here the maximum amount that could be found is: `9`. So I want to count how many `9s` that were found which is `8`. – Andreas Jul 19 '20 at 16:34
  • So I am stuck with: `var result: [0,-1,0,-1,0,-1,0,-1,0,-1,0,-1,0,-1,0,-1]`. This should represent that 8, 9s were found. But I am not sure how to get the integer 8 as Scalar from this? It seems that `-1` should represent where a `9` was found. – Andreas Jul 19 '20 at 16:35
  • Correction: `var result: [0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255]` Puts 255 where integer == 9 – Andreas Jul 19 '20 at 16:54
  • 1
    @Andreas: `-1` and `255` are the same number in 8-bit. Updated my answer. – Peter Cordes Jul 19 '20 at 23:30
  • Thank you for all the information Peter. I also felt that `sum/255` felt like something bad. I try to understand what you are writing but I am not sure what I should change exactly. There were so many idéas so I don't know what line that is wrong or what I need to change exactly?. Could I just exchange this line to something else? `Convert.ToInt32(Sse2.Add(sum64[0], upper).ToScalar()) / 255;` – Andreas Jul 21 '20 at 13:48
  • @Andreas: My current answer doesn't use `sum/255`. It presents 2 good ways of solving the 255 vs. 1 problem: SIMD negate before hsum, or scalar 8-bit negate. The code block with `count_matches` uses the SIMD method. You don't need to change anything, just transliterate it to C# intrinsics. I changed the phrasing in the paragraph about `sum/255`. I didn't really need to go into that much detail about alternatives, but my first version of that function had used `sum/255` so I was thinking about how bad that was. (It costs a 64-bit multiple and a shift, not a disaster by any means.) – Peter Cordes Jul 21 '20 at 13:50
  • I am not sure, I try to understand the c++. I am not enterily sure what the syntaxes are there: `unsigned sum = _mm_cvtsi128_si32(hsums); ` Is this the line that I should write in C# instead of `Convert.ToInt32(Sse2.Add(sum64[0], upper).ToScalar()) / 255;` Because I need to remove this `/ 255` solution somehow. Should I remove the whole line and replace it? `Negate before hsum` This one I am not sure of what to do. Negate means to change the operator sign from for example - to + if I understand?. There is no `Negate` syntax under the `Avx` or `Sse2` class which the 2 ones I can use. – Andreas Jul 21 '20 at 14:13
  • @Andreas: Negate is just an English word. I'm doing it by subtracting from zero. Read the comments on the first 2 lines of intrinsics; after `0 - cmp`, it's a vector of 0 or +1 elements. `_mm_cvtsi128_si32(hsums)` is equivalent to your `Sse2.Add(sum64[0], upper).ToScalar()`. I commented the code with the equivalent asm instruction, and with text description of what it does. Are those not as easy to understand as I hoped? I thought I made it pretty easy to follow and made sure every step was explained. – Peter Cordes Jul 21 '20 at 14:33
  • Yes I wondered about this Negate. I could see it all over google and kept wondering. Then I know. You describe wonderfully. It is only me who gets a little bit confused. c++ to c# makes it a bit more difficult. I have changed my answer below. I write a comment under that now. – Andreas Jul 21 '20 at 15:10
0

I have tried to create a code solution. However the code gives the correct answer: 8 in the end.

Somehow I want to count how many of: 255 are found in:
result:[0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255]

I then SumAbsoluteDifferences and UnpackHigh.

The thing I wonder about here that I don't know if it is a correct way or OK to do it like this where I divide by: 255? I only did it because it gives the correct answer with a mathematical arithmetic:

int NUM = Convert.ToInt32(Sse2.Add(sum64[0], upper).ToScalar()) / 255;

    public unsafe static void SIMDfunction2()
    {
        //Create dummy values
        byte[,] array2D = new byte[9, 16]; byte num = 0;
        for (int i = 0; i < 9; i++)
        {
            for (int i2 = 0; i2 < 16; i2++)
            {
                array2D[i, i2] = num;
                if (num == 0) { num = 2; } else { num = 0; }
            }
        }


        /*------------------------------------------------------------*/
        unsafe
        {
            //Below starts SIMD calculations!
            fixed (byte* ptr = array2D)
            {
                //Add all 9 dimensions as Vector128
                List<Vector128<byte>> dimensionLIST = new List<Vector128<byte>>();
                for (int i = 0; i < 9; i++)
                {
                    byte* featuredimension = &*((byte*)(ptr + i * 16)); //This gives the first dimension with start: 0
                    dimensionLIST.Add(Avx.LoadVector128(&featuredimension[0])); //add "featuredimension" as a vector of the 16 next numbers: 
                }


                //Now count how many of: 0,2 are found in total in all "dimensionLIST" together?
                Span<Vector128<byte>> matches = stackalloc Vector128<byte>[1];
                Span<Vector128<UInt64>> hsums = stackalloc Vector128<UInt64>[1];
                byte nr2 = 2; byte nr3 = 9;
                for (int i = 0; i < dimensionLIST.Count; i++) //Each column
                {
                    //STEP 0
                    //Compare: dimensionLIST[i] with Vector128 val to find out how many matches of 2 in this loop
                    //[0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2], [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]
                    var match = Avx.CompareEqual(dimensionLIST[i], Vector128.Create(nr2)); //Create Vector128 for numbers: 2
                    matches[0] = Avx.Subtract(matches[0], match);
                }
                //STEP 1: Show result on how many 2s are found == 9 occurences of "2"!
                var result = Avx.CompareEqual(Vector128.Create(nr3), matches[0]); //matches[0]: [0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9] (In total 9 2s are found on those indexes)
                //result:[0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255] Puts 255 where integer == 9

                //STEP 2
                result = Avx.Subtract(Vector128<byte>.Zero, result);

                //STEP 3:
                hsums[0] = Sse2.SumAbsoluteDifferences(result, Vector128<byte>.Zero).AsUInt64();

                //STEP 4:
                Vector128<UInt64> upper = Sse2.UnpackHigh(hsums[0], hsums[0]).AsUInt64(); // punpckhqdq to extract high half

                //STEP 5:
                hsums[0] = Sse2.Add(hsums[0], upper); // paddd or paddw would be fine

                //STEP 6:
                UInt32 SUM = Convert.ToUInt32(hsums[0].ToScalar());

                //This returns: 8 which should be the correct SUM
                MessageBox.Show(SUM.ToString());
            }
        }
    }
Andreas
  • 1,121
  • 4
  • 17
  • 34
  • The first `Sse2.Add(sum64[0], SAD.AsUInt64())` is pointless. You're just adding it to a vector of all zeros for no reason, which is a no-op. Or if `sum64[0]` isn't zero-initialized, then it's a bug. – Peter Cordes Jul 21 '20 at 14:30
  • I have tried to follow your `count_matches` and translate this to C#. I have created `STEP0-6` just to easier understand where things goes wrong. You mentioned to remove the first `Sse2.Add(sum64[0], SAD.AsUInt64())`. I kept this in the code as for now, just trying to translate your code to C#. I changed the syntaxes to `matches` and `hsum`. – Andreas Jul 21 '20 at 15:12
  • You just changed the var name without removing the wrong or useless `Sse2.Add`. Can't you just do `Vector128 hsums = Sse2.SumAbsoluteDifferences(matches[0], Vector128.Zero).AsUInt64();` I don't see why you're adding it to the initial value of `hsums[0]`. IDK if there's some reason for using 1-element `Span>` instead of just `Vector128`, but that's a separate issue. – Peter Cordes Jul 21 '20 at 15:13
  • You get the wrong answer because you left out the subtract-from-zero step (before the SAD). IDK why you don't get 8*255, though. Maybe you have some other bug as well? – Peter Cordes Jul 21 '20 at 15:14
  • Oh, you're adding `upper` twice for no reason. Once as Step 5, and again as part of step 6. Keep it separate as step 5 and just conver `hsums[0]` to int32, not an Add result. I'd really recommend *understanding* the algorithm, so you can think through *why* each step of the C# code should be written as it is. And can verify the values in a debugger. – Peter Cordes Jul 21 '20 at 15:23
  • I changed to this line: `hsums[0] = Sse2.SumAbsoluteDifferences(matches[0], Vector128.Zero).AsUInt64();` but it still shows `108` in the `SUM`. I might need the `Span` later for more values. So I thought I could keep it. Before the SAD, I only uncommented the: `Avx.CompareEqual`? I think you didn't use it in your code at that step? – Andreas Jul 21 '20 at 15:24
  • Yes thats true I added `upper` twice. I removed it like you said and tried to convert `hsums[0]` to Int32 only. Converting `hsums[0]` to Int32 gave: `System.InvalidCastException: 'Unable to cast object of type 'System.Runtime.Intrinsics.Vector128`1[System.UInt64]' to type 'System.IConvertible'.'` – Andreas Jul 21 '20 at 15:31
  • Yes, that is true. There is something that I am not completely sure of what is happening in the algorithm so it stops me from be able to understand what is wrong and what values I should expect at the different steps. I understand it as a whole but if something unexpected happens, I am not sure of how the logic should be. – Andreas Jul 21 '20 at 15:33
  • Re: using an array instead of just `hsums`: do you follow the same logic for all other variables? Why use `int i` when you could use a `Span` and use `i[0]`. Using an array instead of a single variable is something you usually only do when you actually do have a use for it. And you normally don't need arrays of SIMD vectors, except sometimes for loop unrolling. – Peter Cordes Jul 21 '20 at 15:34
  • 1
    For extracting a scalar, did you try `hsums[0].AsUInt32()`? You could just declare hsums as a vector of 32-bit integers, instead of 64. `punpckhqdq` or `punpckhdq` doesn't matter in this case, and `paddd` could actually be better than `paddq` on some CPUs where it's faster. Anyway, IDK if that helps, I don't know C#'s intrinsic types well enough to see what that problem might be. – Peter Cordes Jul 21 '20 at 15:37
  • Yes I should follow the same logic for the other variables. Interesting, perheps `i[0]` is faster also if using that instead of an array? I see, must think of perheps then change that later too. – Andreas Jul 21 '20 at 15:46
  • I tried to change to: `UInt32 SUM = Convert.ToUInt32(hsums[0])` . This compiled but gave the debug error: `System.InvalidCastException: 'Unable to cast object of type 'System.Runtime.Intrinsics.Vector128`1[System.UInt64]' to type 'System.IConvertible'.'` – Andreas Jul 21 '20 at 15:47
  • I change it to this and now it did work: `UInt32 SUM = Convert.ToUInt32(hsums[0].ToScalar());` but the SUM shows `72` and not `8`. `hsums[0]` after SAD shows in a messagebox: `<36, 36>` so it has added `9*4 times and 9*4 times`. I am not sure how this will result in: 8 anywhere after this. Perheps something else before this should be different? – Andreas Jul 21 '20 at 15:59
  • Well you left out `Avx.CompareEqual`, so you're horizontally adding the vector of `0,9,0,9,...` rather than a vector of `0,1,0,1,...`. IDK what to say, you seriously need to understand what your code is doing. You should have 1 non-comment line of code for every line in my C function. You have the code there but it's commented out. You should be able to use the debugger directly instead of needing to recompile with a debug-print in a different place to examine vectors variables at different steps; that should make debugging faster. – Peter Cordes Jul 21 '20 at 16:09
  • Yes I could only see that you used `Avx.CompareEqual` once which I thought was the one I used in the loop. But you are right I need to understand some basics better. Because I get lost just because I miss some fundamental stuff. I don't understand why I get `255` values instead of `1`. So I added logic at `STEP 2` which resulted everything to give the `SUM: 8` in the end. But are not sure this is correct to do either. True too, it takes time to place messageboxes to examine values. – Andreas Jul 21 '20 at 16:20
  • 1
    You're doing `v - 255`, not `0 - v`. That will fail for any correct result other than `8`. Subtraction *from* zero to turn `-1` into `+1`. (Or in unsigned, 255 into 1 after wrapping). Again, this was totally clear in my code and discussed in comments multiple times, IDK why you invented some math that only happens to work for this special case. SIMD compare results produce all-ones, i.e. 2's complement `-1`, because that makes them useful as AND / ANDN masks. That's why you're doing cmp/sub in your inner loop. – Peter Cordes Jul 21 '20 at 16:25
  • TL:DR: I'm not interested in hearing about every little mistake you make along the way to understanding this. Please at least re-read my answer and go through the vector variable values with a debugger next time you run into something that doesn't work the way you expect, before pinging me with a comment. I'm not very interested in C#, otherwise I'd just write working code in C# in the first place (so you could I guess start using it without understanding how to write it or something). Trying not to be rude, but I'm losing interest in this problem. – Peter Cordes Jul 21 '20 at 16:32
  • I change places so this then works: `result = Avx.Subtract(Vector128.Zero, result);` The problem there was that I didn't understand that you could do this with 255. I know you wrote in a comment: `-1 and 255 are the same number in 8-bit`. But since I have never practiced this. The first natural math is that `0-255 = -255`. So the solution can not come to my mind. – Andreas Jul 21 '20 at 16:48
  • I am very thankful for your help. I can write C# fluently without problem but SIMD is like another language almost because it is so much going on here. With all the syntaxes you mentions in your answer and with c++ that makes it even more complicated where I don't really recognice the syntaxes. One complicated phrase or c++ syntax and ones gets lost easily. It is very easy to be confused when this is so new. It is difficult to find examples in C# also when googling around. I find many c++ examples. I think the code works now as it seems. I thank you very much for your patience and help Peter. – Andreas Jul 21 '20 at 16:50
  • Yeah, 3 different syntaxes (C# and C++ intrinsics, and asm) makes it problematic and easy to get confused. I find it helpful to think in terms of what I want the machine to actually do, i.e. in asm terms, and *then* figure out how to write that in whatever language. Most intrinsics documentation (for C++ or for C#) should list asm mnemonics. That way you avoid getting bogged down in details of how intrinsics choose to expose that functionality, at least when thinking about the design. – Peter Cordes Jul 21 '20 at 17:33
  • Re: numbers: If 8-bit unsigned wrapping didn't occur to you, the obvious operation would be `v & 1` to zero all the high bits of each element. Sse2.And, perhaps, with a vector of 16x `1` bytes. I can see why you might *think of* subtracting 255, but you should have been able to reject it right away because it turns `0` into non-`0`, and the "true" elements into `0`. I guess it's easy to lose sight of the basics when you're new to other parts of what you're doing, though. But still, check your works for mistakes like that before asking for more help, please. – Peter Cordes Jul 21 '20 at 17:35
  • Yes I wonder there are almost no examples in `C#` and so many `c++` examples. Is it not as common to use SIMD in C# perheps. It would be perfect to read some kind of manual for C# which really explains all those small details which me lost sometimes of what really did happen. It is a lot of bits, bytes. To count and `scalar` correctly from the Vectors. Yes you truly goes into details how it really works. Thats is really interesting. You are very good at this :) – Andreas Jul 21 '20 at 18:12
  • High and low bits, I have a fundamental understanding but not an exact understanding which I know confuses it a little along the way. That I need to understand better how this 255 exactly works. I think there are some basics there about bits and unsigned integers etc which I never thought to much about before which seems very important to understand here. Yes absolutely Peter, I will really try to check all mistakes carefully next time. I really understand you. Thousands of small questions from me :) – Andreas Jul 21 '20 at 18:16
  • I think C# SIMD intrinsics are *much* newer, like the last under 5 years at best for the style that uses `Sse2.Add` and so on, and can use `psadbw` at all. vs. the last 20 or more years for C++, probably since SSE1. An unmanaged language lends itself more naturally to SIMD loads/stores. re: 255: you just need to understand fixed-width integers, specifically unsigned vs. 2's complement. https://en.wikipedia.org/wiki/Two%27s_complement. In this case for 8-bit specifically; it's the same problem for `uint8_t` vs. `int8_t` or whatever C# calls them as for a single element of a SIMD vector. – Peter Cordes Jul 21 '20 at 18:34
  • That sounds like it could be that reason. Actually the first language i learned was c++. I wrote 1000 pages there but then I moved to C# instead. It was some time ago. I also thought C++ might be more natural for SIMD loads/stores. Nice, thank you. I will look at that link. If I can lean those things better, things might get a bit easier when using the SIMD syntaxes. – Andreas Jul 21 '20 at 19:00