I am using Vector128<byte>
in C# in order to count matches from a byte array with 16 index.
This is part of implementing a byte version of Micro Optimization of a 4-bucket histogram of a large array or list, using the technique from How to count character occurrences using SIMD of widening 8-bit counters to 64 inside an outer loop (hsum_epu8_epu64
helper function), and then after all the loops of summing that vector of counters down to one scalar (hsum_epu64_scalar
).
So that C++ with Intel intrinsics has to be ported to C#. And without AVX2 so we're using 128-bit integer vectors, not 256.
The byte array consists of numbers 0
and 1
where 5 0
occurs.
The task now is to count those 5 0
where we can see that 2 of the 0
occurs in the upperband of the Vector128<byte>
and 3 of the 0
occurs in the lowerband of the Vector128<byte>
.
I have succeed with code all the way to where I Sse2.SumAbsoluteDifferences
and can extract the number of 0
for sumHigh
and sumLow
showing 3 and 2 respectively.
The problem starts now where I need to shuffle so the upperband and lowerband change places so I later can extract the opposites in: sumHigh
and sumLow
for sum64b
I have put a lot of comments in the code also so I think it is possible to follow the code and see there also exactly how I try to shuffle and complete the code.
(The code shows also that my AMD K10 processor supports: Sse, Sse2, Sse3)
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
private void button2_Click(object sender, EventArgs e)
{
//This shows what is supported on my processor. However it seems that I could use something from "Avx" anyway
bool avx = Avx.IsSupported; //false
bool avx2 = Avx2.IsSupported; //false
bool sse = Sse.IsSupported; //true
bool sse2 = Sse2.IsSupported; //true
bool sse3 = Sse3.IsSupported; //true
bool ssse3 = Ssse3.IsSupported; //false
bool sse41 = Sse41.IsSupported; //false
bool sse42 = Sse42.IsSupported; //false
//Create a bytearray of 16 indexes. As seen: '0' occur 2 times in the upper band and 3 times in the lower band
//We want to count those "0" in the below code
byte[] v1 = new byte[16];
v1[0] = 0; v1[1] = 0; v1[2] = 1; v1[3] = 1; v1[4] = 1; v1[5] = 1; v1[6] = 1; v1[7] = 1;
v1[8] = 1; v1[9] = 0; v1[10] = 0; v1[11] = 0; v1[12] = 1; v1[13] = 1; v1[14] = 1; v1[15] = 1;
Vector128<byte> counts = Vector128<byte>.Zero;
unsafe
{
fixed (byte* fixedInput = v1)
{
//Load byte Vector with 16 indexes
var v = Avx.LoadVector128(&fixedInput[0]);
//Now match how many "0" we can find in "Vector128: v". 'counts' show the result string where: '1' tells where we found: "0".
//As seen it happened as expected total times: 5 (2 times in the upper band and 3 times in the lower band of the Vector)
byte val = 0;
var match = Avx.CompareEqual(v, Vector128.Create(val));
counts = Avx.Subtract(counts, match); //counts: <1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0>
//Extract high/low bands
//So we use "SumAbsoluteDifferences" to "Separately sum the 8 low differences and 8 high differences to produce two unsigned word integer results."
//We can see on index 0: 2 and on index 4: 3
Vector128<ushort> sum64 = Vector128<ushort>.Zero;
sum64 = Sse2.Add(sum64, Sse2.SumAbsoluteDifferences(counts, Vector128<byte>.Zero)); //sum64: <2,0,0,0,3,0,0,0>
//I AM NOT SURE OF THE CODE BELOW HOW TO DO IT PROPERLY!
//Now I need to shuffle the above: "<2,0,0,0,3,0,0,0>" but are not sure of how the complete process is to do this correctly?
//Below is a start of an "attempt" but are not sure how to do this all the way correctly?
Vector128<uint> result = Sse2.Shuffle(sum64.AsUInt32(), 0xB1);
//Extract high/low bands from ther shuffle above?
//Vector128<uint> sum64b = Vector128<uint>.Zero;
//sum64b = Sse2.Add(sum64b, result);
//sumHigh = Sse2.Extract(sum64b, 1); //0
//sumLow = Sse2.Extract(sum64b, 0); //
}
}
}
Using 16-bit extracts would be possible but not usable for larger counts.
var sumHigh = Sse2.Extract(sum64, 4); // pextrw
var sumLow = Sse2.Extract(sum64, 0); //sumHigh == 3 and sumLow == 2
var sumScalar = SumLow + sumHigh;
Note from @PeterCordes: the real use-case would loop to add up to 255 vectors into counts
, then in an outer loop accumulate into wide elements in sum64
with Sse2.SumAbsoluteDifferences
and Sse2.Add
, and reset counts
. That part looks correct in this C# port, except that sum64
should not use ushort
elements.
The part this question is asking about is the horizontal sum of two 64-bit vector elements down to one scalar integer. (The real use-case has three vectors of counts, from 3 histogram buckets; a transpose and sum could work but simply doing separate horizontal sums for each vector is fine.)