4

In C# on .NET Core, I am seeking the fastest method to check whether a given ushort value is present within a Span<ushort> range. The naive option consists of enumerating the span, but I strongly suspect that a much faster single-core option exists through SIMD (i.e. SSE or AVX).

What is fastest option here? (unsafe code is OK)

Timo
  • 7,992
  • 4
  • 49
  • 67
Joannes Vermorel
  • 8,976
  • 12
  • 64
  • 104
  • Can you use some other structure, like `Hashset`? – Matěj Pokorný Nov 16 '18 at 18:04
  • 1
    Yes, SSE2 has `pcmpeqw` which lets you check 16 bytes at a time in your linear search. You can use it the same way a vectorized `memchr` loop is implemented (using `pmovmskb` to get an integer result). AVX2 has a 32-byte version of the same instruction. Various possible optimizations include checking a whole cache-line before branching (combining pcmpeq results with an OR). I don't know C#, but presumably that's a good starting point for manually vectorizing, if your compiler/runtime doesn't autovectorize for you even if you get to an alignment boundary, etc. – Peter Cordes Nov 16 '18 at 19:33
  • there's a trick to [determine if a word has a byte equal to n](https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord), but that probably won't be faster than SIMD – phuclv Mar 06 '19 at 13:22

1 Answers1

4

A basic implementation (before applying optimizations such as the ones described by Peter in the comments) might work like this:

static unsafe bool ContainsUshort(Span<ushort> data, ushort val)
{
    int vecSize = Vector<ushort>.Count;
    var value = new Vector<ushort>(val);
    int i;
    fixed (ushort* ptr = &data[0])
    {
        int limit = data.Length - vecSize;
        for (i = 0; i <= limit; i += vecSize)
        {
            var d = Unsafe.ReadUnaligned<Vector<ushort>>(ptr + i);
            if (Vector.EqualsAny(d, value))
                return true;
        }
    }
    for (; i < data.Length; i++)
    {
        if (data[i] == val)
            return true;
    }
    return false;
}

This requires the System.Runtime.CompilerServices.Unsafe package for the unsafe read, without that creating a vector from a span (or array too) is much less efficient. By the way the EqualsAny intrinsic is realized with (v)ptest instead of (v)pmovmskb, ptest typically costs more µops so it is comparatively more important to minimize its impact - but since there is no direct access to ptest or pmovmskb (this limitation can be avoided by using the newer platform-specific System.Runtime.Intrinsics.X86 API) the final "vector to condition" AFAIK still has to be done with Vector.EqualsAny (with a vector filled with 0xFFFF) which is a little silly.. nevertheless it was a little faster on my machine (tested such that the return value will be false, so the slightly earlier exit of the non-unrolled version didn't come into play)

var allSet = new Vector<ushort>(0xFFFF);
int limit = data.Length - vecSize * 2;
for (i = 0; i <= limit; i += vecSize * 2)
{
    var d0 = Unsafe.ReadUnaligned<Vector<ushort>>(ptr + i);
    var d1 = Unsafe.ReadUnaligned<Vector<ushort>>(ptr + i + vecSize);
    var eq = Vector.Equals(d0, value) | Vector.Equals(d1, value);
    if (Vector.EqualsAny(eq, allSet))
        return true;
}
harold
  • 61,398
  • 6
  • 86
  • 164
  • `System.Runtime.Intrinsics.X86` has `pmovmskb`. e.g. the AVX2 version: https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.movemask?view=netcore-3.1 – Peter Cordes Jul 30 '22 at 20:06
  • @PeterCordes yes I suppose I should modernize this answer for the modern vector API (which didn't exist in 2018 unfortunately), but it would take effort, so I don't know if I will – harold Jul 30 '22 at 21:59
  • Maybe just add a note that it's out of date, or my comment might be enough. You might be interested in answering [Find index of unaligned int or long in byte array using SIMD](https://stackoverflow.com/q/73178056) ; I didn't find an existing duplicate with C# intrinsics. Unfortunately the OP has no clue what they're doing with intrinsics so discussing what would be efficient in asm didn't seem to help them. – Peter Cordes Jul 30 '22 at 22:02