
Suppose I have a byte[] and want to check whether all bytes are zeros. A for loop is an obvious way to do it, and LINQ All() is a fancy way to do it, but the highest performance is critical.

How can I use Mono.Simd to speed up checking whether a byte array is full of zeroes? I am looking for a cutting-edge approach, not merely a correct solution.
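
For reference, the two straightforward approaches mentioned above might look like this (a minimal sketch; the method names are made up for illustration):

using System.Linq;

static bool IsAllZerosFor (byte[] data)
{
    // obvious way: plain for loop, bail out at the first non-zero byte
    for (int i = 0; i < data.Length; i++)
        if (data [i] != 0)
            return false;
    return true;
}

static bool IsAllZerosLinq (byte[] data)
{
    // fancy way: LINQ All(), concise but it calls a delegate per element
    return data.All (b => b == 0);
}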

ArekBulski
  • When doing performance tests on .NET apps, you should make sure to run a few times and skip the first one because the JIT can get involved there. If you want to talk about the absolute fastest performance, then you should probably specify the hardware, too... Using BenchmarkDotNet to run your different candidates and report the results would be ideal, because it makes sure to run the candidates in a way that's as accurate as we can get, and its output includes the run parameters like hardware, GC mode, etc. – Joe Amenta Jul 17 '17 at 12:12
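
Along the lines of that comment, a minimal BenchmarkDotNet harness might look like the following sketch (the class name and buffer size are illustrative, and it requires the BenchmarkDotNet NuGet package); each candidate from the answers below would get its own [Benchmark] method:

using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ZeroCheckBenchmarks
{
    byte[] data;

    [GlobalSetup]
    public void Setup () => data = new byte[256 * 1024 * 1024];   // 256 MB of zeroes, as in the answer below

    [Benchmark (Baseline = true)]
    public bool ForeachOverBytes ()
    {
        foreach (byte b in data)
            if (b != 0)
                return false;
        return true;
    }

    [Benchmark]
    public bool LinqAll () => data.All (b => b == 0);

    // one [Benchmark] method per candidate (fixed long*, Mono.Simd, ...) would go here
}

class Program
{
    static void Main () => BenchmarkRunner.Run<ZeroCheckBenchmarks> ();
}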

2 Answers

6

The best code is presented below. Other methods and the time-measuring code are available in the full source.

static unsafe bool BySimdUnrolled (byte[] data)
{
    fixed (byte* bytes = data) {
        int len = data.Length;
        int rem = len % (16 * 16);
        Vector16b* b = (Vector16b*)bytes;
        Vector16b* e = (Vector16b*)(bytes + len - rem);
        Vector16b zero = Vector16b.Zero;

        while (b < e) {
            if ((*(b) | *(b + 1) | *(b + 2) | *(b + 3) | *(b + 4) |
                *(b + 5) | *(b + 6) | *(b + 7) | *(b + 8) |
                *(b + 9) | *(b + 10) | *(b + 11) | *(b + 12) | 
                *(b + 13) | *(b + 14) | *(b + 15)) != zero)
                return false;
            b += 16;
        }

        for (int i = 0; i < rem; i++)
            if (data [len - 1 - i] != 0)
                return false;

        return true;
    }
}

Eventually it was beaten by this code:

static unsafe bool ByFixedLongUnrolled (byte[] data)
{
    fixed (byte* bytes = data) {
        int len = data.Length;
        int rem = len % (sizeof(long) * 16);
        long* b = (long*)bytes;
        long* e = (long*)(bytes + len - rem);

        while (b < e) {
            if ((*(b) | *(b + 1) | *(b + 2) | *(b + 3) | *(b + 4) |
                *(b + 5) | *(b + 6) | *(b + 7) | *(b + 8) |
                *(b + 9) | *(b + 10) | *(b + 11) | *(b + 12) | 
                *(b + 13) | *(b + 14) | *(b + 15)) != 0)
                return false;
            b += 16;
        }

        for (int i = 0; i < rem; i++)
            if (data [len - 1 - i] != 0)
                return false;

        return true;
    }
}

Time measurements (on a 256 MB array):

LINQ All(b => b == 0)                   : 6350,4185 ms
Foreach over byte[]                     : 580,4394 ms
For with byte[].Length property         : 809,7283 ms
For with Length in local variable       : 407,2158 ms
For unrolled 16 times                   : 334,8038 ms
For fixed byte*                         : 272,386 ms
For fixed byte* unrolled 16 times       : 141,2775 ms
For fixed long*                         : 52,0284 ms
For fixed long* unrolled 16 times       : 25,9794 ms
SIMD Vector16b equals Vector16b.Zero    : 56,9328 ms
SIMD Vector16b also unrolled 16 times   : 32,6358 ms

Conclusions:

  • Mono.Simd has only a limited set of instructions. I found no instructions for computing a scalar sum(vector) or max(vector). There is, however, a vector equality operator that returns bool (a sketch that uses it follows this list).
  • Loop unrolling is a powerful technique. Even the fastest code benefits a lot from it.
  • LINQ is ungodly slow because it invokes a delegate (compiled from the lambda expression) for every element. If you need cutting-edge performance, that is clearly not the way to go.
  • All methods presented use short-circuit evaluation, meaning they return as soon as they encounter a non-zero value.
  • The SIMD code was eventually beaten. There are other questions on SO disputing whether SIMD actually makes things faster.
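
For comparison, the non-unrolled SIMD variant from the table above ("SIMD Vector16b equals Vector16b.Zero") looks roughly like this; it is a sketch reconstructed from the unrolled version, not the measured code from the full source:

static unsafe bool BySimdEquals (byte[] data)
{
    fixed (byte* bytes = data) {
        int len = data.Length;
        int rem = len % 16;                            // bytes not covered by whole vectors
        Vector16b* b = (Vector16b*)bytes;
        Vector16b* e = (Vector16b*)(bytes + len - rem);
        Vector16b zero = Vector16b.Zero;

        while (b < e) {
            if (*b != zero)                            // vector != vector returns bool
                return false;
            b++;
        }

        for (int i = 0; i < rem; i++)                  // scalar tail
            if (data [len - 1 - i] != 0)
                return false;

        return true;
    }
}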

Posted this code on Peer Review; so far two bugs have been found and fixed.

ArekBulski
  • This assumes your array is 16*N in length, a big assumption, but it could be valid in controlled environments. Also, from your times on BySimdEquals I would strongly assume that you are not running this with O=simd and are thus getting non-SIMD (O=-simd) times(?), which really do not improve the execution times by that much. Coding this in C and P/Invoking a GC-pinned array would be faster. – SushiHangover Oct 23 '15 at 06:42
  • The unrolled version is indeed faster, but unrolling the loop 2 times (thus comparing only 2 x 8 bytes each loop) gives similar (if not better) performance as 16 times on my machine. It makes sense when you know most x64 machines have only two 64-bit data channels (if you have 2 memory sticks and they are installed in the proper slots). Memory read is probably the biggest bottleneck here. – tigrou Aug 25 '20 at 12:23
1

The scalar implementation processes longs, i.e. 64 bits (8 bytes) at a time, and gets most of its speedup from this parallelism, which is powerful.

The SIMD code in the answer above uses 128-bit (16-byte) SSE vectors. With the newer 256-bit (32-byte) AVX2 vectors, the SIMD implementation is about 10% faster. With the 512-bit (64-byte) AVX-512 vectors in the latest processors, a SIMD implementation should be even faster.

    // Requires: using System.Numerics; (Vector<T> and Vector.EqualsAll)
    private static bool ZeroDetectSseInner(this byte[] arrayToOr, int l, int r)
    {
        var zeroVector = new Vector<byte>(0);
        int concurrentAmount = 4;
        int sseIndexEnd = l + ((r - l + 1) / (Vector<byte>.Count * concurrentAmount)) * (Vector<byte>.Count * concurrentAmount);
        int i;
        int offset1 = Vector<byte>.Count;
        int offset2 = Vector<byte>.Count * 2;
        int offset3 = Vector<byte>.Count * 3;
        int increment = Vector<byte>.Count * concurrentAmount;
        for (i = l; i < sseIndexEnd; i += increment)
        {
            var inVector  = new Vector<byte>(arrayToOr, i          );
            inVector     |= new Vector<byte>(arrayToOr, i + offset1);
            inVector     |= new Vector<byte>(arrayToOr, i + offset2);
            inVector     |= new Vector<byte>(arrayToOr, i + offset3);
            if (!Vector.EqualsAll(inVector, zeroVector))
                return false;
        }
        byte overallOr = 0;
        for (; i <= r; i++)
            overallOr |= arrayToOr[i];
        return overallOr == 0;
    }

    public static bool ZeroValueDetectSse(this byte[] arrayToDetect)
    {
        return arrayToDetect.ZeroDetectSseInner(0, arrayToDetect.Length - 1);
    }

An improved version (thanks to Peter's suggestion) is shown in the code above; it is safe (no unsafe code) and has been integrated into the HPCsharp NuGet package, giving a 20% speedup over the unrolled scalar implementation when 256-bit (32-byte) vectors are used.
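
The width Vector<byte> actually uses is chosen by the JIT for the machine it runs on and is easy to check; here is a small usage sketch (it assumes the extension methods above live in a static class that is in scope):

    // Vector<T> comes from System.Numerics; the JIT chooses its width per machine.
    Console.WriteLine (Vector.IsHardwareAccelerated);   // false means a scalar fallback runs
    Console.WriteLine (Vector<byte>.Count);             // e.g. 16 with SSE2, 32 with AVX2

    byte[] buffer = new byte[1 << 20];                  // 1 MB test buffer, all zero when freshly allocated
    Console.WriteLine (buffer.ZeroValueDetectSse ());   // True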

DragonSpit
  • Why would you `|=` into an accumulator but still check that accumulator every iteration? It makes sense to `|=` a cache line or two of vectors together for one `pcmpeqb` / `pmovmskb` / `test` / `jnz` loop break condition. But you want to start a fresh `orVector` after finding the previous was all zero, breaking the dependency chain. If this compiles as written, it's limited to 1 vector per cycle at best (data dependency through `orVector`), not 2x 16-byte loads per clock like modern x86 can do (AMD since K10, Intel since Sandybridge), or 2x 32-byte loads per clock on Haswell and later. – Peter Cordes Jun 16 '19 at 16:45
  • Yeah, this was the initial implementation to show that the scalar unrolled loop can be beaten. The follow on implementation uses several independent orVectors within the loop to break dependencies. Short-circuit testing them for non-zero without introducing dependencies is the struggle. Would love to see suggestions. – DragonSpit Jun 16 '19 at 17:15
  • At the top of a manually-unrolled loop in the source, do `orVector = new Vector(arrayToOr, i);` then for later vectors do `orVector |= new Vector(arrayToOr, i+1);`, `... i + 2`, etc. At the bottom of this loop, test it for non-zero. – Peter Cordes Jun 16 '19 at 18:16
  • Thank you for the explanation - it helps a lot and made it concrete. This method brought another 10% speedup, for a total gain of 20% with the safe C# 256-bit (32-byte) implementation over the unrolled scalar implementation above. Could you elaborate further on how this method does not have dependencies, since it ORs all of the read Vectors into the single orVector? I implemented separate/multiple orVectors, but saw only a minimal performance gain over your suggestion. – DragonSpit Jun 16 '19 at 23:07
  • It has a dependency *within* the body of a single loop iteration, but breaks dependencies *across* loop iterations. So there's no loop-carried dependency (except the pointer-increment). This lets out-of-order execution overlap the work in separate iterations because they're separate dependency chains. (Branch prediction / speculative execution avoids waiting for the control dependency of the conditional branch out of the loop based on load+ALU results, i.e. branch speculation breaks data dependencies.) – Peter Cordes Jun 16 '19 at 23:57
  • If you have an improved version, you should put that in your answer instead of leaving it with this sub-optimal version. At *least* link to the improved version. – Peter Cordes Jun 27 '19 at 07:25
  • Sorry for not posting the latest version, especially since the library is open source. The latest has now been posted. – DragonSpit Jun 27 '19 at 11:40
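
For completeness, the separate-accumulator variant discussed in the comments above might look roughly like the following sketch (not the HPCsharp code; the method name is made up):

    // Two independent OR chains per iteration; they only merge for the single test at the
    // bottom, so out-of-order execution can run the two chains in parallel.
    private static bool ZeroDetectTwoChains(this byte[] arrayToOr, int l, int r)
    {
        var zeroVector = new Vector<byte>(0);
        int step = Vector<byte>.Count * 4;
        int vectorEnd = l + ((r - l + 1) / step) * step;
        int i;
        for (i = l; i < vectorEnd; i += step)
        {
            var chain0  = new Vector<byte>(arrayToOr, i);
            chain0     |= new Vector<byte>(arrayToOr, i + Vector<byte>.Count);
            var chain1  = new Vector<byte>(arrayToOr, i + Vector<byte>.Count * 2);
            chain1     |= new Vector<byte>(arrayToOr, i + Vector<byte>.Count * 3);
            if (!Vector.EqualsAll(chain0 | chain1, zeroVector))
                return false;
        }
        byte overallOr = 0;                              // scalar tail, as above
        for (; i <= r; i++)
            overallOr |= arrayToOr[i];
        return overallOr == 0;
    }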