-1

I posted a method checking a byte[] for being all zeroes. Mono.Simd offers so high performance that I am wondering if this is even possible.

static unsafe bool IsAllZeros(byte[] data)
{
    fixed (byte* bytes = data) {
        int len = data.Length;
        int rem = len % (16*16);
        Vector16b* b = (Vector16b*)bytes;
        Vector16b* e = b + len / (16*16);
        Vector16b zero = Vector16b.Zero;

        while (b < e) {
            if ((*(b)|*(b+1)|*(b+2)|*(b+3)|*(b+4)|*(b+5)|*(b+6)|*(b+7)|*(b+8)|
             *(b+9)|*(b+10)|*(b+11)|*(b+12)|*(b+13)|*(b+14)|*(b+15)) != zero)
                return false;
            b += 16;
        }

        for (int i = 0; i < rem; i++)
            if (data [len - 1 - i] != 0)
                return false;

        return true;
    }
}

Code above processes 256 MB in 2,6477 ms giving 94 GB/s. Is this even possible?

My DDR2 memory has 800 Mhz frequency. Wikipedia gives a formula for a theoretical maximum bandwidth as 800M*2*64*2 = 25 GB/s.

Community
  • 1
  • 1
ArekBulski
  • 4,520
  • 4
  • 39
  • 61
  • You are only processing 1/16 of the array. – Ben Voigt Oct 24 '15 at 00:27
  • Also is `b += 16` correct ? Shouldn't it be `b++` ? – Paul R Oct 24 '15 at 06:50
  • Loop processes 16 vectors in one iteration, that is why it moves 16. By the way, you should have posted it on code review page instead. https://codereview.stackexchange.com/questions/108470/sse-instruction-to-check-if-byte-array-is-zeroes-c – ArekBulski Oct 24 '15 at 15:10

1 Answers1

3

It's good that you did the bandwidth calculation, because it exposed a major bug in your algorithm -- the loop ends when it reaches e, which is only 1/16th of the way through the input.

The actual maximum theoretical bandwidth in your system is just a bit less than 12.8 GB/s (DDR2/6400 rating, times two channels, minus a few cycles that the DRAM is busy doing refresh and can't be accessed). This differs from the calculation in your question because you used a factor of two for DDR, and applied it to a number that already included that factor.

The bandwidth of your algorithm is 16MB per 2.65 ms or 6.0 GB/s (assuming that no non-zero elements were found, so the entire array needed to be scanned), about half the theoretical limit. Not bad at all for untuned C#, even with the Mono-SIMD weaver.

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720