
I have a task to split multiple runs of varbinary(8000) columns from a database table on the binary literal 0x0. However, the delimiter might change, so I'd like to keep it variable. I'd like to use SQLCLR to do this quickly as a streaming table-valued function. I know my strings are always going to be at least a few thousand bytes each.

Edit: I've updated my algorithm to avoid the nastiness of the inner loop unrolling, but it's extremely hard to convince the CLR to make the right choices about register allocation. It would be awesome if there were an easy way to convince the CLR that `j` and `i` are REALLY the same thing, but instead it does really dumb things. It would be nice to optimize the first-pass loop, but you can't use a goto to jump into a loop.

I decided to adapt a 64-bit implementation of the C function `memchr`. Instead of scanning and comparing one byte at a time, I scan 8 bytes at a time using some bit twiddling. For reference, `Array.IndexOf<Byte>` does something similar with a 4-byte scan to find a single match; I just want to keep doing it across the whole buffer.
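In isolation, the word-at-a-time test looks something like this (a minimal standalone sketch of the technique, not my actual code; `ContainsDelimiter` is just an illustrative name):

// SWAR test: does this 8-byte word contain the delimiter anywhere?
static bool ContainsDelimiter(ulong word, byte delimiter)
{
    unchecked
    {
        ulong mask = delimiter * 0x0101010101010101UL; // 8 copies of the delimiter
        ulong t = word ^ mask;                         // matching bytes become 0x00
        // classic zero-byte test: the high bit of each zero byte lights up
        return ((t - 0x0101010101010101UL) & ~t & 0x8080808080808080UL) != 0;
    }
}

There are a couple of things to note: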

  1. Memory pressure is a very real problem in SQLCLR functions. `String.Split` is out because it allocates a lot of memory up front that I'd really like to avoid. It also works on UCS-2 strings, which would require me to convert my ASCII strings to Unicode strings and thus return my data as a LOB type. (`SqlChars`/`SqlString` can only return 4000 characters before being turned into a LOB type.)

  2. I'd like to stream. Another reason to avoid `String.Split`: it does its work all at once, creating lots of memory pressure. On data with lots of delimiters, pure T-SQL methods will start beating it. (The streaming TVF wiring I have in mind is sketched after this list.)

  3. I'd like to keep it `SAFE`, so all managed code. There seems to be a very large penalty from the security check otherwise.
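For context, the enumerator below is meant to plug into the standard SQLCLR streaming TVF pattern, roughly as follows. This is a sketch, not my deployed code: `SplitBytes` and `FillRow` are illustrative names, and it assumes `SplitBytesEnumeratorA` is also made to implement `IEnumerable` (returning itself from `GetEnumerator`), since SQL Server binds a TVF to a method returning `IEnumerable` and pulls rows one at a time:

using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [SqlFunction(FillRowMethodName = "FillRow",
                 TableDefinition = "id int, item varbinary(8000)")]
    public static IEnumerable SplitBytes(SqlBytes input, SqlByte delimiter)
    {
        // Each row is produced lazily by MoveNext; nothing is materialized up front.
        return new SplitBytesEnumeratorA(input.Value, delimiter.Value);
    }

    // Called once per row to turn the current Record into output columns.
    public static void FillRow(object obj, out SqlInt32 id, out SqlBytes item)
    {
        var record = (Record)obj;
        id = new SqlInt32(record.id);
        item = new SqlBytes(record.item);
    }
}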

`Buffer.BlockCopy` is really fast, and it seems better to pay that cost once up front than to constantly pay the cost of `BitConverter`. It's also still cheaper than converting my input to a string and holding on to that reference.
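To make that trade-off concrete: reading eight bytes at a time with `BitConverter` pays a method call (plus bounds checks) on every read, whereas one up-front `Buffer.BlockCopy` turns every subsequent read into a plain indexed load. A sketch only, assuming a local `bytes` array:

// Option A: per-read BitConverter, a call + checks on every iteration.
for (int j = 0; j + 8 <= bytes.Length; j += 8)
{
    ulong word = BitConverter.ToUInt64(bytes, j);
    // ... SWAR test on word goes here ...
}

// Option B: pay once up front, then index a ulong[] directly.
var longs = new ulong[(bytes.Length + 7) / 8];
Buffer.BlockCopy(bytes, 0, longs, 0, bytes.Length);
for (int j = 0; j < longs.Length; j++)
{
    ulong word = longs[j];
    // ... same test ...
}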

The code is quite fast, but it seems I'm paying for quite a few bounds checks, both in the initial alignment loop and in the critical section when I find a match. As a result, on data with lots of delimiters I tend to lose to a simpler C# enumerator that just does a per-byte comparison (sketched below).
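For reference, the rival's core loop is roughly this (my reconstruction, assuming the obvious `_bytes`/`_length`/`_delimiter`/`_record`/`_start` fields; it is not code from this post):

// Simple per-byte enumerator: scan for the delimiter one byte at a time.
// On delimiter-dense data its MoveNext does so little work per field that
// the word-at-a-time setup below stops paying for itself.
public bool MoveNext()
{
    if (_start >= _length) return false;
    int i = _start;
    while (i < _length && _bytes[i] != _delimiter) i++;
    _record.id++;
    _record.item = new byte[i - _start];
    Buffer.BlockCopy(_bytes, _start, _record.item, 0, i - _start);
    _start = i + 1;
    return true;
}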

Here is my code:

using System;
using System.Collections;

class SplitBytesEnumeratorA : IEnumerator
{
    // Fields
    private readonly byte[] _bytes;
    private readonly ulong[] _longs;
    private readonly ulong _comparer;
    private readonly Record _record = new Record();
    private int _start;
    private readonly int _length;

    // Methods
    internal SplitBytesEnumeratorA(byte[] bytes, byte delimiter)
    {
        this._bytes = bytes;
        this._length = bytes.Length;
        // Round the ulong buffer up to the next multiple of 8 bytes so the
        // main loop never needs a spillover scan near the end; the extra
        // bytes are zero padding. In an unsafe implementation the equivalent
        // over-read would be dangerous, as we would potentially read more
        // bytes than we should.

        this._longs = new ulong[(_length + 7) / 8];
        Buffer.BlockCopy(bytes, 0, _longs, 0, _length);
        var c = (((ulong)delimiter << 8) + (ulong)delimiter);
        c = (c << 16) + c;
        c |= (c << 32);
        // comparer is now 8 copies of the original delimiter byte.
        this._comparer = c;
    }

    public bool MoveNext()
    {
        if (this._start >= this._length) return false;
        int i = this._start;
        var longs = this._longs;
        var comparer = this._comparer;
        var record = this._record;
        record.id++;
        // handle the case where start is not divisible by eight.
        for (; (i & 7) != 0; i++)
        {
            if (i == _length || _bytes[i] == (comparer & 0xFF))
            {
                record.item = new byte[(i - _start)];
                Buffer.BlockCopy(_bytes, _start, record.item, 0, i - _start);
                _start = i + 1;
                return true;
            }
        }

        // Main loop: crawl the array 8 bytes at a time.
        for (int j = i / 8; j < longs.Length; j++)
        {
            ulong t1 = longs[j];
            unchecked
            {
                t1 ^= comparer;
                ulong t2 = (t1 - 0x0101010101010101) & ~t1;
                if ((t2 & 0x8080808080808080) != 0)
                {
                    i = j * 8;
                    // Unrolled binary search: find the first matching byte in
                    // three comparisons instead of up to eight.
                    if ((t2 & 0x80808080) == 0)
                    {
                        i += 4;
                        t2 >>= 32;
                    }

                    if ((t2 & 0x8080) == 0)
                    {
                        i += 2;
                        t2 >>= 16;
                    }

                    if ((t2 & 0x80) == 0)
                    {
                        i++;
                    }

                    record.item = new byte[i - _start];
                    // improve cache locality by not switching collections.
                    Buffer.BlockCopy(longs, _start, record.item, 0, i - _start);
                    _start = i + 1;
                    return true;
                }
            }
            // no match in this word; the loop advances another 8 bytes.
        }

        // No matches left: return the remaining buffer.
        record.item = new byte[_length - _start];
        Buffer.BlockCopy(longs, _start, record.item, 0, _length - _start);
        _start = _bytes.Length;
        return true;
    }

    void IEnumerator.Reset()
    {
        throw new NotImplementedException();
    }

    public object Current
    {
        get
        {
            return this._record;
        }
    }
}

// We use a class rather than a struct to avoid boxing when it is returned
// through IEnumerator.Current.
class Record
{
    internal int id;
    internal byte[] item;
}
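If you want to sanity-check the enumerator outside SQL Server, a throwaway harness like this works (illustrative only; wrap it in a `Main` with the usual usings):

var data = new byte[] { 1, 2, 0, 3, 4, 5, 0, 6 };
IEnumerator e = new SplitBytesEnumeratorA(data, 0);
while (e.MoveNext())
{
    var r = (Record)e.Current;
    Console.WriteLine("{0}: {1}", r.id, BitConverter.ToString(r.item));
}
// prints: 1: 01-02, then 2: 03-04-05, then 3: 06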
Michael B
  • possible duplicate of [Array bounds check efficiency in .net 4 and above](http://stackoverflow.com/questions/16713076/array-bounds-check-efficiency-in-net-4-and-above) – ClickRick May 23 '14 at 22:07
  • I don't know if it's exactly the same. The code emitted when I run in release x64 for the critical loop crawl performs 8 bounds checks on the byte scan (that fun bit of `if(bytes[i] == _delim)`). In fact it generates quite a bit of code to perform the comparison, which is somewhat alarming. I'm wondering if it's worthwhile to revisit with bit twiddling on the byte crawl. – Michael B May 23 '14 at 22:24
  • The .NET JIT is not a good optimizing compiler and it only eliminates bounds checks in the primitive case of a 0-to-length array traversal (and very simple variations). You're out of luck. (Or, use unsafe code. But I have seen that inhibit other optimizations and make things slower). – usr May 24 '14 at 15:04
  • In which case you're into "fast, safe, cheap: pick 2 of the 3" territory. Measure your actual performance and decide whether it's actually unacceptable. The cost of speeding it up might not be justified. – ClickRick May 25 '14 at 10:28
  • Certainly correct. I decided to create more statistics and a more realistic baseline to figure out what my algorithm was really choking on (I/O, as can be expected, was the unbelievably dominant part of the equation). I'm now fighting in the space of hundreds of milliseconds, but it's bothersome that something that should be faster isn't. Full repro here and test code for SQL Server 2008+: https://gist.github.com/mburbea/e72151af503873d82d6f – Michael B May 27 '14 at 19:37
  • have you considered a DynamicMethod and writing your own IL equivalent of assembly language? – Les Jun 02 '14 at 23:12
  • Unfortunately, SQLCLR does not allow emitting dynamic code. But even if I were to precompile the IL, I don't think IL-level access would help either. PEVerify would flag the unsafe behavior (e.g. emitting a `ldelem.i8` on a byte array would cause it to complain), nor, I believe, can you use pointers without being flagged by PEVerify. The JIT would still insert bounds checks before my array accesses. – Michael B Jun 03 '14 at 14:09
  • With all the stuff that is going on there, how come you think that a bounds check is relevant to performance? It's two instructions, one of which is a 100% correctly predicted branch. This is a cost of 0-2 cycles per check. Almost zero. – usr Jul 13 '14 at 19:21

1 Answer


Thinking outside of the box, have you considered converting your strings to XML and using XQuery to do the split?

For example, you could pass in the delimiter and do something like this (air code):

DECLARE @xml as xml
DECLARE @str as varchar(max)
SET @str = (SELECT CAST(t.YourBinaryColumn AS varchar(max)) FROM [tableName] t)
SET @xml = cast(('<X>'+replace(@str,@delimiter,'</X><X>')+'</X>') as xml)

This converts the binary to a string and replaces the delimiter with XML tags. Then:

SELECT N.value('.', 'varchar(10)') as value FROM @xml.nodes('X') as T(N)

will get the individual "elements", the data between each delimiter occurrence.

Maybe this idea could be useful as-is, or as a catalyst to build on.

clairestreb
  • The performance of this approach is much worse than a CLR splitter or even a tally-table splitter. See this post: http://sqlblog.com/blogs/paul_white/archive/2012/09/05/compute-scalars-expressions-and-execution-plan-performance.aspx Plus, that is only one row at a time. I want to be able to process a range of rows at a time. – Michael B Jul 13 '14 at 21:25