I have a task to split multiple runs of varbinary(8000) columns from a database table on the binary literal 0x0. However, this might change, so I'd like to keep the delimiter variable. I'd like to use SQLCLR to do this quickly as a streaming table-valued function. I know my strings are always going to be at least a few thousand bytes each.
Edit: I've updated my algorithm to avoid the nastiness of the inner-loop unrolling, but it's extremely hard to convince the CLR to make the right choices about register allocation. It would be awesome if there were an easy way to convince the CLR that j and i are REALLY the same thing, but instead it does really dumb things. It would also be nice to optimize the first, partial loop, but you can't use a goto to jump into a loop.
I decided to adapt a 64-bit implementation of the C function memchr. Basically, instead of scanning and comparing one byte at a time, I scan 8 bytes at a time using some bit twiddling. For reference, Array.IndexOf&lt;Byte&gt; does something similar with a 4-byte scan to find a single answer; I just want to keep going after each match.
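To show the bit twiddling in isolation, here is a standalone sketch of the trick (SwarDemo and ContainsByte are just illustrative names, not part of my actual code): broadcast the delimiter into all 8 byte lanes of a ulong, XOR it against the input word so matching bytes become zero, then apply the classic "has a zero byte" test.

```csharp
using System;

static class SwarDemo
{
    // Returns true if any byte of the word equals the delimiter.
    static bool ContainsByte(ulong word, byte delimiter)
    {
        // Broadcast the delimiter into all 8 byte lanes.
        ulong comparer = delimiter * 0x0101010101010101UL;
        ulong t1 = word ^ comparer; // matching bytes become 0x00
        // "Has zero byte" test: the high bit of a lane ends up set only if
        // that lane was zero, or sits above the lowest zero lane (borrow
        // propagation), so the lowest set flag always marks a true match.
        ulong t2 = (t1 - 0x0101010101010101UL) & ~t1;
        return (t2 & 0x8080808080808080UL) != 0;
    }

    static void Main()
    {
        // 0x00 sits in the second-lowest byte lane of this word.
        Console.WriteLine(ContainsByte(0x11223344556600FFUL, 0x00)); // True
        Console.WriteLine(ContainsByte(0x1122334455667788UL, 0x00)); // False
    }
}
```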
There are a couple of things to note:

- Memory pressure is a very real problem in SQLCLR functions. String.Split is out because it allocates a lot of memory up front that I'd really like to avoid. It also works on UCS-2 strings, which would require me to convert my ASCII strings to Unicode strings and thus treat my data as a LOB datatype on the return (SqlChars/SqlString can only return 4000 bytes before being turned into a LOB type).
- I'd like to stream. Another reason to avoid String.Split is that it does its work all at once, creating lots of memory pressure. On data with lots of delimiters, pure T-SQL methods will start beating it.
- I'd like to keep it "Safe", so all managed. There seems to be a very large penalty in a security check otherwise.
- Buffer.BlockCopy is really fast, and it seems better to pay that cost up front once than to constantly pay the cost of BitConverter. This is also still cheaper than converting my input to a string and holding on to that reference.

The code is quite fast, but it seems that I'm paying for quite a few bounds checks, in the initial loop and in the critical section when I find a match. As a result, for data with lots of delimiters, I tend to lose to a simpler C# enumerator that just does a byte-by-byte comparison.
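One detail worth calling out about the Buffer.BlockCopy approach: its offsets and counts are byte offsets even when an array's element type is wider, which is what lets me copy into a ulong[] up front and later slice bytes straight back out of it. A quick standalone illustration (little-endian assumed, as on x86/x64):

```csharp
using System;

static class BlockCopyDemo
{
    static void Main()
    {
        byte[] bytes = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        // Round up to a whole number of ulongs; the tail is zero-padded.
        ulong[] longs = new ulong[(bytes.Length + 7) / 8];
        Buffer.BlockCopy(bytes, 0, longs, 0, bytes.Length);

        // Offsets into the ulong[] are byte offsets, so we can slice the
        // original bytes back out without touching the byte[] again.
        byte[] slice = new byte[4];
        Buffer.BlockCopy(longs, 3, slice, 0, 4);
        Console.WriteLine(string.Join(",", slice)); // 4,5,6,7
    }
}
```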
Here is my code:
using System;
using System.Collections;

class SplitBytesEnumeratorA : IEnumerator
{
    // Fields
    private readonly byte[] _bytes;
    private readonly ulong[] _longs;
    private readonly ulong _comparer;
    private readonly Record _record = new Record();
    private int _start;
    private readonly int _length;

    // Methods
    internal SplitBytesEnumeratorA(byte[] bytes, byte delimiter)
    {
        this._bytes = bytes;
        this._length = bytes.Length;
        // Round up to a whole number of ulongs so that we can avoid a
        // spillover scan near the end. In an unsafe implementation this
        // would be dangerous, as we would potentially read more bytes
        // than we should.
        this._longs = new ulong[(_length + 7) / 8];
        Buffer.BlockCopy(bytes, 0, _longs, 0, _length);
        var c = ((ulong)delimiter << 8) + delimiter;
        c = (c << 16) + c;
        c |= (c << 32);
        // comparer is now 8 copies of the original delimiter.
        this._comparer = c;
    }

    public bool MoveNext()
    {
        if (this._start >= this._length) return false;
        int i = this._start;
        var longs = this._longs;
        var comparer = this._comparer;
        var record = this._record;
        record.id++;
        // Handle the case where start is not divisible by eight.
        for (; (i & 7) != 0; i++)
        {
            if (i == _length || _bytes[i] == (comparer & 0xFF))
            {
                record.item = new byte[i - _start];
                Buffer.BlockCopy(_bytes, _start, record.item, 0, i - _start);
                _start = i + 1;
                return true;
            }
        }
        // Main loop. We crawl the array 8 bytes at a time.
        for (int j = i / 8; j < longs.Length; j++)
        {
            ulong t1 = longs[j];
            unchecked
            {
                t1 ^= comparer;
                ulong t2 = (t1 - 0x0101010101010101) & ~t1;
                if ((t2 & 0x8080808080808080) != 0)
                {
                    i = j * 8;
                    // Make every case 3 comparisons instead of n. Potentially better.
                    // This is an unrolled binary search for the lowest flagged byte.
                    if ((t2 & 0x80808080) == 0)
                    {
                        i += 4;
                        t2 >>= 32;
                    }
                    if ((t2 & 0x8080) == 0)
                    {
                        i += 2;
                        t2 >>= 16;
                    }
                    if ((t2 & 0x80) == 0)
                    {
                        i++;
                    }
                    // A "match" at or past _length can only come from the zero
                    // padding of the final ulong (which happens when the
                    // delimiter is 0x00); treat it as no match at all.
                    if (i >= _length) break;
                    record.item = new byte[i - _start];
                    // Improve cache locality by not switching collections.
                    Buffer.BlockCopy(longs, _start, record.item, 0, i - _start);
                    _start = i + 1;
                    return true;
                }
            }
            // No match in this ulong; fall through and advance by 8 bytes.
        }
        // No matches left. Return the remaining buffer.
        record.item = new byte[_length - _start];
        Buffer.BlockCopy(longs, _start, record.item, 0, _length - _start);
        _start = _bytes.Length;
        return true;
    }

    void IEnumerator.Reset()
    {
        throw new NotImplementedException();
    }

    public object Current
    {
        get { return this._record; }
    }
}

// We use a class instead of a struct so that Current does not box the record.
class Record
{
    internal int id;
    internal byte[] item;
}
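For context, this is roughly the SQLCLR plumbing the enumerator plugs into; the SplitBytes and FillSplitRow names and the varbinary(1) delimiter parameter are illustrative, not from my actual deployment. SQL Server pulls one record at a time through the fill-row method, which is what makes the function streaming:

```csharp
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [SqlFunction(FillRowMethodName = "FillSplitRow",
                 TableDefinition = "id int, item varbinary(8000)")]
    public static IEnumerator SplitBytes(SqlBytes input, SqlBytes delimiter)
    {
        if (input.IsNull) return null;
        // Assumes the delimiter is passed as a single-byte varbinary(1).
        return new SplitBytesEnumeratorA(input.Value, delimiter.Value[0]);
    }

    public static void FillSplitRow(object obj, out SqlInt32 id, out SqlBytes item)
    {
        var record = (Record)obj;
        id = record.id;
        item = new SqlBytes(record.item);
    }
}
```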