-1

I have a file that contains a sorted list of key/value pairs. The key is a fixed-length string, and the value is the binary representation of an integer. So if the key size were 6, each record is exactly 10 bytes (6+sizeof(int)). The key size can be any (reasonable) value but is the same for every record in the file.

For example, the data file might look like this (where 'xxxx' would be binary representations of int values):

ABLE  xxxxBAKER xxxxCARL  xxxxDELTA xxxxEDGAR xxxxGEORGExxxx

I would like to load this table into memory (a byte[]) and binary search it. In C++, this is a snap, the bsearch() function takes an 'element size' parameter that tells it the size of the records in the buffer, but I cannot find a similar function in C#.

I've also looked at Array.BinarySearch, but I can't figure out what I'd use as a T.

Marc Bernier
  • 2,928
  • 27
  • 45
  • 1
    If you want to organize them into an array of `byte[]` (that is, an array of byte arrays), then the `T` parameter in `Array.BinarySearch()` would be `byte[]` – Mathias R. Jessen Nov 30 '21 at 15:24
  • I assume you mean "binary search" in the sense of looking for unsorted binary data, and not in the sense of https://en.wikipedia.org/wiki/Binary_search_algorithm – Matthew Watson Nov 30 '21 at 15:25
  • If the former, then you can use an implementation of Boyer-Moore, such as the sample one I posted here: https://stackoverflow.com/a/37500883/106159 – Matthew Watson Nov 30 '21 at 15:26
  • 1
    @MatthewWatson, no it is sorted. I've added an example of what a file could look like. – Marc Bernier Nov 30 '21 at 15:32
  • In C# this is not a snap, as the way these bytes are laid out doesn't correspond to how managed types are laid out. It is possible to (say) declare a `struct` that more or less wraps accurately over it and use `Span`s to access it, but this at best reduces memory use, and still incurs overhead for decoding the strings and comparing them, so it would not generally win over simply parsing the file -- being "binary" doesn't give it any kind of special status there. – Jeroen Mostert Nov 30 '21 at 15:48
  • Are the key characters subject to any form of cultural differences in handling, like do you have characters like Ü in there, or ò or whatnot? – Lasse V. Karlsen Nov 30 '21 at 15:50
  • @LasseV.Karlsen, for the sake of the example, we can assume it's just ASCII. – Marc Bernier Nov 30 '21 at 16:00
  • Here's a super-unoptimized example. https://dotnetfiddle.net/3nXXGd – Lasse V. Karlsen Nov 30 '21 at 16:40

2 Answers2

1

You should probably use a binary reader to convert your binary data to proper .net types. Since c# is a managed language, and the memory representation of types may depend on the runtime, reading and writing data is generally more involved than just dumping some part of memory to disk.

You mention "fixed-length string", I would assume you mean it is ascii encoded. C# string are UTF16, so you will probably need to convert it to allow it to be used in any meaningful way. See also bitconverter for an alternative way to read integer data.

Once you have converted the data to proper c# types it should be fairly trivial to do a binary search, just put the key and value in separate arrays and use Array.BinarySearch. Or use a custom type that implements IComparable<T>.

Or put the data into some other data structure, like a dictionary, for fast searching.

JonasH
  • 28,608
  • 2
  • 10
  • 23
1

We can modify standard binary-search algorithms to do this.

We need to modify it to refer to the array in Spans of the width you want. So we multiply the index by that width.

delegate int BytesComparer(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b);

int BinarySearchRecord(byte[] array, int recordWidth, ReadOnlySpan<byte> value, BytesComparer comparer)
{
    return BinarySearchRecord(array, 0, array.Length / recordWidth, value, recordWidth, comparer);
}

int BinarySearchRecord(byte[] array, int startRecord, int count, int recordWidth, ReadOnlySpan<byte> value, BytesComparer comparer)
{
    int lo = startRecord;
    int hi = startRecord + count - 1;
    while (lo <= hi)
    {
        int i = lo + ((hi - lo) >> 1);
        int order = comparer(new ReadOnlySpan<byte>(array, i * recordWidth, recordWidth), value);
 
        if (order == 0) return i;
        if (order < 0)
            lo = i + 1;
        else
            hi = i - 1;
    }
 
    return ~lo;
}

You can use it like this (it requires some bit-twiddling on little-endian systems, in order to sort strings correctly)

var yourArray = Encoding.ASCII.GetBytes("ABLE  xxxxBAKER xxxxCARL  xxxxDELTA xxxxEDGAR xxxxGEORGExxxx");

var valToFind = Encoding.ASCII.GetBytes("DELTA ").Concat(new byte[]{0x1, 0x2, 0x3, 0x4}).ToArray();

var foundIndex = BinarySearchRecord(yourArray, valToFind.Length, new ReadOnlySpan<byte>(valToFind), (a, b) =>
    {
        return BinaryPrimitives.ReverseEndianness(BitConverter.ToInt64(a) & 0xffffffffffff)
            .CompareTo(
               BinaryPrimitives.ReverseEndianness(BitConverter.ToInt64(b) & 0xffffffffffff));
    });

dotnetfiddle

Charlieface
  • 52,284
  • 6
  • 19
  • 43