2

I have binary files that consist of many 24 byte packets where the first 8 bytes in each packet represent the serialized timestamp of type DateTime. The packets are all ordered in ascending fashion by timestamp. I want to develop a binary search algorithm that picks the first 8 bytes, deserializes the timestamp, and compares it with the desired timestamp.

Goal is to find the position in the binary file that represents the starting position of the serialized timestamp that matches the desired timestamp.

Edit The data is in a binary file and not in a data structure, hence List<T>.BinarySearch() won't work for me. But might it be possible to use BinarySearch on a Stream with CustomComparer?

The file contains many dozens of millions of such packets hence a simple iteration through the file would be highly inefficient. I consider a binary search approach.

Matt
  • 7,004
  • 11
  • 71
  • 117
  • 2
    The problem using a list of structures is you must read all the data before searching, may be just reading the stream until you find what you want will be more efficient – Gusman Jan 27 '17 at 12:43
  • 1
    @Codor, a little more colour perhaps? – Matt Jan 27 '17 at 12:43
  • @Gusman, sorry will edit, I did not mean to search a list as the data is in the binary file and the whole exercise is to circumvent having to read the entire file or from the beginning until the time stamp is found. I want to be more efficient – Matt Jan 27 '17 at 12:44
  • Then just use a binary reader, read the 8 bytes timestamp and compare it with the desired timestamp, if it's not the correct one skip 16 bytes with Seek and repeat, I can't think in anything more efficient than that. – Gusman Jan 27 '17 at 12:46
  • @Gusman, it is a file that contains dozens of millions of packets. It would be highly inefficient. What I am seeking is either a binary search algorithm or similarly efficient approach. – Matt Jan 27 '17 at 12:47
  • Is the data in the file already sorted by the timestamp? – Codor Jan 27 '17 at 12:51
  • well, then you must roll your own binary search algorithm, instead of using an array and index to search you must use the offsets at the file to retrieve the elements. – Gusman Jan 27 '17 at 12:52
  • 1
    yes it is sorted by timestamp – Matt Jan 27 '17 at 12:52
  • A binary search (or pivot search as it was called eons ago) requires reading all of the data before the search begins. It also requires the data to be sorted, so if it is not, you have to sort before beginning the search. You pick a pivot point as close as possible to the center of the data and do the first compare, if the desired data is before the pivot you keep everything below (same if it is above, you keep everything above). Find the next pivot close as possible to the center of the remaining data. The efficiency comes in the fact that at each compare, you are eliminating half of the data – Kevin Jan 27 '17 at 12:53
  • 1
    I think a custom implementation of binary search is necessary then; you would perhaps need a stream in which the reading cursor can be repositioned, as [here](https://msdn.microsoft.com/de-de/library/system.io.filestream.seek(v=vs.110).aspx); perhaps a so-called [memory-mapped file](https://msdn.microsoft.com/de-de/library/system.io.memorymappedfiles.memorymappedfile(v=vs.110).aspx) can also help. – Codor Jan 27 '17 at 12:53
  • 1
    Yes, that is what I am thinking, to move the position in the base stream of a binary reader back and forth. I was just hoping for a `CustomComparer` to save me the binary-search implementation part. – Matt Jan 27 '17 at 12:56
  • 1
    Apparently not, but [snippets](http://codereview.stackexchange.com/questions/6338/generic-binary-search) for binary search online are dime a dozen. – Codor Jan 27 '17 at 12:58
  • 1
    Just a crazy idea, suppose you create an IList which takes on it's constructor a file name and creates a stream to the file, this IList provides as Count the file size / 24 and when you retrieve the element it Seeks the stream to 24 * Index, reads the 8 bytes and returns it as a long, in this way you can use all the default facilities from .net to do element searches, you can even use Linq! – Gusman Jan 27 '17 at 12:58
  • @Kevin it does not require reading all data before beginning the search. packets are already sorted and constant 24 bytes. assuming there is no header the number of packets can be measured very simple by dividing size of the binary file by 24. if there is header then size of file minus size of header divided by 24. – M.kazem Akhgary Jan 27 '17 at 12:59
  • I'm not sure, but I believe that a memory mapped file supports lazy loading of data. – Codor Jan 27 '17 at 13:00
  • @Gusman, very much like your idea though I doubt the stream can be fed/diverted to an IList as I believe it requires all elements to be present/evaluated. – Matt Jan 27 '17 at 13:02
  • @M.kazemAkhgary, that is accurate. There is a header but that can be easily adjusted for with an offset in `binaryReader.BaseStream.Seek(offset, ...)` – Matt Jan 27 '17 at 13:04
  • implementing binary search isn't hard at all. look at Wikipedia for sudo code – M.kazem Akhgary Jan 27 '17 at 13:04
  • no, the IList internally has the stream, you must implement manually the ILIst interface on your own class – Gusman Jan 27 '17 at 13:04
  • Also, here is a custom implementation of a binarysearch that you can use if you implement the IList: http://stackoverflow.com/questions/967047/how-to-perform-a-binary-search-on-ilistt – Gusman Jan 27 '17 at 13:06
  • thanks, I feel intrigued about the IList approach. – Matt Jan 27 '17 at 13:10

2 Answers2

3

Haven't tested it, but the point is to read 8 bytes at in the middle of the file, than move to right or left middle and repeat, depending on read timestamp. (not the cleanest code). Complexity would be Log(N)

public class BinaryFinder
{
    private readonly long _packagesCount;
    private readonly FileStream _reader;

    public BinaryFinder(FileStream reader, int packageSize)
    {
        _reader = reader;
        _packagesCount = reader.Length / packageSize;
    }


    public long Find(DateTime dateToSearch)
    {
        return Find(0, _packagesCount, dateToSearch);
    }

    private long Find(long minPosition, long maxPosition, DateTime dateToSearch)
    {
        while (minPosition<=maxPosition) {
            var newPosition = (minPosition + maxPosition) / 2;
            var readDate = ReadDateAt(newPosition);

            if (readDate == dateToSearch) {
                return newPosition;
            }

            if (dateToSearch < readDate){
                maxPosition = newPosition-1;
            }
            else {
                minPosition = newPosition+1;
            }
        }

        return -1;
    }

    private DateTime ReadDateAt(long middlePosition)
    {
        var buffer = new byte[8];

        _reader.Seek(middlePosition, SeekOrigin.Begin);
        _reader.Read(buffer, 0, buffer.Length);

        var currentDate = ConvertBytesToDate(buffer);
        return currentDate;
    }

    private static DateTime ConvertBytesToDate(byte[] dateBytes)
    {
        throw new NotImplementedException();
    }
}
Artiom
  • 7,694
  • 3
  • 38
  • 45
  • 1
    This is recursive and the user stated in the file can be millions of items, wouldn't it cause an StackOverflow on the worst cases? – Gusman Jan 27 '17 at 13:08
  • 1
    @Gusman it can be done without recursion too. Good point – Artiom Jan 27 '17 at 13:09
  • 1
    Yes, I know it can be done without recursion and just a loop, only was stating the problem in case the user would try it. – Gusman Jan 27 '17 at 13:10
  • @Gusman no it will not cause StackOverflow exception. binary search has time complexity of `O(Log n)`, Log 1000000 is 6 that means maximum of 6 calls and you find your item! its pretty amazing isn't it? – M.kazem Akhgary Jan 27 '17 at 13:17
  • @M.kazemAkhgary it's 6, for base 10. In the sample above base is 2. Therefore it will be around 20 steps maximum – Artiom Jan 27 '17 at 13:19
  • oh. sorry! yes 20 steps but still wont cause SO exception – M.kazem Akhgary Jan 27 '17 at 13:21
  • Yes, @M.kazemAkhgary is right, even if is log2, I didn't took the time to calculate it, just seeing a recursive function over millions of elements triggered my inner alarms, but they were wrong :D – Gusman Jan 27 '17 at 13:22
  • @Artiom, are you perhaps looking to pass in an instance of `BinaryReader` rather than FileSteam? I am asking because the `Read` method for `BinaryReader` takes as 2nd parameter an index which I believe you intend to use. The Read method looks for an offset in the to be populated byte array, however. – Matt Jan 27 '17 at 14:15
  • @MattWolf by bad. Actually in both cases index/offset is rekatively to buffer. So you via `Seek` method you can change current position of the stream. I've updated the answer. – Artiom Jan 27 '17 at 14:22
  • Awesome. Thanks a lot for your solution. – Matt Jan 27 '17 at 14:23
  • @Artiom, btw you got your sign reversed: `if(currentTimeStamp < dateToSearch)` ought to be `if(currentTimeStamp > dateToSearch)` – Matt Jan 27 '17 at 15:25
  • @MattWolf thanks, updated the code – Artiom Jan 27 '17 at 15:27
  • @Artiom, sorry to be pedantic (though its important), but you need to adjust the minPosition and maxPosition up by 24 and down by 24, respectively. Else you may never exit the loop. – Matt Jan 27 '17 at 15:31
  • @MattWolf updated the sample – Artiom Feb 15 '17 at 06:52
1

Ok, here is the crazy idea in code, check it, it will return the index of the structure for the timestamp you're seeking.

Just instantiate a FileStructList(fileName) and then do list.BinarySearchIndexOf(theTimeStamp);

You can even pass it your own comparer :)

This includes the binary search on the code, but as it's an IList you can use any search method available for collections.

public class FileStructList : IList<long>
{

    Stream baseStream;
    BinaryReader reader;
    int length;
    int headerSize;

    public FileStructList(string FileName, int HeaderSize)
    {
        baseStream = File.OpenRead(FileName);
        reader = new BinaryReader(baseStream);
        length = (int)((baseStream.Length - HeaderSize) / 24);
        headerSize = HeaderSize;
    }

    public long this[int index]
    {
        get
        {
            baseStream.Seek(24 * index + headerSize, SeekOrigin.Begin);
            return reader.ReadInt64();
        }

        set
        {
            throw new NotImplementedException();
        }
    }

    public int Count
    {
        get
        {
            return length;
        }
    }

    public bool IsReadOnly
    {
        get
        {
            return true;
        }
    }

    public void Add(long item)
    {
        throw new NotImplementedException();
    }

    public void Clear()
    {
        throw new NotImplementedException();
    }

    public bool Contains(long item)
    {
        return BinarySearchIndexOf(item) != -1;
    }

    public void CopyTo(long[] array, int arrayIndex)
    {
        throw new NotImplementedException();
    }

    public IEnumerator<long> GetEnumerator()
    {
        throw new NotImplementedException();
    }

    public int IndexOf(long item)
    {
        return BinarySearchIndexOf(item);
    }

    public void Insert(int index, long item)
    {
        throw new NotImplementedException();
    }

    public bool Remove(long item)
    {
        throw new NotImplementedException();
    }

    public void RemoveAt(int index)
    {
        throw new NotImplementedException();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        throw new NotImplementedException();
    }

    public Int32 BinarySearchIndexOf(long value, IComparer<long> comparer = null)
    {
        comparer = comparer ?? Comparer<long>.Default;

        Int32 lower = 0;
        Int32 upper = length - 1;

        while (lower <= upper)
        {
            Int32 middle = lower + (upper - lower) / 2;
            Int32 comparisonResult = comparer.Compare(value, this[middle]);
            if (comparisonResult == 0)
                return middle;
            else if (comparisonResult < 0)
                upper = middle - 1;
            else
                lower = middle + 1;
        }

        return -1;
    }
}
Gusman
  • 14,905
  • 2
  • 34
  • 50
  • How would I account for an offset? (I store header information in the first 10,000 bytes. ) – Matt Jan 27 '17 at 13:44
  • 1
    Let me modify the code, is simple, pass the offset at the constructor and sum it to the seek offset – Gusman Jan 27 '17 at 13:45
  • This is a very neat solution. Thank you!!! – Matt Jan 27 '17 at 13:57
  • Glad you like it :). You can even extend it to modify the timestamps implementing the set or implement the GetEnumerator to allow foreach() on the elements. I know adding foreach() support and so on is not efficient for your final target, but can be a neat feature if you want to manipulate the data on the file. – Gusman Jan 27 '17 at 13:59
  • Yes that is what I like about the solution, its quite extendable. I am saying that because I also look to implement packet inserts but would need to check whether an open stream can be adjusted in size or whether I would need to create a new file if I wanted to insert elements which would enlarge the file size – Matt Jan 27 '17 at 14:05
  • I marked your answer as solution because I like your approach and because the other solution contained too many errors. I have not tested your code yet but `Int32 middle = lower + (upper - lower) / 2;` looks a bit funny. Should it not simply be `middle = (lower + upper)/2`? (I guess you copied code from a problem that assumes a descending order...?) – Matt Jan 27 '17 at 15:34
  • The binary code is from the post I pasted on the comments, but I think the code is correct, it takes the lower bound and increases to the difference between them divided by two, it "folds" on each iteration the data. – Gusman Jan 27 '17 at 15:51
  • Sorry, am a bit tired, its actually the same. Sorry – Matt Jan 27 '17 at 16:01
  • Np, I know how a problem can get your head messed up :D – Gusman Jan 27 '17 at 16:02