3

I need to search a potentially very large structure of byte data (up to 4GB) for a given hex value. The issue is that the string input of hex values can be of any size including an odd number, e.g. "ABC". Instead of converting the byte data to string data and performing a string search (which I've done FWIW), I'm wondering about a possibly better performing algorithm that searches essentially nibble-by-nibble.

Given the size of source of data, the search may be performed in segments, of course. The output is ideally an enumerable of the starting indexes or addresses of the matches. For the purposes of this question, let's assume the data structure is a byte[] and output is an int index and a nibble offset (e.g. bool firstNibbleMatch).

Before undertaking the task of trying it out, any ideas regarding profiling, the costs of the shift operations vs. ascii conversions (whether via C# library or roll-you-own unsafe code, etc.), or any other unforeseens? My main concern is performance. Would this approach even perform better in theory?

sean
  • 367
  • 2
  • 12
  • 1
    How long are your patterns? What about (text) data size? – MBo Sep 28 '18 at 17:08
  • 1
    As written the question seems rather broad and open ended. Consider adding more detail about the input format, output format, and the structure being searched. Also, presenting two approaches, and asking a specific question about those, would certainly help narrow the question. – user3386109 Sep 28 '18 at 17:15
  • See following posting : https://stackoverflow.com/questions/283456/byte-array-pattern-search – jdweng Sep 28 '18 at 17:23
  • Ok, hope edits clear up some of the vagueness. – sean Sep 28 '18 at 17:31
  • 2
    For the example input in the question, I would first scan the entire `byte[]` array for `0xAB`. Every time that's found, check the next byte with `if ((array[i+1] & 0xF0) == 0xC0)`. Then rescan the entire array for `0xBC`. When found check the previous byte with `if ((array[i-1] & 0x0F) == 0x0A)` – user3386109 Sep 28 '18 at 17:50
  • 1
    You could just implement both your input and pattern to bytes and then find substrings with Knuth-Morris-Pratt algorithm. The complexity would be minimal: `O(n+k)` where `n` and `k` are the lengths of your input and pattern. – Yeldar Kurmangaliyev Oct 01 '18 at 10:46

0 Answers0