0

I would like to use regex to increase the speed of my searches for specific records within a large binary image. It seems like regex searches always outperform my own search methods, so that's why I'm looking into this. I have already implemented the following, which works, but is not very fast.

My binary image is loaded into a Numpy memmap as words.

I_FILE = np.memmap(opts.image_file, dtype='uint32', mode='r')

And here is start of my search loop currently (which works):

for i in range(0, FILESIZE - 19):
    if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i])) and I_FILE[i] < 60):
    ...do stuff...

This is seeking out records that are 19 bytes long that start with a decimal sequence number between 0 and 59. It looks for an incrementing sequence on either a record before or after the current search location to validate the record.

I've seen a few examples where folks have crafted variables into string using re.escape (like this: How to use a variable inside a regular expression?) But I can't seem to figure out how to search for a changing value sequence.

Community
  • 1
  • 1
Dwiggit
  • 21
  • 4
  • Just a comment - the line `if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i])) and I_FILE[i] < 60):` tests at position -19 in the first iteration (the second test after the `or`). Is that intended? – SamWhan Jan 19 '17 at 14:25
  • @ClasG - I intentionally check for a valid record before OR after the current potential record. This ensures I have at least one instance of a sequential record at the beginning and the end of a set of records. The first instance of that line running would actually seek ahead of the image, but Python doesn't seem to care. – Dwiggit Jan 19 '17 at 14:31
  • I don't think this is feasible. It can be done, but then the regex would have to test for each increase/decrease separately. I.e. something like `\00[\x00-\xff]{18}\x01|\01[\x00-\xff]{18}\x00|\01[\x00-\xff]{18}\x02|\02[\x00-\xff]{18}\x01 ...` repeated up to `\x3b` would work, but probably not faster, and certainly not more readable than your current test. (I assume, as your code indicates, that you mean that the first byte is decimal 0-59, and not that it's a decimal string, like your description hints) – SamWhan Jan 19 '17 at 14:40
  • @ClasG - To clarify, the first word (4 bytes) is a decimal value stored in hex. Since my memmap is loaded as words, 'I_FILE[i]' will return the value of the word. Perhaps I could go about this by reducing my initial requirements. Maybe just have the regex search for two instances of values between 0 and 59 that are separated by 19 bytes. Then take the resulting list and check for incrementing values as a second step. – Dwiggit Jan 19 '17 at 14:47

1 Answers1

0

I managed to make it work with regex, but it was a bit more complicated than I expected. The regex expressions look for two values between 0 and 59 that are separated by 72 bytes (18 words). I used two regex searches to ensure that I wouldn't miss records at the end of a sequence:

# First search uses the lookahead assertion to not consume large amounts of data.
SearchPattern1 = re.compile(b'[\0-\x3B]\0\0\0(?=.{72}[\1-\x3B]\0\0\0)', re.DOTALL)
# Again using the positive lookbehind assertion (?<= ... ) to grab the ending entries.
SearchPattern2 = re.compile(b'(?<=[\0-\x3B]\0\0\0.{72})[\1-\x3B]\0\0\0', re.DOTALL)

Next, perform both searches and combine the results.

HitList1 = [m.start(0) for m in SearchPattern1.finditer(I_FILE)]
HitList2 = [m.start(0) for m in SearchPattern2.finditer(I_FILE)]
AllHitList = list(set(HitList1 + HitList2))
SortedHitList = sorted(AllHitList)

Now I run a search that has the same conditions as my original solution, but it runs on a much smaller set of data!

for i in range(0, len(SortedHitList)):
    TestLoc = SortedHitList[i]
    if (I_FILE[TestLoc] + 1 == I_FILE[TestLoc + 19]) or (I_FILE[TestLoc - 19] + 1 == I_FILE[TestLoc]):
        ... do stuff ...

The result was very successful! The original solution took 58 seconds to run on a 300 MB binary file, while the new regex solution took only 2 seconds!!

Dwiggit
  • 21
  • 4