
I'm trying to use a regex to match the byte sequence of a header in a very large binary file.

My regex looks like this:

Pattern pattern = Pattern.compile("\u6484\u7194\u0018\u608c\u0e86\u7194");

or

Pattern pattern = Pattern.compile("\\x64\\x84\\x71\\x94\\x00\\x18\\x60\\x8c\\x0e\\x86\\x71\\x94");

Once the pattern is found, the match should cover 512 bytes in total, including the pattern itself, and put them into a variable (byte[] or char[]), for example with ...\\u7194.{250} or ...\\x94.{500}

There are a couple of ways to implement this. I don't want to buffer the entire file into a byte[] to match my pattern, because the files may be several gigabytes. Iterating through every single byte and waiting for the pattern works, but it is extremely slow and not realistic for large files. I also don't want to cut the file into chunks, because I would have to handle edge cases where the wanted 512 bytes span two chunks.
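For reference, the chunk edge case can be handled by carrying the last `pattern.length - 1` bytes of each chunk over into the next one, so a header that straddles a boundary is still seen whole. The following is only a sketch under assumptions: the class and method names (`ChunkedHeaderSearch`, `indexOf`, `readBlock`) are hypothetical, the header is treated as a fixed byte sequence rather than a regex, and it assumes the slowness came from single-byte `read()` calls rather than from the comparison itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical helper, not an existing library class.
final class ChunkedHeaderSearch {

    /** Returns the file offset of the first occurrence of pattern, or -1 if absent. */
    static long indexOf(InputStream in, byte[] pattern, int chunkSize) throws IOException {
        if (pattern.length == 0 || chunkSize < 1) {
            throw new IllegalArgumentException("pattern must be non-empty and chunkSize >= 1");
        }
        int overlap = pattern.length - 1;   // longest header prefix that can dangle at a chunk end
        byte[] buf = new byte[chunkSize + overlap];
        int carried = 0;                    // bytes kept from the previous chunk
        long bufStart = 0;                  // file offset of buf[0]
        int read;
        while ((read = in.read(buf, carried, buf.length - carried)) > 0) {
            int valid = carried + read;
            for (int i = 0; i + pattern.length <= valid; i++) {
                if (matchesAt(buf, i, pattern)) {
                    return bufStart + i;
                }
            }
            // Keep the unchecked tail so a header split across two reads is still found.
            int keep = Math.min(overlap, valid);
            System.arraycopy(buf, valid - keep, buf, 0, keep);
            bufStart += valid - keep;
            carried = keep;
        }
        return -1;
    }

    private static boolean matchesAt(byte[] buf, int at, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (buf[at + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to len bytes starting at offset (fewer if the file ends first). */
    static byte[] readBlock(Path file, long offset, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            int n = (int) Math.min(len, Math.max(0, raf.length() - offset));
            byte[] block = new byte[n];
            raf.seek(offset);
            raf.readFully(block);
            return block;
        }
    }
}
```

With this split, the 512 bytes never need to fit inside one chunk: `indexOf` only has to locate the 12 header bytes, and `readBlock(file, offset, 512)` then fetches the block directly from the file.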

Matching a pattern on a stream of the byte data would be ideal, but sadly I couldn't find a way to do it without interpreting the data as a String first. For example, a Scanner with the file as input can match a regex on the entire file (with Scanner.findWithinHorizon(String pattern, int horizon)), but this only works for character data. Transforming the data into a CharSequence changes the content and makes pattern matching with \x or \u impossible. It only matches hex values that actually map to printable characters, like the first \x64 matching the character "d".
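One thing worth checking here is the charset used for the conversion. ISO-8859-1 maps each byte value 0x00 to 0xFF onto the char with the same number, so decoding with it should leave every byte recoverable and make the char index equal to the byte offset. The sketch below shows that on an in-memory array only; the class name `Latin1RegexDemo` and the method `firstBlock` are made up for illustration, and it assumes the `\x` form of the pattern (the `\u6484` form packs two bytes into one char and could not match text decoded this way).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not an existing library class.
final class Latin1RegexDemo {

    // The 12 header bytes from the question followed by any 500 bytes (512 in total).
    // DOTALL is needed so '.' also matches bytes such as 0x0A, 0x0D and 0x85.
    static final Pattern HEADER_BLOCK = Pattern.compile(
            "\\x64\\x84\\x71\\x94\\x00\\x18\\x60\\x8c\\x0e\\x86\\x71\\x94.{500}",
            Pattern.DOTALL);

    /** Returns the first 512-byte block starting with the header, or null if none is found. */
    static byte[] firstBlock(byte[] data) {
        // One byte becomes exactly one char in the range U+0000..U+00FF.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER_BLOCK.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Encoding with the same charset restores the original bytes.
        return m.group().getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The Scanner constructors also accept a charset name such as "ISO-8859-1", so the same idea may carry over to findWithinHorizon. How much memory Scanner buffers while searching a multi-gigabyte file is not something this sketch measures.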

Is there a way to match my hex pattern intelligently against big files without separating the file or iterating byte-wise, using Java (ideally 1.8)? Something like the Scanner example, just without transforming the representation.
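One possible direction, since `Pattern.matcher` accepts any `CharSequence`: a thin read-only view over a `ByteBuffer` that reports each byte as a char in the range 0 to 255, so nothing is copied or decoded. This is a sketch only; `ByteCharSequence` and its `indexOf` helper are hypothetical names, and the example wraps a small in-memory array rather than a real file.

```java
import java.nio.ByteBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical adapter, not an existing library class.
final class ByteCharSequence implements CharSequence {
    private final ByteBuffer buf;
    private final int start;
    private final int end;

    /** Views the bytes from index 0 up to the buffer's limit. */
    ByteCharSequence(ByteBuffer buf) {
        this(buf, 0, buf.limit());
    }

    private ByteCharSequence(ByteBuffer buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        // Absolute get; the mask turns the signed byte into 0..255.
        return (char) (buf.get(start + index) & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        return new ByteCharSequence(buf, start + from, start + to);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    /** Byte offset of the first match within the buffer, or -1 if there is none. */
    static int indexOf(ByteBuffer buf, Pattern pattern) {
        Matcher m = pattern.matcher(new ByteCharSequence(buf));
        return m.find() ? m.start() : -1;
    }
}
```

The buffer passed in could be a `MappedByteBuffer` obtained from `FileChannel.map`, which avoids loading the file onto the heap. A single mapping is limited to `Integer.MAX_VALUE` bytes, so files larger than about 2 GB would still need several mapped windows.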

Here is an example of the 512 bytes from a binary file that should be selected:

Yaldabaoth
  • Does this option help: https://stackoverflow.com/a/31308505/4158037? You can check out https://stackoverflow.com/a/29102101/4158037 as well, but it looks like that implementation reads 1 byte at a time. – Prasanna Nov 15 '20 at 14:59
