Is there any efficient method to find the first byte of all instances of a particular 4 byte block in a file?

Question

I have files that contains archived binary messages. A small file is around 600MB and contains nearly 9000 messages. Each message begins with a particular four byte flag that I know, which indicates the first four bytes of the message header (and as such must be captured). The message header is a fixed size for all messages. The message header is followed by a payload of a size that is identified in the header. Once I've found the start of a particular message header, I know how many bytes to the end of the header and can use that to extract the number of bytes in the message I need to parse this archive file and isolate each message for processing, making sure that I include all bytes from the first byte of the four byte flag to the end of the specified message length. There is some padding between the messages that varies.

Due to the size of the file, I don't want to (and probably can't in all cases) consume the file as a single array. Therefore, I'm looking at things like RandomAccessFile and FileInputStream. It doesn't seem like it's a simple task to scan a file for a particular sequence of bytes and then take every byte from the first byte in that sequence through a known length. RandomAccessFile, especially the read(byte[]) and seek() methods seem like they will allow me to implement a solution.

To give an idea, my current implementation involves a method called findFlag() that takes a start position in the RandomAccessFile. It seeks to that position and reads the four bytes starting there. If it finds the flag, it returns startPos. Otherwise, it calls itself recursively, moving to startPos + 1 and repeats until it finds the flag. Since I know the last byte I read as part of the data message, I would start seeking there:

file.seek(startPos);

byte[] possibleFlag = new byte[4];

file.read(possibleFlag, 0, possibleFlag.length);

if (Arrays.equals(ByteUtils.intToBytes(Message.FLAG), possibleFlag)) {
    return startPos;
}
else {
    return findFlag(startPos + 1);
}

Am I overlooking something, either in Java (Java 6 or earlier) or in a well-tested external library (such as an Apache library or similar)? If not, are there better solutions for dealing with binary data in Java or any approaches that are particularly well-suited for my problem?

Have you looked at http://stackoverflow.com/questions/644737/are-there-any-java-frameworks-for-binary-file-parsing ? Preon seems to be something you might consider. — Ewald, Jun 05 '12 at 15:00
@Ewald Not sure how helpful Preon would be. Until I isolate the messages, there's no consistent format for the file I'm reading. The only given is the fact that the same four byte sequence denotes the start of every message in the file. — Thomas Owens, Jun 05 '12 at 15:07
I think *something* like what you're currently doing is pretty efficient. Just read in the stream of bytes checking for the marker bytes etc. What else would you want? — Torious, Jun 05 '12 at 15:22
@Torious It just seems really awkward, clumsy, and potentially inefficient. Maybe there is no better solution. For obvious reasons, I don't have ready access to the large data files, so I'm playing with 500-700MB test files to ensure that my application's logic is correct. — Thomas Owens, Jun 05 '12 at 15:25
Maybe it is better to skip more bytes when you are sure the magic string cannot be there; ie, as you read 4 bytes per time, if magic is CAFE and you get BABE, you can directly move forward 4 bytes. — guido, Jun 05 '12 at 15:25
I wouldn't do this recursively, that's for sure. Other than that, what you're doing is roughly right. You can save a few reads by using a more sophisticated string-searching algorithm, like Boyer-Moore, but i doubt it will give any measurable gain in performance. — Tom Anderson, Jun 05 '12 at 15:54

johan · Answer 1 · 2012-06-05T15:41:03.103

2

scan through the file using java.nio.channels.FileChannel it uses less intermediate copies to map a file into memory. benchmark of alternatives

edited Jun 05 '12 at 15:41

answered Jun 05 '12 at 14:58

johan

21
2

There doesn't seem to be that much of a difference between `RandomAccessFile` and `FileChannel` in terms of functionality provided. Either way, I'd still be rolling my own "search file for next index of byte sequence" method. – Thomas Owens Jun 05 '12 at 15:03
I have now posted a code snippet as to how I'm using `RandomAccessFile`. How would using a `FileChannel` be an improvement over my existing code? – Thomas Owens Jun 05 '12 at 15:13

score 1 · Answer 2 · answered Jun 05 '12 at 15:46

This whole approach seems invalid. How do you know the magic bytes won't appear somewhere else? For example in the payload or in the padding. I hope you are taking this into account.

Get rid of recursion. Java doesn't do tail call elimination. Iterative version should be clearer and faster.

Limit the number of allocations. Allocating two arrays per every byte in file is completely unacceptable.

You don't have to worry about buffer sizes and allocations if you use FileChannel. You can iterate through the file using MappedByteBuffer.getInt(int) and compare it to Message.FLAG. That's just one simple for loop.

score 0 · Answer 3 · answered Jun 05 '12 at 21:28

This seems to me horribly inefficient. The most expensive operation on a file is the random part - moving the internal pointer back and forth. And you do that for every single byte. +4, -3, +4, -3, etc... performance death waltz. You could perfectly do it with just forward movement. Start searching just for the first byte of signature instead of the whole sequence. If match, test the next byte. In case of any fail, just restart searching for the first byte. 4 successes in row means you have your signature. All the time, you just keep going forward. Avoid seeks at any cost.

Also, unless you absolutely don't care how long your processing will take, you should not dismiss FileChannel just on the grounds of functionality. The referenced stat is talking about MINUTES per 100MB and i can support that observation. FileChannel is TWO ORDERS OF MAGNITUDE faster than RandomAccessFile with small read sizes - you need the smallest one :)

And while recursion is generally considered a sign of fearless programmer, this particular usage could easily blow the lid off your VM if you feed it several hundreds of MBs of data which do not contain any signature.

Is there any efficient method to find the first byte of all instances of a particular 4 byte block in a file?

3 Answers3