Java : How to find string patterns in a LARGE binary file?

Question

I'm trying to write a program that will read a VERY LARGE binary file and try to find the occurrence of 2 different strings and then print the indexes that matches the patterns. For the example's sake let's assume the character sequences are [H,e,l,l,o] and [H,e,l,l,o, ,W,o,r,l,d].

I was able to code this for small binary files because I was reading each character as a byte and then saving it in an Arraylist. Then starting from the beginning of the Arraylist, I was comparing the byte arraylist(byte[] data) with the byte[] pattern.

I need to find a way to do the same but WITHOUT writing the entire binary file in memory for comparison. That means I should be able to compare while reading each character (I should not save the entire binary file in memory). Assume the binary file only contains characters.

Any suggestions on how this can be achieved ? Thank you all in advance.

What's about file encoding? It is a little bit strange to search for **text** in **binary** file. — Alexey Malev, Apr 28 '14 at 20:33
You may want to ground "VERY LARGE". One person's "very large" is another person's "not that large". — pamphlet, Apr 28 '14 at 20:43
Very Large as in my computer memory wont be able to read everything before trying to match the patterns. I would be out of memory before reading every character into memory... — user3582822, Apr 28 '14 at 20:53
The file is indeed encoded... i have to decode it and then match the patterns... instead of retrieving bytes im now retrieving the char directly by using the inputstream objects readChar() method instead of readByte(). — user3582822, Apr 28 '14 at 22:39

score 5 · Answer 1 · answered Apr 28 '14 at 20:37

5

Seems like you are really looking for Aho-Corasick string matching algorithm.

The algorithm builds an automaton from the given dictionary you have, and then allows you to find matches using a single scan of your input string.

The wikipedia article links to this java implementation

answered Apr 28 '14 at 20:37

amit

175,853
27
231
333

I was trying to steer the OP gently towards this and bang, you shot it down. :) – biziclop Apr 28 '14 at 20:42
@biziclop It doesn't sound like a homework question to me, and there is no point to reinvent the wheel if it's not. – amit Apr 28 '14 at 20:43
Other than, possibly, the experience of learning to build rounder wheels. :) (Great answer, BTW). – pamphlet Apr 28 '14 at 20:45
1

True but I always prefer to understand the tools I'm using. – biziclop Apr 28 '14 at 20:47
@biziclop I didn't invent the wheel and I still understand how it works... Not reinventing the wheel and understanding how the tools work are not mutually exclusive. – amit Apr 28 '14 at 21:22

score 2 · Accepted Answer · answered Apr 28 '14 at 20:34

2

Google "finite state machine".

Or, read the file one byte at a time, if the byte just doesn't match the first character of the search term, go on to the next byte. If it does match, now you're looking for the next character in the sequence. I.e., your state has gone from 0, to 1. If your state equals (or passes) the length of the search string, you found it!

Implementation/debugging left to the reader.

answered Apr 28 '14 at 20:34

pamphlet

2,054
1
17
27

Note that while the "finite state machine" is indeed the way to go, an efficient implementation that deals with big dictionaries is not at all trivial to do it in linear time. – amit Apr 28 '14 at 20:38
Awesome thank you !!! Forgot about finite state machines. I got it working for 1 pattern. Wonder if its going to be difficult to modify my current code so that it finds the second pattern as well. (the first pattern is a substring of the second...) – user3582822 Apr 28 '14 at 22:33

score 0 · Answer 3 · answered Apr 28 '14 at 20:35

There are specialised algorithms for this but let's try a simple one first.

You can start with making the comparison on the fly, always after reading the next byte. Once you do that, it's easy to spot that you don't need to keep any bytes that are from earlier than your longest pattern.

So you can just use a buffer that is as long as your longest pattern, put new bytes in at one end and drop them at the other.

As I said, there are algorithms more effective than this but it's a good start.

score 0 · Answer 4 · edited May 23 '17 at 11:49

Use a FileInputStream wrapped in a BufferedInputStream and compare each byte. Keep a buffer the length of the sequence you're looking for so you backtrack if it doesn't match at some point. If the sequence you're looking for is too large, you could save the offset and re-open the file for reading.

Working with streams: http://docs.oracle.com/javase/tutorial/essential/io/
String matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm

Or if you just want something to copy and paste you could look at this SO question.

Java : How to find string patterns in a LARGE binary file?

4 Answers4

Linked