N.B. EDITED following clarification in comments
It may be possible to find your multi-line blocks using a regex. You can certainly write a regex that would match them, e.g. .*\[MAP\]((?s).*)\[\\MAP\], noting that in Java you would also have to escape all the \ characters, and that (?s) allows the . character to match newlines, i.e.
String mapBlockRegex = ".*\\[MAP\\]((?s).*)\\[\\\\MAP\\]";
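As a rough illustration of the regex approach for input that does fit in memory, here is a minimal sketch. It is a variation rather than the exact pattern above: it assumes Pattern.DOTALL in place of the inline (?s), a reluctant .*? so that each block ends at its nearest closing tag, and find() rather than matches(). The class name, method name and sample text are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MapBlockRegexDemo {
    // DOTALL lets '.' match newlines; the reluctant '.*?' stops at the first closing tag.
    private static final Pattern MAP_BLOCK =
            Pattern.compile("\\[MAP\\](.*?)\\[\\\\MAP\\]", Pattern.DOTALL);

    // Returns the text found between each [MAP] ... [\MAP] pair.
    static List<String> findBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = MAP_BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, held entirely in memory.
        String sample = "noise\n[MAP]\nline 1\nline 2\n[\\MAP]\nmore noise\n[MAP]x[\\MAP]\n";
        for (String block : findBlocks(sample)) {
            System.out.println("--- block ---");
            System.out.println(block.trim());
        }
    }
}
```

This only helps when the whole text is available as a single String, which is exactly the limitation discussed next.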
However, as you have highlighted, this is a problem if the file won't fit in memory, and splitting the file has its own difficulty, since a split could land in the middle of a block.
I'll propose a different idea: scan the file line by line and use a state variable to indicate whether you are in a block or not. The basic algorithm is as follows:
- When you match the start of the block, set the state variable to true.
- While the state is true, append the text to a StringBuilder.
- When you match the end of the block, set the state variable to false and use the String you have built up, e.g. output it to a file, to the console, or use it programmatically.
Java solution
I'll suggest one way to implement the above using a Scanner, which goes through a stream line by line, discarding lines as it goes, thus avoiding OutOfMemoryError. Note this code can throw exceptions. I have thrown them on, but you could put them in a try..catch..finally block. Also note that Scanner swallows IOException but, as the docs say, if this is important to you:
The most recent IOException thrown by the underlying readable can be retrieved via the ioException() method.
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class LogScanner
{
    public static void main(String[] args) throws FileNotFoundException
    {
        String path = "D:\\hadoopTest.log";
        String blockStart = ".*\\[MAP\\].*";
        String blockEnd = ".*\\[\\\\MAP\\].*";
        boolean inBlock = false;
        StringBuilder block = null;

        // try-with-resources closes the Scanner (and its stream) even if an exception is thrown
        try (Scanner sc = new Scanner(new FileInputStream(path), "UTF-8")) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                // Start marker: begin collecting a new block
                if (line.matches(blockStart)) {
                    inBlock = true;
                    block = new StringBuilder();
                }
                // Inside a block (marker lines included): keep the line
                if (inBlock) {
                    block.append(line);
                    block.append("\n");
                }
                // End marker: the block is complete. Checking inBlock avoids a
                // NullPointerException on an end marker that has no matching start.
                if (inBlock && line.matches(blockEnd)) {
                    inBlock = false;
                    String completeBlock = block.toString();
                    // I'm outputting the block to stdout; you could append to a file or whatever.
                    System.out.println(completeBlock);
                }
            }
        }
    }
}
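On the point about Scanner swallowing IOException: a minimal sketch of how you might surface it once the loop finishes. The class and method names are hypothetical, and wrapping the error in UncheckedIOException is just one option; the only API relied on is Scanner.ioException(), which returns null if no error occurred.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerIoCheck {
    // Counts lines, then rethrows any IOException the Scanner swallowed while reading.
    static int countLines(InputStream in) {
        int count = 0;
        try (Scanner sc = new Scanner(in, "UTF-8")) {
            while (sc.hasNextLine()) {
                sc.nextLine();
                count++;
            }
            // hasNextLine() returning false may mean end of input OR a read error.
            IOException swallowed = sc.ioException();
            if (swallowed != null) {
                throw new UncheckedIOException(swallowed);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory input standing in for the log file.
        InputStream in = new ByteArrayInputStream(
                "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countLines(in));
    }
}
```

Without such a check, a read failure partway through the file looks the same as reaching the end of the file.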
Caveat: Your file may have characteristics where this won't work without some adaptation. If you can have nested [MAP] blocks, then inBlock will need to be an int, which you increment when you match the block start and decrement when you match the block end, appending for any inBlock > 0 and only outputting the complete string when inBlock goes back to zero.
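The nested variant could be sketched roughly as below. This is an assumption-laden illustration rather than tested production code: the class and method names are invented, the counter is called depth instead of inBlock, and blocks are collected into a List rather than printed. It also assumes markers are properly balanced, apart from ignoring a stray end marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class NestedBlockScanner {
    static final String BLOCK_START = ".*\\[MAP\\].*";
    static final String BLOCK_END = ".*\\[\\\\MAP\\].*";

    // Returns each outermost block, with any nested blocks kept inside it.
    static List<String> extractBlocks(Scanner sc) {
        List<String> blocks = new ArrayList<>();
        StringBuilder block = new StringBuilder();
        int depth = 0; // how many [MAP] markers are currently open
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
            if (line.matches(BLOCK_START)) {
                depth++;
            }
            if (depth > 0) {
                block.append(line).append('\n');
            }
            // Only decrement when a block is open, so a stray end marker is ignored.
            if (depth > 0 && line.matches(BLOCK_END)) {
                depth--;
                if (depth == 0) {
                    blocks.add(block.toString());
                    block.setLength(0); // reuse the builder for the next block
                }
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical sample with one block nested inside another.
        String sample = "a\n[MAP]\nouter\n[MAP]\ninner\n[\\MAP]\ntail\n[\\MAP]\nb\n";
        for (String b : extractBlocks(new Scanner(sample))) {
            System.out.print(b);
        }
    }
}
```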
Command line split where seeking matches on a single line
If you were searching on a per-line basis and matches were guaranteed to be on a single line, then splitting would be OK as long as the splits only happened at the end of complete lines.
In that case, you could use the command line to split the file. If you're on Linux (or, I think, any *nix) you can use the split command (--lines is the GNU long option; POSIX specifies the short form -l), e.g.
split --lines=75000
There is more detail in this question and answer
On Windows there isn't an equivalent command that I'm aware of, but you can install tools that do something similar, e.g. GNU CoreUtils for Windows or 7-Zip. Caveat: I've never used these for splitting.
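If installing extra tools isn't an option, a small Java program can do a similar line-boundary split. The following is only a sketch under assumptions: the class name, the .partN naming scheme and the UTF-8 encoding are all choices made for the example, and like split it is only safe for matches that never span more than one line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LineSplitter {
    // Writes chunks named <source name>.part0, .part1, ... into outDir; returns the chunk count.
    static int split(Path source, Path outDir, int linesPerChunk) {
        int chunks = 0;
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            while (line != null) {
                Path chunk = outDir.resolve(source.getFileName() + ".part" + chunks);
                try (BufferedWriter writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
                    // Only whole lines are written, so no line is ever cut in half.
                    for (int i = 0; i < linesPerChunk && line != null; i++) {
                        writer.write(line);
                        writer.newLine();
                        line = reader.readLine();
                    }
                }
                chunks++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: LineSplitter <source file> <output dir>");
            return;
        }
        // 75000 mirrors the split example above; adjust to taste.
        System.out.println(split(Paths.get(args[0]), Paths.get(args[1]), 75000) + " chunks written");
    }
}
```

Only one line is held in memory at a time, so the size of the source file should not matter.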