I have hundreds of large (6GB each) gzipped log files that I'm reading via GZIPInputStream and wish to parse. Suppose each one has the format:
Start of log entry 1
...some log details
...some log details
...some log details
Start of log entry 2
...some log details
...some log details
...some log details
Start of log entry 3
...some log details
...some log details
...some log details
I'm streaming the gzipped file contents line by line through BufferedReader.lines(). The stream looks like:
[
"Start of log entry 1",
" ...some log details",
" ...some log details",
" ...some log details",
"Start of log entry 2",
" ...some log details",
" ...some log details",
" ...some log details",
"Start of log entry 3",
" ...some log details",
" ...some log details",
" ...some log details",
]
The start of every log entry can be identified by the predicate: line -> line.startsWith("Start of log entry"). I would like to transform this Stream<String> into a Stream<Stream<String>> according to that predicate. Each "substream" should start when the predicate is true and collect lines while it is false, until the next time the predicate is true, which marks the end of that substream and the start of the next. The result would look like:
[
[
"Start of log entry 1",
" ...some log details",
" ...some log details",
" ...some log details",
],
[
"Start of log entry 2",
" ...some log details",
" ...some log details",
" ...some log details",
],
[
"Start of log entry 3",
" ...some log details",
" ...some log details",
" ...some log details",
],
]
From there, I can take each substream and map it through new LogEntry(Stream<String> logLines) so as to aggregate related log lines into LogEntry objects.
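(LogEntry itself isn't the interesting part; for this question a placeholder along these lines is all I mean, with the real field parsing omitted:)

```java
import java.util.*;
import java.util.stream.*;

// Placeholder for the real class; the actual constructor would parse
// the detail lines instead of just storing them.
class LogEntry {
    private final List<String> lines;

    LogEntry(Stream<String> logLines) {
        this.lines = logLines.collect(Collectors.toList());
    }

    @Override
    public String toString() {
        return "LogEntry(" + lines.size() + " lines)";
    }
}
```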
Here's a rough idea of how that would look:
import java.io.*;
import java.nio.charset.*;
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

import static java.lang.System.out;

class Untitled {
    static final String input =
        "Start of log entry 1\n" +
        "   ...some log details\n" +
        "   ...some log details\n" +
        "   ...some log details\n" +
        "Start of log entry 2\n" +
        "   ...some log details\n" +
        "   ...some log details\n" +
        "   ...some log details\n" +
        "Start of log entry 3\n" +
        "   ...some log details\n" +
        "   ...some log details\n" +
        "   ...some log details";

    static final Predicate<String> isLogEntryStart =
        line -> line.startsWith("Start of log entry");

    public static void main(String[] args) throws Exception {
        try (ByteArrayInputStream gzipInputStream
                 = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)); // mock for fileInputStream based gzipInputStream
             InputStreamReader inputStreamReader = new InputStreamReader(gzipInputStream);
             BufferedReader reader = new BufferedReader(inputStreamReader)) {

            reader.lines()
                  .splitByPredicate(isLogEntryStart) // <--- What witchcraft should go here?
                  .map(LogEntry::new)
                  .forEach(out::println);
        }
    }
}
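The closest I've come to a sketch of that missing "witchcraft" is a custom Spliterator that buffers exactly one entry at a time, so memory use stays bounded by the size of a single entry rather than the file. All the names here (PredicateSplitter, splitByPredicate) are my own invention, each group comes out as a List<String> rather than a true Stream<String>, and I'm not sure this is idiomatic:

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

class PredicateSplitter {

    // Lazily groups lines: each group starts at a line matching isStart
    // and runs until (exclusive) the next matching line. Only one group
    // is held in memory at a time.
    static Stream<List<String>> splitByPredicate(Stream<String> lines,
                                                 Predicate<String> isStart) {
        Iterator<String> it = lines.iterator();
        Spliterator<List<String>> sp =
            new Spliterators.AbstractSpliterator<List<String>>(
                    Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {

                String pending = null; // start line already read while finishing the previous group

                @Override
                public boolean tryAdvance(Consumer<? super List<String>> action) {
                    // Use the leftover start line, or scan forward to the first one.
                    String first = pending;
                    pending = null;
                    while (first == null && it.hasNext()) {
                        String line = it.next();
                        if (isStart.test(line)) first = line;
                    }
                    if (first == null) return false; // input exhausted, no more groups

                    List<String> group = new ArrayList<>();
                    group.add(first);
                    while (it.hasNext()) {
                        String line = it.next();
                        if (isStart.test(line)) { pending = line; break; } // next group's start
                        group.add(line);
                    }
                    action.accept(group);
                    return true;
                }
            };
        return StreamSupport.stream(sp, false); // sequential, as required per file
    }
}
```

With this, the pipeline would become splitByPredicate(reader.lines(), isLogEntryStart).map(g -> new LogEntry(g.stream())), but I'd welcome a cleaner approach.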
Constraint: I have hundreds of these large files to process in parallel (but only a single sequential stream per file), which makes loading any of them entirely into memory (e.g. by storing the lines as a List<String>) infeasible.
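To make that constraint concrete, the outer loop I have in mind looks roughly like this. The directory path is hypothetical, and processFile just counts lines as a stand-in for the real parsing; the point is that each file is one sequential stream while a fixed pool runs several files at once:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;
import java.util.zip.GZIPInputStream;

class ParallelFiles {

    // One file = one sequential stream. Counting lines stands in for
    // the real per-entry parsing.
    static long processFile(Path file) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)),
                StandardCharsets.UTF_8))) {
            return reader.lines().count();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (Stream<Path> files = Files.list(Paths.get("/var/log/archive"))) { // hypothetical location
            List<Future<Long>> tasks = files
                .filter(p -> p.toString().endsWith(".gz"))
                .map(p -> pool.submit(() -> processFile(p)))
                .collect(Collectors.toList());
            for (Future<Long> t : tasks) t.get(); // surface any per-file failures
        } finally {
            pool.shutdown();
        }
    }
}
```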
Any help appreciated!