
Java 8's stream API is convenient and has gained popularity. For file I/O, I found that two APIs produce stream output: `Files.lines(path)` and `bufferedReader.lines()`.

I did not find a stream API that provides a Stream of fixed-size buffers for reading files, though.

My concern is: in the case of files with very long lines, e.g. a 4GB file with only a single line, aren't these line-based APIs very inefficient?

The line-based reader needs at least 4GB of memory to hold that line. A fixed-size buffer reader (`fileInputStream.read(byte[] b, int off, int len)`), by contrast, takes at most the buffer size of memory.
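For reference, this is the classic fixed-size buffer pattern I have in mind (a self-contained sketch; the in-memory input here is just a stand-in for a big file):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FixedBufferRead {
    // Classic fixed-size buffer loop: memory use is bounded by the buffer size,
    // no matter how long any "line" in the input is.
    static long countBytes(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) > 0) {
            total += n; // process buf[0..n) here
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000]; // stand-in for a huge single-line file
        System.out.println(countBytes(new ByteArrayInputStream(data), 4096)); // prints 10000
    }
}
```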

If the above concern is valid, is there any stream API for file I/O that is more efficient?

modeller
  • `Files.lines(path)` and `bufferedReader.lines()` are meant to read characters/strings, whereas `InputStream::read` methods are used to read bytes. I don't know where your problem is. – Flown Oct 11 '17 at 05:33
  • If input is line-based, and a Stream chain can process each line individually, how would that same data be processable in fixed-size blocks? – Andreas Oct 11 '17 at 05:43

2 Answers


If you have a 4GB text file with a single line, and you're processing it "line by line", then you've made a serious error in your programming by not understanding the data you're working with.

The line-reading methods are convenience methods for when you need to do simple work with data like CSV or other such formats, where line sizes are manageable.
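For illustration, a minimal sketch of that convenience use case (the CSV contents and file name here are made up for the demo):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LinesDemo {
    public static void main(String[] args) throws IOException {
        // Files.lines is fine when lines are short, meaningful units,
        // e.g. a small CSV file.
        Path csv = Files.createTempFile("demo", ".csv");
        Files.write(csv, "a,1\nb,2\nc,3\n".getBytes());
        try (Stream<String> lines = Files.lines(csv)) {
            // extract the first column of each row
            List<String> firstColumn = lines
                .map(line -> line.split(",")[0])
                .collect(Collectors.toList());
            System.out.println(firstColumn); // prints [a, b, c]
        }
        Files.delete(csv);
    }
}
```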

A real-life example of a 4GB text file with a single line would be an XML file without line breaks. You would use a streaming XML parser to read it, not roll your own solution that reads line by line.
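For example, a minimal StAX sketch (the `<record>` element name is hypothetical; for a real file you would pass a `Files.newInputStream(path)` instead of the in-memory reader):

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    // Counts <record> elements without ever materializing the whole document,
    // so memory stays bounded even for a single-line multi-GB file.
    static int countRecords(Reader xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int count = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(r.getLocalName())) {
                count++;
            }
        }
        r.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><record/><record/><record/></root>";
        System.out.println(countRecords(new StringReader(xml))); // prints 3
    }
}
```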

Kayaman
  • I know the line-based API isn't right for such a scenario. That's why I asked for a better API in the first place. Can you give an example of such a "streaming XML parser" with a Maven dependency? – modeller Oct 11 '17 at 17:25
  • Yeah, StAX. You can look up SAX and DOM too, and compare the differences. – Kayaman Oct 11 '17 at 17:29

Which method of delivery is appropriate depends on how you want to process the data. If your processing requires handling the data line by line, there is no way around reading it that way.

If you really want fixed-size chunks of character data, you can use the following method(s):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public static Stream<String> chunks(Path path, int chunkSize) throws IOException {
    return chunks(path, chunkSize, StandardCharsets.UTF_8);
}
public static Stream<String> chunks(Path path, int chunkSize, Charset cs)
throws IOException {
    Objects.requireNonNull(path);
    Objects.requireNonNull(cs);
    if(chunkSize <= 0) throw new IllegalArgumentException();

    CharBuffer cb = CharBuffer.allocate(chunkSize);
    BufferedReader r = Files.newBufferedReader(path, cs);
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<String>(
            Files.size(path)/chunkSize, Spliterator.ORDERED|Spliterator.NONNULL) {
            @Override public boolean tryAdvance(Consumer<? super String> action) {
                // fill the buffer completely, unless the end of input is reached
                try { do {} while(cb.hasRemaining() && r.read(cb) > 0); }
                catch(IOException ex) { throw new UncheckedIOException(ex); }
                if(cb.position() == 0) return false;
                action.accept(cb.flip().toString());
                cb.clear(); // reset the buffer for the next chunk
                return true;
            }
    }, false).onClose(() -> {
        try { r.close(); } catch(IOException ex) { throw new UncheckedIOException(ex); }
    });
}

but I wouldn’t be surprised if your next question were “how can I merge adjacent stream elements?”, as these fixed-size chunks are rarely the natural unit of data for your actual task.

More often than not, the subsequent step is to perform pattern matching within the contents, and in that case it’s better to use Scanner in the first place, which is capable of performing pattern matching while streaming the data. It can do this efficiently because the regex engine reports whether buffering more data could change the outcome of a match operation (see hitEnd() and requireEnd()). Unfortunately, generating a stream of matches from a Scanner was only added in Java 9, but see this answer for a back-port of that feature to Java 8.
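As a Java 8 sketch of that Scanner approach (the pattern and input are made up for the demo; for a real file you would construct the Scanner directly from the Path):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerDemo {
    // Scanner buffers only as much input as the regex engine needs,
    // so memory stays bounded even on a single-line input.
    static List<String> findAllMatches(Readable source, Pattern p) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(source)) {
            String match;
            // horizon 0 = search until the end of input
            while ((match = sc.findWithinHorizon(p, 0)) != null) {
                result.add(match);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "id=12 id=345 id=6"; // stand-in for a huge file
        System.out.println(findAllMatches(new StringReader(input),
                                          Pattern.compile("id=\\d+")));
        // prints [id=12, id=345, id=6]
    }
}
```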

Holger
  • Thanks. I was looking for a starter guide on writing my own stream-generating API. I think this answer is the right starting point. I haven't gotten to the stage of merging adjacent elements yet, but it's good to have a signpost for where to look when it comes into the picture. – modeller Oct 18 '17 at 20:37