
I have a 1 GB ZIP file containing about 2,000 text files. I want to read all the files and all their lines as fast as possible.

    try (ZipFile zipFile = new ZipFile("file.zip")) {
        zipFile.stream().parallel().forEach(entry -> readAllLines(entry)); // reading with BufferedReader.readLine()
    }

Result: `stream().parallel()` is about 30-50% faster than a sequential stream.

Question: could I improve performance even further by not using the parallel stream API, but instead firing my own threads explicitly to read from the file?

membersound
  • Hard to believe. ZIP files are inherently sequential. Threading can't help there. The biggest performance improvement you will get will be via buffering. – user207421 Apr 23 '15 at 10:26
    Afaik, the biggest drawback is that the reference implementation uses a piece of native code that doesn't work multi-threaded. Once the input is buffered, i.e. it's not the I/O that dominates, this implementation detail will limit the benefit of parallel execution. – Holger Apr 23 '15 at 13:49

1 Answer


Maybe. Keep in mind that switching threads is somewhat expensive and that Java 8's parallel() is already pretty good.

Decompressing ZIP streams is CPU-intensive, so simply adding more threads won't make things faster. If you create your own executor service where you carefully balance the number of threads against the number of cores, you might be able to find a better sweet spot than Java 8's parallel().
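A minimal sketch of such a fixed-size pool, sized to the core count (the class and the line-counting `readAllLines` helper are illustrative, not from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipReader {
    // Total lines read across all entries; a cheap way to observe the work done.
    static final LongAdder lineCount = new LongAdder();

    public static void readZip(String path) throws Exception {
        // One thread per core: decompression is CPU-bound, so more threads than cores won't help.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (ZipFile zipFile = new ZipFile(path)) {
            // ZipFile.getInputStream() may be called from multiple threads,
            // but parts of the decompression are synchronized internally.
            zipFile.stream()
                   .filter(entry -> !entry.isDirectory())
                   .forEach(entry -> pool.submit(() -> readAllLines(zipFile, entry)));
            pool.shutdown();
            // Wait inside try-with-resources so the ZipFile stays open until all tasks finish.
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

    static void readAllLines(ZipFile zipFile, ZipEntry entry) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(zipFile.getInputStream(entry), StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lineCount.increment(); // real code would process the line here
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        readZip("file.zip");
        System.out.println("Lines read: " + lineCount.sum());
    }
}
```

The fixed pool size is the main tuning knob: benchmark with `availableProcessors()` and a few values around it to find the sweet spot for your hardware.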

The other lever left is a better buffering strategy for reading the file, but that's not easy for ZIP archives. You can try ZipInputStream instead of ZipFile, but it's not straightforward to mix the old InputStream API with Java 8's Stream API ((de)compressing files using NIO).
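For comparison, a minimal sketch of the ZipInputStream approach (my own illustrative class): it makes a single buffered pass over the archive, which is cache-friendly but inherently sequential, so it cannot be parallelized per entry the way ZipFile can.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class SequentialZipReader {
    public static long countLines(String path) throws IOException {
        long lines = 0;
        try (ZipInputStream zip = new ZipInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // Do NOT close this reader: that would close the underlying ZipInputStream.
                // ZipInputStream.read() reports end-of-stream at the end of the current
                // entry, so the reader stops exactly at the entry boundary.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
                while (reader.readLine() != null) {
                    lines++; // process the line here
                }
            }
        }
        return lines;
    }
}
```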

Aaron Digulla