
I am reading 7z and zip files in Scala. The way I am doing it is by reading the bytes of the file, as follows:

import java.io.{ByteArrayOutputStream, FileInputStream}
import java.nio.file.Paths
import java.util.zip.{ZipEntry, ZipInputStream}
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

// `file` is the path to the archive (a String)
val zipInputStream = new ZipInputStream(new FileInputStream(file))
val arrayBufferValues = ArrayBuffer[String]()
val buffer = new Array[Byte](1024)
var readData: Int = 0
var entry: ZipEntry = null
val parseFilesMap = mutable.Map[String, ArrayBuffer[String]]()

while ({entry = zipInputStream.getNextEntry; entry != null}) {
  val content7zStream = new ByteArrayOutputStream()
  while ({readData = zipInputStream.read(buffer); readData != -1}) {
    content7zStream.write(buffer, 0, readData)
    // appends a snapshot of everything read so far, on every read
    arrayBufferValues += content7zStream.toString("UTF-8")
    println(arrayBufferValues.mkString)
  }
  println("Done with processing file ====>>>>> " + Paths.get(file).getFileName + " ---- " + entry.getName)
  // store a copy, since the buffer is cleared right after
  parseFilesMap.put(Paths.get(file).getFileName.toString + "^" + entry.getName, arrayBufferValues.clone())
  arrayBufferValues.clear()
  content7zStream.close()
}

However, I am seeing a lot of performance issues when there are multiple CSV files (say, about 20 MB) inside the 7z file.

It takes hours to process and still doesn't seem to complete. Sometimes I receive an OutOfMemoryError.

Is there a better way to do it, or am I missing something here?

Thanks!

Nick

1 Answer


Here are some observations; I hope they help:

  // definitely a performance killer
  // try logging something shorter, or comment it out if it's not really needed
  println(arrayBufferValues.mkString)
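
For example, a minimal sketch of shorter progress output, reusing `readData` and `entry` from your loop (the message itself is just an assumption about what you want to see):

  // log the byte count instead of the whole accumulated content
  println(s"read $readData bytes from ${entry.getName}")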

20 MB of zipped data can expand into quite a lot once decompressed, and you put all of it into memory, i.e. into arrayBufferValues. Worse, you append a full snapshot of content7zStream on every 1 KB read, so the same content gets stored over and over.
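
If you mainly need the text of each entry, here is a minimal sketch of a leaner loop (assuming plain zip entries with UTF-8 content; `file` is the archive path from your question, and the parser call is a placeholder):

  import java.io.{ByteArrayOutputStream, FileInputStream}
  import java.util.zip.ZipInputStream

  val zis = new ZipInputStream(new FileInputStream(file))
  val buffer = new Array[Byte](8192)
  var entry = zis.getNextEntry
  while (entry != null) {
    // collect the raw bytes of this entry once
    val out = new ByteArrayOutputStream()
    var n = zis.read(buffer)
    while (n != -1) {
      out.write(buffer, 0, n)
      n = zis.read(buffer)
    }
    // decode once per entry instead of once per 1 KB read
    val content = out.toString("UTF-8")
    // ... hand `content` to your parser here ...
    entry = zis.getNextEntry
  }
  zis.close()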

I just created a smallish example (don't do this in any sort of production code):

  import scala.collection.mutable.ArrayBuffer
  import scala.util.Random

  // keep appending 1 KB strings until the heap is exhausted
  val arrayBufferValues = ArrayBuffer[String]()

  val start = System.currentTimeMillis()

  while (true) {
    try {
      arrayBufferValues += Random.nextString(1024)
    }
    catch {
      case e: OutOfMemoryError =>
        // report how long it took to run out of memory
        println(s"${System.currentTimeMillis() - start}ms")
        System.exit(0)
    }
  }

With this approach, on my machine with its specific settings, it took about 160 seconds to cause an OutOfMemoryError. My assumption is that your process picks up some very large files, so giving it additional memory might enable it to finish the processing.

I played around a bit with the provided example.

Then I tweaked the runtime by using the following answer: https://stackoverflow.com/a/2294280/7413631

Here are a few test results (on my local machine):

-Xmx200m  => 11090ms
-Xmx300m  => 15295ms
-Xmx1024m => 54221ms
....

Basically it's logical: the more memory you give the process, the longer it takes to run out of it. Which sounds a lot like your symptoms.

My advice would be to give your process more memory if you want to keep processing the data the way you do now.
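
For example, a hedged sketch, assuming you launch the program with plain `java` (the heap size, jar name and main class are placeholders for your own setup):

  java -Xmx2g -cp myapp.jar com.example.ZipReader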

And don't println as much, and drop the mkString where it's not needed; they kill your performance.

Marko Švaljek