
I am using Scala. I need to read a large gzip file, remove its first line, and turn the rest into a single string. This is how I open the file:

val fis = new FileInputStream(filename)
val gz  = new GZIPInputStream(fis)

Then I tried `Source.fromInputStream(gz).getLines.drop(1).mkString("")`, but it causes an out-of-memory error.

So I am thinking of reading the file line by line, perhaps accumulating it in a byte array, and converting that into a single String at the end.

But I have no idea how to do this. Any suggestions? Better approaches are also welcome.

Algorithman
  • Look into memory mapped IO. Also, a StringBuffer could probably help. – erip Nov 04 '17 at 22:02
  • The OOM you are getting is because the contents of the file do not fit in memory. It does not matter whether you read it into an array, a list, or any other container. You either need more memory or need to find a way to do what you need without holding the entire content in memory. – Dima Nov 04 '17 at 22:18
  • How big is the file when it is `gunzip`ed? – dkim Nov 04 '17 at 22:52
  • @dkim around 250MB – Algorithman Nov 04 '17 at 22:56
  • It might be helpful to check the maximum JVM heap size and, if necessary, increase it. Refer to 1) [How is the default java heap size determined?](https://stackoverflow.com/a/13871564/234658) and 2) [Increase JVM heap size for Scala?](https://stackoverflow.com/q/1441373/234658). – dkim Nov 04 '17 at 23:18
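
Following up on the heap-size suggestion above, here is a minimal sketch for checking the limit from inside the running program. The name `maxHeapMb` is only illustrative. The value is whatever the JVM reports as its maximum, which may be far larger than what is currently allocated.

```scala
// Ask the running JVM for its maximum heap size (in bytes) and convert to MB.
// `maxHeapMb` is an illustrative name, not something from the question.
val maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)
println(s"Max heap: $maxHeapMb MB")
```

If the reported limit is close to or below the uncompressed file size, raising it with the JVM's `-Xmx` option (for example `-Xmx2g`) is probably the first thing to try.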

1 Answer


If your gzipped file is huge, you can use a `BufferedReader`. The example below copies all characters from the gzipped file to an uncompressed output file, skipping the first line.

import java.util.zip.GZIPInputStream
import java.io._
import java.nio.charset.StandardCharsets

import scala.annotation.tailrec
import scala.util.Try

val bufferSize = 4096
val pathToGzFile = "/tmp/text.txt.gz"
val pathToOutputFile = "/tmp/text_without_first_line.txt"
val charset = StandardCharsets.UTF_8

val inStream = new FileInputStream(pathToGzFile)
val outStream = new FileOutputStream(pathToOutputFile)

try {
  val inGzipStream = new GZIPInputStream(inStream)
  val inReader = new InputStreamReader(inGzipStream, charset)
  val outWriter = new OutputStreamWriter(outStream, charset)
  val bufferedReader = new BufferedReader(inReader)

  val closeables =  Array[Closeable](inGzipStream, inReader, 
    outWriter, bufferedReader)
  // Read first line, so copy method will not get this - it will be skipped
  val firstLine = bufferedReader.readLine()
  println(s"First line: $firstLine")

  @tailrec
  def copy(in: Reader, out: Writer, buffer: Array[Char]): Unit = {
    // Copy while it's not end of file
    val readChars = in.read(buffer, 0, buffer.length)
    if (readChars > 0) {
      out.write(buffer, 0, readChars)
      copy(in, out, buffer)
    }
  }

  // Copy chars from bufferedReader to outWriter using buffer
  copy(bufferedReader, outWriter, Array.ofDim[Char](bufferSize))

  // Close all closeables
  closeables.foreach(c => Try(c.close()))
}
finally {
  Try(inStream.close())
  Try(outStream.close())
}
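
If you really do need the result as a single in-memory `String`, as the question asks, the same buffered-read idea can fill a `StringBuilder` instead of writing to a file. This is only a sketch: `readGzipWithoutFirstLine` is a hypothetical helper name, UTF-8 is assumed, and the whole text still has to fit in the heap (the builder and the final `String` briefly coexist), so the heap may need to be enlarged for a file this size.

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

// Hypothetical helper: reads a gzipped text file into one String,
// dropping the first line. Assumes the content is UTF-8.
def readGzipWithoutFirstLine(path: String, bufferSize: Int = 4096): String = {
  val reader = new BufferedReader(
    new InputStreamReader(
      new GZIPInputStream(new FileInputStream(path)),
      StandardCharsets.UTF_8))
  try {
    // Consume and discard the first line
    reader.readLine()

    // Append the rest chunk by chunk; read returns -1 at end of stream
    val sb = new java.lang.StringBuilder
    val buffer = new Array[Char](bufferSize)
    var n = reader.read(buffer, 0, buffer.length)
    while (n != -1) {
      sb.append(buffer, 0, n)
      n = reader.read(buffer, 0, buffer.length)
    }
    sb.toString
  } finally {
    // Closing the outermost reader also closes the wrapped streams
    reader.close()
  }
}
```

Unlike `getLines.mkString("")`, this keeps the line breaks of the remaining lines. Usage would look like `val text = readGzipWithoutFirstLine("/tmp/text.txt.gz")`.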
Artavazd Balayan