Default compression support
@samthebest's answer is correct if you are using a compression format that Spark (through Hadoop) supports out of the box, such as gzip or bzip2.
I have explained this topic in more depth in my other answer: https://stackoverflow.com/a/45958182/1549135
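As a minimal sketch of the built-in case: sc.textFile lets Hadoop pick the codec from the file extension, so a gzip or bzip2 file is read like any other text file. The object name, the local master setting, and the path data/events.log.gz are assumptions made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ReadGzipSketch {
  // Reads a (possibly compressed) text file; the codec is chosen from the file extension.
  def readLines(spark: SparkSession, path: String): Array[String] =
    spark.sparkContext.textFile(path).collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-gzip-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    // Hypothetical path; replace it with your own .gz or .bz2 file.
    readLines(spark, "data/events.log.gz").take(5).foreach(println)
    spark.stop()
  }
}
```

Keep in mind that gzip is not splittable, so a single .gz file ends up in one partition.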
Reading zip
However, if you are trying to read a zip file, you need a custom solution. One is described in the answer I linked above.
If you need to read multiple files from your archive, you might be interested in the answer I have provided: https://stackoverflow.com/a/45958458/1549135
In every case the approach is the same: read the archives with sc.binaryFiles, then decompress each PortableDataStream, as in this sample:
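    import java.io.{BufferedReader, InputStreamReader}
    import java.util.zip.ZipInputStream
    import org.apache.spark.input.PortableDataStream

    sc.binaryFiles(path, minPartitions)
      .flatMap { case (name: String, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open)
        // Walk the archive entries until getNextEntry returns null
        Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap { _ =>
            // Read the current entry line by line
            val br = new BufferedReader(new InputStreamReader(zis))
            Stream.continually(br.readLine()).takeWhile(_ != null)
          }
      }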
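
The zip-reading logic can also be factored into a plain function that does not depend on Spark, which makes it easy to test locally. The sketch below is one possible shape, not part of any library: the ZipLines object and its read method are hypothetical names, and UTF-8 text entries are assumed. It also keeps the entry name next to each line, which helps when the archive contains several files.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream

object ZipLines {
  // Hypothetical helper: yields (entryName, line) for every line of every file entry.
  def read(in: InputStream): Iterator[(String, String)] = {
    val zis = new ZipInputStream(in)
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)      // getNextEntry returns null after the last entry
      .filterNot(_.isDirectory)  // skip directory entries, they carry no lines
      .flatMap { entry =>
        // A fresh reader per entry: ZipInputStream signals end-of-stream at each entry boundary.
        // Assumption: entries are UTF-8 text.
        val br = new BufferedReader(new InputStreamReader(zis, StandardCharsets.UTF_8))
        Iterator.continually(br.readLine())
          .takeWhile(_ != null)
          .map(line => (entry.getName, line))
      }
  }
}
```

In a Spark job this could be plugged in as sc.binaryFiles(path).flatMap { case (_, content) => ZipLines.read(content.open) }. The sketch does not close the stream; add explicit cleanup if that matters for your job.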