
I have a file AA.zip that itself contains multiple files, for example aa.tar.gz, bb.tar.gz, etc.

I need to read these files in Spark using Scala. How can I achieve that?

The only problem here is extracting the contents of the zip file.

  • Possible duplicate of [Read whole text files from a compression in Spark](https://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) – 10465355 Nov 20 '18 at 11:35
  • No, that question is about a directory containing compressed files, but here I have a single file in zip format that in turn contains files in .tar.gz format. – sheetal kaur Nov 20 '18 at 11:37

1 Answer


ZIPs on HDFS are going to be a bit tricky because they don't split well, so each zip file has to be processed as a whole, and you'll end up handling one or more complete zip files per executor. This is also one of the few cases where you probably have to fall back to SparkContext, because binary file support in the higher-level Spark APIs is not that good.

https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext

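There's a `binaryFiles` method there, which returns an RDD of (path, `PortableDataStream`) pairs. Calling `open()` on the stream gives you access to the raw zip data, which you can then process with the usual ZIP handling from Java or Scala (for example `java.util.zip.ZipInputStream`).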
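As a rough sketch of that approach for the zip-of-tar.gz case in the question: the helper object `NestedArchives` and its method names below are made up for illustration, and the code assumes the inner files are UTF-8 text small enough to hold in memory. Tar parsing is not in the JDK, so this leans on `TarArchiveInputStream` from Apache Commons Compress, which I'm assuming is on your classpath (`getNextTarEntry` is deprecated in favour of `getNextEntry` in recent releases).

```scala
import java.io.InputStream
import java.util.zip.{GZIPInputStream, ZipInputStream}

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

import scala.io.Source

// Hypothetical helper; the names are illustrative and not part of Spark.
object NestedArchives {

  // Walks a zip stream, opens every *.tar.gz entry inside it, and returns
  // ("<zip entry>/<tar entry>", file text) pairs.
  // Assumption: inner files are UTF-8 text and fit in memory.
  def extract(in: InputStream): Vector[(String, String)] = {
    val zip = new ZipInputStream(in)
    try {
      Iterator.continually(zip.getNextEntry)
        .takeWhile(_ != null)
        .filter(e => !e.isDirectory && e.getName.endsWith(".tar.gz"))
        .flatMap { zipEntry =>
          // Deliberately not closed: closing it would also close the outer zip stream.
          val tar = new TarArchiveInputStream(new GZIPInputStream(zip))
          Iterator.continually(tar.getNextTarEntry)
            .takeWhile(_ != null)
            .filter(_.isFile)
            .map { tarEntry =>
              // Reads only the current tar entry; the stream reports EOF at its end.
              val text = Source.fromInputStream(tar, "UTF-8").mkString
              (s"${zipEntry.getName}/${tarEntry.getName}", text)
            }
        }
        .toVector // force evaluation before the zip stream is closed
    } finally zip.close()
  }

  // One task handles each whole zip file, since zips are not splittable.
  def read(sc: SparkContext, path: String): RDD[(String, String)] =
    sc.binaryFiles(path).flatMap { case (_, stream) => extract(stream.open()) }
}
```

You would call it with something like `NestedArchives.read(sc, "hdfs:///path/to/AA.zip")` (placeholder path). Because each zip is read by a single task, a very large archive will not parallelize; if that matters, unpacking the zip once up front and pointing Spark at the extracted .tar.gz files may work better.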

Dominic Egger