I have a file AA.zip which in turn contains multiple files, e.g. aa.tar.gz, bb.tar.gz, etc.
I need to read these files in Spark with Scala. How can I achieve that?
The only problem here is extracting the contents of the zip file.
ZIPs on HDFS are going to be a bit tricky because they aren't splittable: a single zip can't be divided across tasks, so each executor has to process one or more whole zip files. This is also one of the few cases where you probably have to fall back to SparkContext, because binary file support in the higher-level Spark APIs is not that good.
https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext
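There's a binaryFiles method there, which returns each file as a (path, PortableDataStream) pair. That gives you access to the zip's binary data, which you can then process with the usual ZIP handling from Java or Scala (e.g. java.util.zip.ZipInputStream).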
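
Roughly, the approach above could look like the sketch below. It is untested against a real cluster and rests on a few assumptions: the names ZipOnSpark, readZipEntries and zipEntries are made up for illustration, the path you pass in is whatever your HDFS location is, and every entry is assumed to fit in executor memory.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.util.zip.ZipInputStream

import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are illustrative, not part of Spark.
object ZipOnSpark {

  // Read every non-directory entry of a ZIP stream into memory as (entryName, bytes).
  // Assumption: each entry is small enough to be held in memory.
  def readZipEntries(in: InputStream): List[(String, Array[Byte])] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator
        .continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .filterNot(_.isDirectory)
        .map { entry =>
          val out = new ByteArrayOutputStream()
          val buf = new Array[Byte](8192)
          // read() returns -1 once the current entry is exhausted
          Iterator
            .continually(zis.read(buf))
            .takeWhile(_ != -1)
            .foreach(n => out.write(buf, 0, n))
          entry.getName -> out.toByteArray
        }
        .toList // force evaluation before the stream is closed
    } finally zis.close()
  }

  // One record per file inside each zip matched by `path`.
  // Each zip is read whole by a single task, since zips are not splittable.
  def zipEntries(sc: SparkContext, path: String): RDD[(String, Array[Byte])] =
    sc.binaryFiles(path).flatMap { case (_, stream: PortableDataStream) =>
      readZipEntries(stream.open())
    }
}
```

With a SparkContext sc you would call something like ZipOnSpark.zipEntries(sc, "hdfs:///some/dir/AA.zip") (placeholder path) and get back pairs such as ("aa.tar.gz", bytes). The inner .tar.gz files still need unpacking: java.util.zip.GZIPInputStream handles the gzip layer, but the JDK has no tar reader, so you would need a library for that part (Apache Commons Compress is one option).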