
I have seen similar questions for Java/Scala, but how do I import files compressed in zip/gzip/tar format in PySpark, without actually decompressing them first?

I would like to hear suggestions on 1) how to get a list of the files inside one compressed archive, and 2) how to read each one into a Spark DataFrame using PySpark. The output I am looking for is a list of filename:dataframe pairs, where each DataFrame holds the content of the corresponding file; the sketch below shows the kind of thing I have in mind.
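
To make the request concrete, here is a minimal sketch for the zip case, assuming the archive members are CSV files; `/data/archive.zip` is a placeholder path, and gzip/tar would presumably need `gzip`/`tarfile` in place of `zipfile`:

```python
import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (path, bytes) pairs without decompressing to disk.
archive_rdd = sc.binaryFiles("/data/archive.zip")  # placeholder path

def members(path_and_bytes):
    """Yield (member_name, text_content) for every file in the archive."""
    _, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            yield name, zf.read(name).decode("utf-8")

# 1) member filenames and their contents (collected to the driver, so
#    this only works while the archive fits in driver memory)
named_contents = archive_rdd.flatMap(members).collect()

# 2) one DataFrame per member, assuming each member is a CSV file;
#    spark.read.csv also accepts an RDD of strings as input
dataframes = {
    name: spark.read.csv(sc.parallelize(text.splitlines()),
                         header=True, inferSchema=True)
    for name, text in named_contents
}
```

This works for small archives but pulls everything through the driver, which is exactly the part I would like to avoid.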

Thanks!

Luke
  • Possible duplicate of [Read whole text files from a compression in Spark](https://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) – Oliver W. Apr 18 '19 at 20:21
  • Why do you not want to extract the archive? Is there a strong reason not to do so? – Oliver W. Apr 18 '19 at 20:23
  • @OliverW. The compressed archive could contain a large number of files that are not needed at all. Can Catalyst optimize the process in this case? – Luke Apr 19 '19 at 15:36
  • @OliverW. I would like to see an array of filename:dataframe pairs as the output; I don't see how to get that from the question you linked. – Luke Apr 19 '19 at 15:40

0 Answers