
I have seen similar questions for Java/Scala, but how do I import files compressed in zip/gzip/tar format in PySpark, without actually decompressing them first?

I would like to hear suggestions on 1) how to get a list of the files inside one compressed archive, and 2) how to read each one into a Spark DataFrame using PySpark. The output I am looking for is a list of filename:dataframe pairs, where each DataFrame holds the content of the corresponding file; the sketch below shows the kind of thing I have in mind.
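
To make the request concrete, here is a minimal sketch for the zip case, assuming the archive members are CSV files; `/data/archive.zip` is a placeholder path, and gzip/tar would presumably need `gzip`/`tarfile` in place of `zipfile`:

```python
import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (path, bytes) pairs without decompressing to disk.
archive_rdd = sc.binaryFiles("/data/archive.zip")  # placeholder path

def members(path_and_bytes):
    """Yield (member_name, text_content) for every file in the archive."""
    _, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            yield name, zf.read(name).decode("utf-8")

# 1) member filenames and their contents (collected to the driver, so
#    this only works while the archive fits in driver memory)
named_contents = archive_rdd.flatMap(members).collect()

# 2) one DataFrame per member, assuming each member is a CSV file;
#    spark.read.csv also accepts an RDD of strings as input
dataframes = {
    name: spark.read.csv(sc.parallelize(text.splitlines()),
                         header=True, inferSchema=True)
    for name, text in named_contents
}
```

This works for small archives but pulls everything through the driver, which is exactly the part I would like to avoid.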

Thanks!

Luke
  • Possible duplicate of [Read whole text files from a compression in Spark](https://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) – Oliver W. Apr 18 '19 at 20:21
  • Why do you not want to extract the archive? Is there a strong reason not to do so? – Oliver W. Apr 18 '19 at 20:23
  • @OliverW. The compressed archive could contain a large number of files that are not needed at all. Can Catalyst optimize the process in this case? – Luke Apr 19 '19 at 15:36
  • @OliverW. I would like to see an array of filename:dataframe pairs as the output; I don't see how to get that from the question you linked. – Luke Apr 19 '19 at 15:40

0 Answers