I'm supposed to read thousands of *.csv files from S3 using Spark. The files are gzip-compressed but carry a plain .csv extension; the compression is only indicated by a Content-Encoding: gzip header in their S3 object metadata. Normally I do:

sqlContext.read.csv("s3a://bucket/file.csv")

But that doesn't work in this case because the files are compressed. If I could change the files' extension it would work (but I have no control over that):

sqlContext.read.csv("s3a://bucket/file.csv.gz")

I'm aware of the technique of registering additional file extensions as compressed formats, but registering .csv as a compressed extension would break reading ordinary, uncompressed CSV files. Is there any way to force Spark to decompress these CSV files without registering .csv as a compressed format?
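To make that concrete, such a registration would look roughly like the sketch below (the class name CsvGzipCodec is hypothetical). Once registered, it would apply to every .csv file, compressed or not, which is exactly the problem:

import org.apache.hadoop.io.compress.GzipCodec

// Hypothetical codec that behaves exactly like GzipCodec but claims the
// .csv extension, so Hadoop's CompressionCodecFactory would pick it for
// every file named *.csv and gunzip it on read.
class CsvGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".csv"
}

// It would be registered via the Hadoop codec list, e.g.
//   spark.hadoop.io.compression.codecs=com.example.CsvGzipCodec
// after which plain, uncompressed .csv files can no longer be read.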

Amin
  • perhaps decompress as per https://stackoverflow.com/questions/17436549/uncompress-and-read-gzip-file-in-scala then `parallelize` (a sketch of this approach follows below)? – joel Jul 31 '18 at 00:25
  • or to rename the files https://stackoverflow.com/questions/30614443/how-do-i-rename-a-file-in-scala – joel Jul 31 '18 at 00:31
  • @JoelBerkeley I have no control over the names of the files. Regarding decompressing in Scala: I have thousands of files to read, and it would take too much effort to make sure reading and decompressing them doesn't exhaust cluster resources. – Amin Jul 31 '18 at 14:34
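
For completeness, here is a minimal sketch of the decompress-then-parallelize route suggested in the first comment, assuming Spark 2.2+ (for csv on a Dataset[String]) and an illustrative bucket path. Whether it scales to thousands of files is exactly the concern raised in the last comment:

import java.util.zip.GZIPInputStream
import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read each S3 object as a raw byte stream, gunzip it on the executor,
// and emit the decompressed lines. "s3a://bucket/*.csv" is illustrative.
val lines = spark.sparkContext
  .binaryFiles("s3a://bucket/*.csv")
  .flatMap { case (_, stream) =>
    val in = new GZIPInputStream(stream.open())
    try Source.fromInputStream(in).getLines().toList
    finally in.close()
  }

// Parse the decompressed lines as CSV (Spark 2.2+).
val df = spark.read.csv(lines.toDS())

Note that each file is fully materialized in executor memory before parsing, which is the resource concern mentioned above.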

0 Answers