I need to read thousands of *.csv files from S3 using Spark. These files have Content-Encoding: gzip set in their S3 object metadata. Normally I do:
sqlContext.read.csv("s3a://bucket/file.csv")
But that doesn't work in this case because the files are gzip-compressed despite the .csv extension. If I could change the files' extension this would work (but I have no control over their names):
sqlContext.read.csv("s3a://bucket/file.csv.gz")
I'm aware of the method of registering a file extension as a compressed format (sketched below), but registering .csv as a compressed extension is problematic for normal, uncompressed CSV files. Is there any way to force Spark to decompress these CSV files without registering .csv as a compressed format?
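For reference, the registration approach I mean looks roughly like this. This is only a sketch: the class name CsvGzipCodec and the package com.example are placeholders, and the compiled class would have to be on the classpath of the driver and executors.

import org.apache.hadoop.io.compress.GzipCodec

// A codec identical to gzip except that it claims the ".csv" extension,
// so Hadoop's CompressionCodecFactory treats every *.csv file as gzip.
class CsvGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".csv"
}

Registered through Hadoop's io.compression.codecs setting, it makes the original read work:

sc.hadoopConfiguration.set("io.compression.codecs", "com.example.CsvGzipCodec")
sqlContext.read.csv("s3a://bucket/file.csv")  // now decompressed as gzip

The catch is that this mapping is global for the whole job, which is exactly the problem described above.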