I'm trying to read a large number of gzipped CSV files from S3 with PySpark. Normally textFile and spark-csv auto-decompress gzip input, but the files I'm working with lack the .gz extension, so Spark doesn't decompress them and I get the raw compressed bytes instead of CSV rows. There are millions of files, they're owned by another team, and they're updated multiple times a day.
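For context, the content itself is fine: a gzip stream is self-describing (it starts with the magic bytes `1f 8b`), and it's only the codec selection that keys off the file extension. A minimal stdlib sketch of what I mean, with a made-up payload standing in for one S3 object:

```python
import gzip

# Stand-in for the bytes of one S3 object whose key has no .gz suffix.
payload = gzip.compress(b"a,b,c\n1,2,3\n")

# The stream identifies itself by its magic bytes, not its filename...
assert payload[:2] == b"\x1f\x8b"

# ...so it decompresses fine regardless of extension.
rows = gzip.decompress(payload).decode("utf-8").splitlines()
print(rows)  # ['a,b,c', '1,2,3']
```

So decompressing by hand is possible; the problem is getting Spark to do it at scale without the extension hint.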
Is there a way to explicitly tell textFile or the spark-csv API which compression codec to use? Or is there any other way around copying and renaming the files?