I have data stored in S3 as UTF-8-encoded JSON files, compressed with either Snappy or LZ4.
I'd like to use Spark to read/process this data, but Spark seems to require a filename suffix (`.lz4`, `.snappy`) to determine the compression scheme.
The issue is that I have no control over how the files are named - they will not be written with this suffix. It is also too expensive to rename all such files to include such a suffix.
Is there any way for Spark to read these JSON files properly?
For Parquet-encoded files there is the `'parquet.compression' = 'snappy'` table property in the Hive Metastore, which seems to solve this problem for Parquet files. Is there something similar for text files?
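For reference, here is a sketch of how that Parquet property is set - the table name, columns, and S3 path are hypothetical placeholders:

```sql
-- Hypothetical table definition; parquet.compression tells Hive/Spark
-- the codec regardless of the file suffix
CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```

I'm looking for an equivalent property (or any other mechanism) that would do the same for JSON/text files.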