I have data stored in S3 as UTF-8-encoded JSON files, compressed with either Snappy or LZ4.
I'd like to use Spark to read/process this data, but Spark seems to require a filename suffix (`.lz4`, `.snappy`) to determine the compression scheme.
The issue is that I have no control over how the files are named - they will not be written with this suffix. It is also too expensive to rename all such files to include such a suffix.
Is there any way for Spark to read these JSON files properly?
For Parquet-encoded files there is the `'parquet.compression' = 'snappy'` table property in the Hive Metastore, which seems to solve this problem for Parquet files. Is there something similar for text files?
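For reference, here is a sketch of how that Parquet property is set - the table name, columns, and S3 path are hypothetical placeholders:

```sql
-- Hypothetical table definition; parquet.compression tells Hive/Spark
-- the codec regardless of the file suffix
CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```

I'm looking for an equivalent property (or any other mechanism) that would do the same for JSON/text files.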