
I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?

Details: the file is a CSV with tab delimiters.

Brad Hein
prady
  • Possible dupe of many on SO. Some are: [this](https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark) and [this](https://stackoverflow.com/questions/32080475/how-to-read-a-zip-containing-multiple-files-in-apache-spark) – sujit Mar 26 '18 at 12:45
  • 2
    `spark.read.csv` works with gzip files – philantrovert Mar 26 '18 at 12:54

1 Answer


Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration to take into account is that a gz file is not splittable, so Spark needs to read the whole file on a single core, which slows things down. After the read is done, the data can be repartitioned to increase parallelism, as in the sketch below.
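For example, a minimal sketch of this pattern (the file name and the partition count of 8 are assumptions for illustration):

val df = spark.read.option("sep", "\t").csv("file.csv.gz") // single-core read: gzip is not splittable
val repartitioned = df.repartition(8)                      // shuffle into 8 partitions so later stages run in parallel

The repartition triggers a shuffle, so it only pays off when the downstream work is heavy enough to amortize that cost.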

Shaido
  • Thanks, I did read the file directly using the read csv option, and I could observe the slowness. Is it best practice to read the whole file using a single core? – prady Mar 27 '18 at 05:04
  • @prady Due to the file being a `gzip` it must be read using a single core. A work-around would be to first unzip the file and then use Spark to read the data. Or you could change the compression type; refer to this question: https://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2 – Shaido Mar 27 '18 at 05:12
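One way to change the compression type is to pay the single-core read once and re-write the data with a splittable codec. A minimal sketch, assuming bzip2 as the target codec and illustrative file names:

val df = spark.read.option("sep", "\t").csv("file.csv.gz")                   // one-time single-core read of the gzip
df.write.option("sep", "\t").option("compression", "bzip2").csv("file_bz2")  // re-write with a splittable codec

Subsequent jobs reading the bzip2 output can then split the input across multiple tasks.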
  • Thanks for the reference – prady Mar 27 '18 at 05:43
  • Can someone tell me how to read a csv.bz2 into a dataframe? – Sithija Piyuman Thewa Hettige Mar 11 '21 at 07:41
  • @SithijaPiyumanThewaHettige: The same method as in this answer should apply, i.e.: `spark.read.csv("file.csv.bz2")` (you could try `spark.read.textFile` as well). – Shaido Mar 11 '21 at 08:30
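A minimal sketch for the bz2 case, assuming the same tab-delimited layout as the original question:

val df = spark.read.option("sep", "\t").csv("file.csv.bz2") // bzip2 is splittable, so Spark can parallelize the read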