
I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?

Details: the file is a CSV with tab delimiters.

Brad Hein
prady
  • Possible dupe of many on SO. Some are: [this](https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark) and [this](https://stackoverflow.com/questions/32080475/how-to-read-a-zip-containing-multiple-files-in-apache-spark) – sujit Mar 26 '18 at 12:45
  • 2
    `spark.read.csv` works with gzip files – philantrovert Mar 26 '18 at 12:54

1 Answer


Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration to take into account is that a gz file is not splittable, so Spark needs to read the whole file on a single core, which slows things down. After the read is done, the data can be repartitioned to increase parallelism, as in the sketch below.
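For example, a minimal sketch of this pattern (the file name and the partition count of 8 are assumptions for illustration):

val df = spark.read.option("sep", "\t").csv("file.csv.gz") // single-core read: gzip is not splittable
val repartitioned = df.repartition(8)                      // shuffle into 8 partitions so later stages run in parallel

The repartition triggers a shuffle, so it only pays off when the downstream work is heavy enough to amortize that cost.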

Shaido
  • Thanks, I did read the file directly using the read csv option, and I could observe the slowness. Is it best practice to read the whole file using a single core? – prady Mar 27 '18 at 05:04
  • @prady Due to the file being a `gzip` it must be read using a single core. A work-around would be to first unzip the file and then use Spark to read the data. Or you could change the compression type; refer to this question: https://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2 – Shaido Mar 27 '18 at 05:12
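One way to change the compression type is to pay the single-core read once and re-write the data with a splittable codec. A minimal sketch, assuming bzip2 as the target codec and illustrative file names:

val df = spark.read.option("sep", "\t").csv("file.csv.gz")                   // one-time single-core read of the gzip
df.write.option("sep", "\t").option("compression", "bzip2").csv("file_bz2")  // re-write with a splittable codec

Subsequent jobs reading the bzip2 output can then split the input across multiple tasks.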
  • Thanks for the reference – prady Mar 27 '18 at 05:43
  • Can someone tell me how to read a csv.bz2 into a dataframe? – Sithija Piyuman Thewa Hettige Mar 11 '21 at 07:41
  • @SithijaPiyumanThewaHettige: The same method as in this answer should apply, i.e.: `spark.read.csv("file.csv.bz2")` (you could try `spark.read.textFile` as well). – Shaido Mar 11 '21 at 08:30
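A minimal sketch for the bz2 case, assuming the same tab-delimited layout as the original question:

val df = spark.read.option("sep", "\t").csv("file.csv.bz2") // bzip2 is splittable, so Spark can parallelize the read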