
I have a source file compressed in deflate.gz format. While loading the data into a Spark data frame, it failed with an ArrayIndexOutOfBoundsException.

val cf = spark.read.option("header", "false").option("delimiter", "\u0001").option("codec", "deflate").csv("path/xxx.deflate.gz")
cf.show()

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 4 times, most recent failure: Lost task 0.3 in stage 15.0 (TID 871, 10.180.255.33, executor 0): java.lang.ArrayIndexOutOfBoundsException: 63

anand

1 Answer


Assuming that by "deflate gzip file" you mean a regular gzip file (gzip is based on the DEFLATE algorithm), your problem is likely the formatting of the CSV file: the rows may have an inconsistent number of fields, and you may need to change the read mode to make parsing permissive. Note also that the codec option is a write-side option; on read, Spark infers the compression from the .gz file extension, so that option is not doing anything here.
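For example, a minimal sketch of a permissive read (the path is illustrative; assumes Spark 2.x or later, where the mode option accepts PERMISSIVE, DROPMALFORMED, or FAILFAST):

// Read the gzip'd CSV, keeping malformed rows instead of failing the job.
// Compression is inferred from the .gz extension; no codec option is needed.
val df = spark.read
  .option("header", "false")
  .option("delimiter", "\u0001")
  .option("mode", "PERMISSIVE")   // or "DROPMALFORMED" to silently skip bad rows
  .csv("path/xxx.deflate.gz")
df.show()

With PERMISSIVE mode, fields that cannot be parsed are set to null rather than aborting the stage, which should get you past the ArrayIndexOutOfBoundsException long enough to inspect the bad rows.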

However, if you have some special snowflake gzip file and the file extension stays that way (not recommended), you can do things the hard way: read the data as binary files and decompress it manually. The sc.binaryFiles function is the main one to try.
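A rough sketch of that manual route, assuming the file really is gzip under the odd extension (the path is illustrative; spark.read.csv over a Dataset[String] requires Spark 2.2+):

import java.util.zip.GZIPInputStream
import scala.io.Source
import spark.implicits._

// binaryFiles yields (path, PortableDataStream) pairs; decompress each stream
// ourselves instead of relying on Spark's extension-based codec detection.
val lines = sc.binaryFiles("path/xxx.deflate.gz").flatMap { case (_, pds) =>
  val in = new GZIPInputStream(pds.open())
  // Materialize with toList so the lines are read before the stream is closed.
  Source.fromInputStream(in, "UTF-8").getLines().toList
}

val df = spark.read.option("delimiter", "\u0001").csv(lines.toDS())

If the payload turns out to be raw DEFLATE rather than gzip (no gzip header), swap GZIPInputStream for java.util.zip.InflaterInputStream.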

Relevant SO: Zip support in Apache Spark

Garren S