
I have a source file compressed in deflate.gz format. While loading the data into a Spark data frame, it failed with an ArrayIndexOutOfBoundsException.

val cf = spark.read.option("header", "false").option("delimiter", "\u0001").option("codec", "deflate").csv("path/xxx.deflate.gz")
cf.show()

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 4 times, most recent failure: Lost task 0.3 in stage 15.0 (TID 871, 10.180.255.33, executor 0): java.lang.ArrayIndexOutOfBoundsException: 63

anand

1 Answer


Assuming that by "deflate gzip file" you mean a regular gzip file (gzip is based on the DEFLATE algorithm), your problem is likely the formatting of the CSV file: the rows may have an inconsistent number of fields, and you may need to change the read mode to make parsing permissive. Note also that the codec option is a write-side option; on read, Spark infers the compression from the .gz file extension, so that option is not doing anything here.
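For example, a minimal sketch of a permissive read (the path is illustrative; assumes Spark 2.x or later, where the mode option accepts PERMISSIVE, DROPMALFORMED, or FAILFAST):

// Read the gzip'd CSV, keeping malformed rows instead of failing the job.
// Compression is inferred from the .gz extension; no codec option is needed.
val df = spark.read
  .option("header", "false")
  .option("delimiter", "\u0001")
  .option("mode", "PERMISSIVE")   // or "DROPMALFORMED" to silently skip bad rows
  .csv("path/xxx.deflate.gz")
df.show()

With PERMISSIVE mode, fields that cannot be parsed are set to null rather than aborting the stage, which should get you past the ArrayIndexOutOfBoundsException long enough to inspect the bad rows.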

However, if you have some special snowflake gzip file and the file extension stays that way (not recommended), you can do things the hard way: read the data as binary files and decompress it manually. The sc.binaryFiles function is the main one to try.
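A rough sketch of that manual route, assuming the file really is gzip under the odd extension (the path is illustrative; spark.read.csv over a Dataset[String] requires Spark 2.2+):

import java.util.zip.GZIPInputStream
import scala.io.Source
import spark.implicits._

// binaryFiles yields (path, PortableDataStream) pairs; decompress each stream
// ourselves instead of relying on Spark's extension-based codec detection.
val lines = sc.binaryFiles("path/xxx.deflate.gz").flatMap { case (_, pds) =>
  val in = new GZIPInputStream(pds.open())
  // Materialize with toList so the lines are read before the stream is closed.
  Source.fromInputStream(in, "UTF-8").getLines().toList
}

val df = spark.read.option("delimiter", "\u0001").csv(lines.toDS())

If the payload turns out to be raw DEFLATE rather than gzip (no gzip header), swap GZIPInputStream for java.util.zip.InflaterInputStream.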

Relevant SO: Zip support in Apache Spark

Garren S