
I'm reading the contents of a very large binary file using Scala Spark and writing it to a table, and I'm getting the error below. Is there any way to write in chunks?

StreamingQueryException: Query [id = ae9393bc-df18-4d2e-9d03-c8d4918684, runId = fb6bbd2c-b922-4660-bf44-932e46c2d] terminated with exception: Job aborted.
Caused by: Job aborted.
Caused by: Job aborted due to stage failure.
Caused by: FileReadException: Error while reading file /mnt/xxxx/testfile.zip.
Caused by: The length of dbfs:/mnt/xxx.zip is 2320548102, which exceeds the max length allowed: 2147483647.

My code is below.

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("recursiveFileLookup", "true")
  .option("pathGlobFilter", "*.zip")
  .schema("path string, modificationTime timestamp, length long, content binary")
  .load(mountPath("testpath"))

val write_query = df
  .select("path", "content")
  .writeStream.format("delta")
  .option("checkpointLocation", checkpoint_path)
  .start(write_path)

write_query.awaitTermination()
  • Please fix your question title; this has nothing to do with the programming language, it relates to Spark. There must be some kind of configuration for this case. – AminMal May 01 '22 at 09:42
  • Does this answer your question? [SQL query in Spark/scala Size exceeds Integer.MAX_VALUE](https://stackoverflow.com/questions/42247630/sql-query-in-spark-scala-size-exceeds-integer-max-value) – Guru Stron May 01 '22 at 12:23
  • Thanks! As per the link above, I tried the settings below and they didn't help; repartitioning didn't help either. Am I missing anything? Any suggestions would be helpful. sqlContext.setConf("spark.sql.shuffle.partitions", "300") sqlContext.setConf("spark.default.parallelism", "300") – testbg testbg May 01 '22 at 18:33

1 Answer


Use the method below to get the number of partitions in an RDD.

rdd.getNumPartitions
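
For the question's input, this would be checked on a batch read, since a streaming DataFrame does not expose .rdd. A minimal sketch, reusing the mountPath helper from the question:

// Minimal sketch: inspect the partition count of the input.
// Uses a batch read because a streaming DataFrame has no .rdd;
// mountPath is the helper from the question.
val batchDf = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.zip")
  .load(mountPath("testpath"))

println(s"Partitions: ${batchDf.rdd.getNumPartitions}")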

There are two ways to get rid of this error.

  1. Increase the number of partitions.

Try to increase the number of partitions, but keep in mind a target of roughly 128 MB per partition.

To increase the partition count, use the function below. Note that repartition takes an explicit target count (a sizing sketch follows this list).

rdd.repartition(numPartitions)

  2. Get rid of skew in your data (a salting sketch follows the reference below).
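
A minimal sizing sketch for item 1, assuming the 2320548102-byte length from the error message and an already-loaded rdd value; the ~128 MB-per-partition target determines the count passed to repartition:

// Hedged sketch: derive a partition count targeting ~128 MB each.
// totalBytes comes from the error message; rdd is assumed to exist.
val totalBytes = 2320548102L
val targetBytesPerPartition = 128L * 1024 * 1024
val numPartitions = math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt // 18 here

val repartitioned = rdd.repartition(numPartitions)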

Refer - https://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications/25
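
For item 2, one common de-skew technique is salting. A minimal sketch with hypothetical names (df and skewedKey are assumptions for illustration, not from the question):

// Hedged salting sketch: spread a hot key across more partitions
// by adding a random salt column before a wide operation.
// df and the column names here are hypothetical.
import org.apache.spark.sql.functions.{col, rand}

val saltBuckets = 32
val salted = df
  .withColumn("salt", (rand() * saltBuckets).cast("int"))
  .repartition(col("skewedKey"), col("salt"))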
