
I'm reading the contents of a very large binary file using Scala Spark and writing it to a table, and I'm getting the error below. Is there any way to write in chunks?

StreamingQueryException: Query [id = ae9393bc-df18-4d2e-9d03-c8d4918684, runId = fb6bbd2c-b922-4660-bf44-932e46c2d] terminated with exception: Job aborted.
Caused by: Job aborted.
Caused by: Job aborted due to stage failure.
Caused by: FileReadException: Error while reading file /mnt/xxxx/testfile.zip.
Caused by: The length of dbfs:/mnt/xxx.zip is 2320548102, which exceeds the max length allowed: 2147483647.

My code is below.

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("recursiveFileLookup", "true")
  .option("pathGlobFilter", "*.zip")
  .schema("path string, modificationTime timestamp, length long, content binary")
  .load(mountPath("testpath"))

val write_query = df
  .select("path", "content")
  .writeStream.format("delta")
  .option("checkpointLocation", checkpoint_path)
  .start(write_path)

write_query.awaitTermination()
  • Please fix your question title; this has nothing to do with the programming language, it relates to Spark. There must be some kind of configuration for this case. – AminMal May 01 '22 at 09:42
  • Does this answer your question? [SQL query in Spark/scala Size exceeds Integer.MAX_VALUE](https://stackoverflow.com/questions/42247630/sql-query-in-spark-scala-size-exceeds-integer-max-value) – Guru Stron May 01 '22 at 12:23
  • Thanks! As per the link above, I tried the settings below and they didn't help; repartitioning didn't help either. Am I missing anything? Any suggestions would be helpful. sqlContext.setConf("spark.sql.shuffle.partitions", "300") sqlContext.setConf("spark.default.parallelism", "300") – testbg testbg May 01 '22 at 18:33

1 Answer


Use the method below to get the number of partitions in an RDD.

rdd.getNumPartitions
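
For the question's input, this would be checked on a batch read, since a streaming DataFrame does not expose .rdd. A minimal sketch, reusing the mountPath helper from the question:

// Minimal sketch: inspect the partition count of the input.
// Uses a batch read because a streaming DataFrame has no .rdd;
// mountPath is the helper from the question.
val batchDf = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.zip")
  .load(mountPath("testpath"))

println(s"Partitions: ${batchDf.rdd.getNumPartitions}")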

There are two ways to get rid of this error.

  1. Increase the number of partitions.

Try to increase the number of partitions, but keep in mind a target of roughly 128 MB per partition.

To increase the partition count, use the function below. Note that repartition takes an explicit target count (a sizing sketch follows this list).

rdd.repartition(numPartitions)

  2. Get rid of skew in your data (a salting sketch follows the reference below).
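
A minimal sizing sketch for item 1, assuming the 2320548102-byte length from the error message and an already-loaded rdd value; the ~128 MB-per-partition target determines the count passed to repartition:

// Hedged sketch: derive a partition count targeting ~128 MB each.
// totalBytes comes from the error message; rdd is assumed to exist.
val totalBytes = 2320548102L
val targetBytesPerPartition = 128L * 1024 * 1024
val numPartitions = math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt // 18 here

val repartitioned = rdd.repartition(numPartitions)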

Refer - https://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications/25
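
For item 2, one common de-skew technique is salting. A minimal sketch with hypothetical names (df and skewedKey are assumptions for illustration, not from the question):

// Hedged salting sketch: spread a hot key across more partitions
// by adding a random salt column before a wide operation.
// df and the column names here are hypothetical.
import org.apache.spark.sql.functions.{col, rand}

val saltBuckets = 32
val salted = df
  .withColumn("salt", (rand() * saltBuckets).cast("int"))
  .repartition(col("skewedKey"), col("salt"))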
