I'm reading the contents of a very large binary file using Scala Spark and writing them to a Delta table. I'm getting the error below. Is there any way to write the file in chunks?
StreamingQueryException: Query [id = ae9393bc-df18-4d2e-9d03-c8d4918684, runId = fb6bbd2c-b922-4660-bf44-932e46c2d] terminated with exception: Job aborted.
Caused by: Job aborted.
Caused by: Job aborted due to stage failure.
Caused by: FileReadException: Error while reading file /mnt/xxxx/testfile.zip.
Caused by: The length of dbfs:/mnt/xxx.zip is 2320548102, which exceeds the max length allowed: 2147483647.
My code is below. From what I understand, the binaryFile format loads each file into a single content column, which cannot exceed Int.MaxValue (2147483647) bytes, and this file is about 2.3 GB.
// Auto Loader stream over the existing zip files
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("recursiveFileLookup", "true")
  .option("pathGlobFilter", "*.zip")
  .schema("path string, modificationTime timestamp, length long, content binary")
  .load(mountPath("testpath"))

// Write the path and content of each file to the Delta table
val write_query = df
  .select("path", "content")
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpoint_path)
  .start(write_path)

write_query.awaitTermination()
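
To make the question concrete, this is roughly what I mean by writing in chunks. It is only a rough, untested sketch: chunkSize, the hard-coded source path, and the chunk_index column are names I made up, and write_path is the same Delta location as above. The idea is to read fixed-size slices of the file through the Hadoop FileSystem API and append one row per slice, so no single row has to hold the full 2.3 GB of content.

import org.apache.hadoop.fs.Path

val chunkSize = 256 * 1024 * 1024                        // placeholder: 256 MB per row
val srcPath   = new Path("dbfs:/mnt/xxxx/testfile.zip")  // placeholder: the big zip file
val fs        = srcPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
val fileLen   = fs.getFileStatus(srcPath).getLen

import spark.implicits._

val in = fs.open(srcPath)
try {
  var offset = 0L
  var index  = 0
  while (offset < fileLen) {
    val len = math.min(chunkSize.toLong, fileLen - offset).toInt
    val buf = new Array[Byte](len)
    in.readFully(offset, buf, 0, len)                    // read one slice of the file
    Seq((srcPath.toString, index, buf))                  // one row per chunk
      .toDF("path", "chunk_index", "content")
      .write
      .format("delta")
      .mode("append")
      .save(write_path)                                  // same Delta location as above
    offset += len
    index  += 1
  }
} finally {
  in.close()
}

I'm not sure whether appending batch writes like this plays well with the streaming checkpoint above, so pointers on a more idiomatic way to do chunked writes would be appreciated.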