I need to read multiple XML files in PySpark that are zipped together into a single archive.

Zip File Size: 30 GB

Size When Unzipped: 600 GB

Max size of a single file: 40 GB

Time taken to extract: 4 Hours

I am able to read the extracted XML data with a predefined schema using the Databricks API, but most of the time is spent extracting the data in the first place. Is there a way to read the data directly from the zip file instead of extracting it first?
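For illustration, here is a minimal sketch of what reading directly from the archive could look like in plain Python (not Spark). It streams each XML member through the standard-library `zipfile` and `xml.etree.ElementTree.iterparse`, so nothing is written to disk. The function name `iter_xml_records`, the `row` record tag, and the tiny in-memory demo archive are all hypothetical and only stand in for the real data. Note that a zip member is not splittable, so each file would still be decompressed sequentially by a single reader.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_records(zip_source, record_tag):
    """Yield (member_name, record_dict) for every <record_tag> element in
    each .xml member of the archive, without extracting files to disk.

    record_tag is a hypothetical name for the repeating element; each
    record is flattened to {child_tag: child_text}.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            # ZipFile.open returns a file-like object that decompresses lazily.
            with archive.open(info) as member:
                for _event, elem in ET.iterparse(member, events=("end",)):
                    if elem.tag == record_tag:
                        record = {child.tag: child.text for child in elem}
                        elem.clear()  # drop the record's children once copied
                        yield info.filename, record


def build_demo_zip():
    """Build a small in-memory archive standing in for the real 30 GB file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("a.xml", "<rows><row><id>1</id><name>x</name></row></rows>")
        archive.writestr("b.xml", "<rows><row><id>2</id><name>y</name></row></rows>")
        archive.writestr("readme.txt", "ignored")  # non-XML members are skipped
    buf.seek(0)
    return buf


records = list(iter_xml_records(build_demo_zip(), "row"))
for name, record in records:
    print(name, record)
```

In practice `zip_source` would be a path to the real archive rather than an in-memory buffer. Whether something like this can be distributed across Spark executors, given that the whole archive is a single 30 GB file, is the part I am unsure about.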

Thanks in advance!

Divya Teja