I need to read multiple XML files that are zipped together in a single archive, using PySpark.
Zip File Size: 30 GB
Size When Unzipped: 600 GB
Max size of a single file: 40 GB
Time taken to extract: 4 hours
I am able to read the extracted XML data with a predefined schema using the Databricks API, but a lot of time is spent just extracting the archive. Is there a way to read the data directly from the zip file instead of extracting it first?
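To make the question concrete, here is a rough sketch of the kind of direct read I have in mind. It uses only the Python standard library (`zipfile` and `xml.etree.ElementTree`), not Spark, and streams each member out of the archive without writing it to disk. The helper names `iter_records` and `build_demo_zip`, the `row` tag, and the tiny in-memory archive are made up for illustration; my real files and schema are different, and I have not verified that this approach would scale to a 40 GB member or how it would be distributed across executors.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_records(zip_source, row_tag):
    """Yield (member_name, {child_tag: text}) for each <row_tag> element
    found in every .xml member of the archive, without extracting to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            # archive.open() returns a file-like object that decompresses
            # the member incrementally as it is read.
            with archive.open(info) as member:
                # iterparse reads the stream in chunks instead of loading
                # the whole document at once.
                for _event, elem in ET.iterparse(member, events=("end",)):
                    if elem.tag == row_tag:
                        yield info.filename, {c.tag: c.text for c in elem}
                        elem.clear()  # drop the children of a processed row


def build_demo_zip():
    """Build a small in-memory zip with two XML members (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("a.xml", "<rows><row><id>1</id><name>x</name></row></rows>")
        archive.writestr("b.xml", "<rows><row><id>2</id><name>y</name></row></rows>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, record in iter_records(build_demo_zip(), "row"):
        print(name, record)
```

As far as I understand, a zip archive is not a splittable format, so I am unsure whether something like this can be parallelised in PySpark beyond one task per archive.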
Thanks in advance!