Read data from XML files which are zipped in PySpark

Asked Feb 15 '19 at 14:15

Active Feb 15 '19 at 14:23

Viewed 753 times

I need to read multiple XML files in PySpark. The files are zipped together in a single archive.

Zip File Size: 30 GB

Size When Unzipped: 600 GB

Max size of a single file: 40 GB

Time taken to extract: 4 Hours

I am able to read the extracted XML data with a predefined schema using the Databricks API. However, extracting the data takes a lot of time. Is there a way to read the data directly from the zip file rather than extracting it first?

Thanks in advance!!!!

edited Feb 15 '19 at 14:23

asked Feb 15 '19 at 14:15

Divya Teja


@ForceBru: I am more concerned about reading in Spark than about using native Python. As far as I understand, the two questions are different. – Divya Teja Feb 15 '19 at 14:21
Is it 600 GB in a single file? – 10465355 Feb 15 '19 at 14:37
No, multiple files. – Divya Teja Feb 15 '19 at 14:38
See [Read whole text files from a compression in Spark](https://stackoverflow.com/q/36604145/10465355). – 10465355 Feb 15 '19 at 14:49
How are they zipped? Gzip? – Blokje5 Feb 15 '19 at 14:51
Windows zipped file *.zip – Divya Teja Feb 18 '19 at 05:41
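
The idea behind the linked question can be sketched roughly as follows: let Spark hand each `.zip` to a Python function as raw bytes, then stream the XML members out of the archive with the standard `zipfile` module instead of extracting them to disk. This is only a sketch under assumptions, not a tested solution for this data set. The names `iter_xml_records`, `zipped_xml_rdd`, and `record_tag` are hypothetical, and the records are returned as plain dicts rather than parsed with the predefined schema. Note also that a `.zip` is not splittable and `binaryFiles` loads each archive as one in-memory value inside one task, so a single 30 GB archive would most likely not fit. The approach is more plausible if the data can be repackaged as many smaller archives.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_records(zip_bytes, record_tag):
    """Yield (member_name, {child_tag: text}) for every <record_tag> element
    found in the .xml members of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip non-XML members
            # archive.open() decompresses as a stream; nothing is written to disk.
            with archive.open(name) as member:
                for _, elem in ET.iterparse(member, events=("end",)):
                    if elem.tag == record_tag:
                        yield name, {child.tag: child.text for child in elem}
                        elem.clear()  # drop the parsed subtree to limit memory use


def zipped_xml_rdd(sc, path, record_tag):
    """Hypothetical wiring: `sc` is an existing SparkContext and `path` points
    at one or more .zip files. binaryFiles yields (file_path, content) pairs,
    with each archive processed whole by a single task."""
    return sc.binaryFiles(path).flatMap(
        lambda kv: iter_xml_records(kv[1], record_tag)
    )
```

With a running SparkContext, `zipped_xml_rdd(sc, "/path/to/archives/*.zip", "row")` would give an RDD of `(member_name, dict)` pairs that can then be turned into a DataFrame. The path and the `row` tag here are placeholders.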

