I am attempting to parse a large XML file using xml.etree.ElementTree. The file is read from Azure Blob Storage and subsequently parsed.
Here is what I use to read the file into my script:
from io import BytesIO
import xml.etree.ElementTree as ET
from azure.storage.blob import BlobClient

blobclient = BlobClient.from_blob_url(blob_url)
data = blobclient.download_blob()
tree = ET.parse(BytesIO(data.readall()))
root = tree.getroot()
This step alone takes quite a while to execute, and the files I will be reading are approximately 9 GB each. Is there any way to speed this up?
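One alternative I am considering is avoiding the `readall()` buffering entirely by wrapping the download stream in a file-like object and handing it straight to the parser, so the whole blob never sits in memory at once. Below is a rough sketch of that wrapper idea; the `ChunkStream` class and the fake chunk list are my own invention for illustration, with `io.BytesIO`-derived chunks standing in for whatever the Azure SDK's streaming download would yield:

```python
import io
import xml.etree.ElementTree as ET

class ChunkStream(io.RawIOBase):
    """Minimal read-only file object over an iterator of byte chunks."""
    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buf = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the internal buffer from the chunk iterator as needed.
        while not self._buf:
            try:
                self._buf = next(self._chunks)
            except StopIteration:
                return 0  # EOF
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n

# Stand-in for a streamed blob download (a real one would come from the SDK):
xml_bytes = b"<root><some_element><a>1</a></some_element></root>"
fake_chunks = [xml_bytes[i:i + 8] for i in range(0, len(xml_bytes), 8)]

stream = io.BufferedReader(ChunkStream(fake_chunks))
tags = [elem.tag for _, elem in ET.iterparse(stream, events=("end",))]
print(tags)  # → ['a', 'some_element', 'root']
```

I have not verified this against the real blob stream, so treat it as a sketch of the approach rather than working code.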
Next, I will be parsing the file and that code would look something like this,
some_elements = []
for some_element in root.iter('some_element'):
    result = tuple(child.text for child in some_element)
    some_elements.append(result)
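For a concrete picture of what each such block produces, here is the same pattern run end-to-end on a tiny made-up document (the element names and values are invented for illustration):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

doc = b"""<root>
  <some_element><name>a</name><value>1</value></some_element>
  <some_element><name>b</name><value>2</value></some_element>
</root>"""

root = ET.parse(BytesIO(doc)).getroot()

some_elements = []
for some_element in root.iter('some_element'):
    # Collect the text of every direct child into one tuple per element.
    some_elements.append(tuple(child.text for child in some_element))

print(some_elements)  # → [('a', '1'), ('b', '2')]
```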
I have several similar blocks of code like this to parse the other elements I am interested in (some_elements2, some_elements3, and so on). Is there a performance cost to doing this? I have read that the following is a faster option:
for event, elem in ET.iterparse(file, events=("start", "end")):
    # parse elements
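From what I have read, the usual iterparse pattern also clears each element after processing it so the in-memory tree does not keep growing while scanning the file. Here is a self-contained sketch of that pattern as I understand it, with the sample document invented for illustration:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

doc = b"<root>" + b"".join(
    b"<some_element><value>%d</value></some_element>" % i for i in range(3)
) + b"</root>"

values = []
for event, elem in ET.iterparse(BytesIO(doc), events=("end",)):
    if elem.tag == "some_element":
        values.append(tuple(child.text for child in elem))
        # Drop the element's children so memory stays bounded
        # even on multi-gigabyte inputs.
        elem.clear()

print(values)  # → [('0',), ('1',), ('2',)]
```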
I would like an explanation of the performance considerations here. I am asking because when I run my script on the large files mentioned above, execution hangs and never completes; on smaller files it works without issue.