
I am attempting to parse a large XML file using xml.etree.ElementTree. The file is read from Azure Blob Storage and subsequently parsed.

Given below is what I have used to read the file into my script:

import xml.etree.ElementTree as ET
from io import BytesIO
from azure.storage.blob import BlobClient

blobclient = BlobClient.from_blob_url(blob_url)
data = blobclient.download_blob()

tree = ET.parse(BytesIO(data.readall()))
root = tree.getroot()

This step alone takes quite a bit of time to execute. Note that the files I will be reading will be approximately 9 GB in size. Is there any way to speed this up?
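One way to avoid materialising the whole ~9 GB blob in memory (which is what `BytesIO(data.readall())` does) is to stream the download straight to a temporary file and parse from disk. A sketch, assuming the same `BlobClient` from `azure-storage-blob`; the helper name is illustrative:

```python
import tempfile


def stream_blob_to_file(blob_url):
    """Download a blob to a temporary file without buffering it all in memory."""
    # Imported inside the function so the sketch stands alone;
    # requires the azure-storage-blob package.
    from azure.storage.blob import BlobClient

    blob_client = BlobClient.from_blob_url(blob_url)
    downloader = blob_client.download_blob()
    tmp = tempfile.NamedTemporaryFile(suffix=".xml", delete=False)
    with tmp:
        downloader.readinto(tmp)  # streams the blob in chunks to disk
    return tmp.name
```

The file on disk can then be handed to an incremental parser (e.g. `ET.iterparse(path)`) instead of loading it whole.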

Next, I will be parsing the file and that code would look something like this,

some_elements = []
for some_element in root.iter('some_element'):
    result = tuple(child.text for child in some_element)
    some_elements.append(result)

I have several similar blocks of code like this to parse the other elements I am interested in (some_elements2, some_elements3 and so on). Is there a performance cost to doing this? I have read that the following is a faster option:

for event, elem in ET.iterparse(file, events=("start", "end")):
    # parse elements

I would like an explanation of the performance considerations here. I am asking because when I run my script on large files like the ones mentioned, execution hangs and never completes; on smaller files it works without issue.
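For context, the key difference is that `ET.parse` builds the entire tree in memory before `root.iter()` ever runs, while `iterparse` yields elements as they are read and lets you discard each one with `elem.clear()` once processed, so memory stays roughly flat. A minimal, self-contained sketch (the element names are placeholders matching the question):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

xml_bytes = b"""<root>
  <some_element><a>1</a><b>2</b></some_element>
  <some_element><a>3</a><b>4</b></some_element>
</root>"""

some_elements = []
# "end" events fire once an element and all its children are complete.
for event, elem in ET.iterparse(BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "some_element":
        some_elements.append(tuple(child.text for child in elem))
        elem.clear()  # drop the element's children so memory is reclaimed

print(some_elements)  # [('1', '2'), ('3', '4')]
```

With a 9 GB file, this pattern avoids ever holding more than one `some_element` subtree at a time.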

Minura Punchihewa
  • One of the problems with `xml.dom` and `xml.etree` is that they have to read the entire XML file into a Python data structure, and that's time- and memory-intensive. Depending on what you need, you might find `xml.sax` to be a better choice. It scans the XML and calls your callbacks to take action, but does not create the entire object in memory. If you just need a few nodes, that might be a better choice. – Tim Roberts Oct 28 '21 at 03:28
  • In your last paragraph, you say execution hangs: which traversal method are you referring to? I can't comment specifically on the technologies you ask about, but I've had general experience parsing large XML files, and the bottom line is if you don't need to hold the full DOM as an object then _don't_ -- use event-based (SAX) parsing, which processes tags without building any structure or holding on to previous elements. Even then, you may need to take extra care to avoid memory fragmentation, depending on your data. – paddy Oct 28 '21 at 03:28
  • The execution hangs when I use the first method, for some_element in root.iter('some_element'): I do actually need to parse the entire file. I need to be able to parse every element in the file, it is just that similar elements are separated out using the code I have given above. Is xml.sax still relevant here in that case? – Minura Punchihewa Oct 28 '21 at 03:39
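The `xml.sax` approach suggested in the comments can be sketched like this: the handler collects the child text of each `some_element` via callbacks, without ever building a tree (element names are placeholders matching the question):

```python
import xml.sax
from io import BytesIO


class SomeElementHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.results = []   # one tuple per some_element
        self._row = None    # children collected for the current element
        self._buffer = None # text fragments of the current child

    def startElement(self, name, attrs):
        if name == "some_element":
            self._row = []
        elif self._row is not None:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "some_element":
            self.results.append(tuple(self._row))
            self._row = None
        elif self._row is not None and self._buffer is not None:
            self._row.append("".join(self._buffer))
            self._buffer = None


handler = SomeElementHandler()
xml.sax.parse(
    BytesIO(b"<root><some_element><a>1</a><b>2</b></some_element></root>"),
    handler,
)
print(handler.results)  # [('1', '2')]
```

For a real blob, `xml.sax.parse` accepts any file-like object, so the downloaded file could be passed in directly.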

0 Answers