Fast check for the existence of a tag in a large XML file using Python cElementTree

Question

I have XML files ranging from hundreds of megabytes to tens of gigabytes and use Python's cElementTree to process them. Because memory is limited and parsing is slow, I don't want to load the whole file with et.parse and then call find or findall to check whether a tag exists (I haven't actually tried this). Currently I use et.iterparse to iterate over all tags. When the tag is located near the end of the file, this is also very slow. Is there a better way to check for the tag and get its location? If I know the top-level element (e.g., index) that contains the tag, and that element is much smaller than the other parts of the file, is it possible to iterate over only the top-level elements and then parse just that part? I searched online but, surprisingly, found no related questions. Am I missing anything? Thanks in advance.
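
For reference, the iterparse scan described above might look roughly like the sketch below. The function name `tag_exists_iterparse` is hypothetical, and the sketch assumes Python 3, where `xml.etree.ElementTree` picks up the C accelerator automatically (the separate `cElementTree` module was removed in Python 3.9). It stops at the first match, but it still has to parse everything that comes before the tag, which is why a tag near the end of the file is slow to find.

```python
import io
import xml.etree.ElementTree as ET


def tag_exists_iterparse(source, tag):
    """Stream through `source` and return True at the first element named `tag`.

    `source` is a filename or a binary file object. For namespaced documents
    the tag must be given in the "{namespace-uri}local" form.
    """
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == tag:
            return True  # stop early; nothing after this point is parsed
        if event == "end":
            # Drop the finished element's children, text and attributes so
            # memory stays bounded (empty elements remain attached to the root).
            elem.clear()
    return False


if __name__ == "__main__":
    sample = b"<root><a/><index><my_tag x='1'/></index></root>"
    print(tag_exists_iterparse(io.BytesIO(sample), "my_tag"))
```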

Does this answer your question? [Using Python Iterparse For Large XML Files](https://stackoverflow.com/questions/7171140/using-python-iterparse-for-large-xml-files) — stovfl, Mar 07 '20 at 09:15
@stovfl No! I said I have already used `iterparse` to go through all tags to avoid the memory problem, the same as what is posted there. Please read my question carefully. — Elkan, Mar 07 '20 at 09:18
***"to avoid memory problem"***: The accepted answer shows how to avoid the ***memory problem***. — stovfl, Mar 07 '20 at 09:23
@stovfl That problem has already been solved by using the `clear` method. I mean that it is still very slow to iterate through all tags with `iterparse` to find out whether the tag exists. My question is whether I can achieve this in a much faster way. That is what I asked about, not the memory problem. — Elkan, Mar 07 '20 at 09:26

score 0 · Accepted Answer · answered Mar 07 '20 at 09:37

I solved this by reading the file block by block instead of parsing it with cElementTree. My tags are close to the end of the file, so, following this answer, I read one block of content of a specified size block_size at a time, starting from the end of the file, using the file.seek and file.read methods (line = f.read(block_size)), and then simply test "<my_tag " in line (or a more specific tag name to avoid ambiguity) to check whether the tag exists. This is much faster than using iterparse to go through all tags.
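
A minimal sketch of that backward block scan follows. It is not the exact code I ran: the function name `tag_in_tail`, the default block size, and the `max_bytes` limit are illustrative assumptions. It opens the file in binary mode, reads blocks from the end toward the start, and keeps a few bytes of overlap so that a tag split across two blocks is still found. It returns the byte offset of the last occurrence, or -1 if there is none.

```python
import os


def tag_in_tail(path, needle=b"<my_tag ", block_size=1 << 20, max_bytes=None):
    """Search backwards from the end of `path` for the byte string `needle`.

    Returns the byte offset of the last occurrence, or -1 if not found.
    `max_bytes` (optional) limits the search to the last `max_bytes` bytes.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1  # enough to catch a needle split across blocks
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        file_size = f.tell()
        stop = 0 if max_bytes is None else max(0, file_size - max_bytes)
        end = file_size
        tail = b""  # leading bytes of the data that follows the current block
        while end > stop:
            start = max(stop, end - block_size)
            f.seek(start)
            data = f.read(end - start) + tail
            pos = data.rfind(needle)
            if pos != -1:
                return start + pos
            tail = data[:overlap]
            end = start
    return -1


if __name__ == "__main__":
    import sys
    print(tag_in_tail(sys.argv[1]))
```

Note that this is a plain byte search, not XML parsing: it can also match text inside comments or CDATA sections, `b"<my_tag "` with a trailing space will not match a bare `<my_tag>`, and it assumes an ASCII-compatible encoding such as UTF-8.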
