I have XML files ranging from hundreds of megabytes to tens of gigabytes and use Python's `cElementTree` to process them. Because of limited memory and speed, I don't want to load the whole file with `et.parse` and then call `find` or `findall` to check whether a tag exists (I haven't actually tried this). For now I simply use `et.iterparse` to iterate through all tags. When the tag is located close to the end of the file, this can be very slow as well. Is there a better way to check for the tag and get its location? If I know the top-level element (e.g., index) that contains the tag, and that element is much smaller than the other parts of the file, is it possible to iterate over only the top-level tags and then parse just that part? I searched online but, surprisingly, found no related questions. Am I missing anything? Thanks in advance.
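For reference, a minimal sketch of the kind of `iterparse` scan described above. The function name `tag_exists` is hypothetical, and the sketch uses `xml.etree.ElementTree` because `cElementTree` was removed in Python 3.9. It stops at the first opening tag that matches, but it still has to parse everything that comes before that tag, which is why a tag near the end of the file is slow to find.

```python
import xml.etree.ElementTree as ET


def tag_exists(source, tag):
    """Return True as soon as an element named `tag` starts in `source`.

    `source` may be a file name or a binary file object.
    """
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == tag:
                return True  # stop at the first matching opening tag
        else:
            # Free the text and children of elements that are finished,
            # so memory stays bounded while scanning a large file.
            elem.clear()
    return False
```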
Elkan
- Does this answer your question? [Using Python Iterparse For Large XML Files](https://stackoverflow.com/questions/7171140/using-python-iterparse-for-large-xml-files) – stovfl Mar 07 '20 at 09:15
- @stovfl No! I said I have already used `iterparse` to go through all tags to avoid the memory problem, the same as what has been posted there. Please read my question carefully. – Elkan Mar 07 '20 at 09:18
- ***"to avoid memory problem"***: The accepted answer shows how to avoid the ***memory problem***. – stovfl Mar 07 '20 at 09:23
- @stovfl That problem has already been solved by using the `clear` method. I mean it's still very slow to iterate through all tags using `iterparse` to find out whether the tag exists. My question is: can I achieve this in a much faster way? This is what I asked for, not the memory problem. – Elkan Mar 07 '20 at 09:26
1 Answer
I solved this by reading the file block by block instead of parsing it with `cElementTree`. My tags are close to the end of the file, so, following this answer, I read one block of `block_size` bytes at a time, starting from the end of the file, using `file.seek` and `file.read` (e.g., `line = f.read(block_size)`), and then simply check `"<my_tag " in line` (or a more specific tag name to avoid ambiguity) to see whether the tag exists. This is much faster than using `iterparse` to go through all tags.
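A minimal sketch of this backward block scan, not the exact code I ran. The names `tag_near_end`, `_contains_tag` and `max_blocks` are hypothetical. It assumes the tag name is plain text in a UTF-8 or ASCII-compatible file. Unlike the simple `in` check, it carries a few bytes over between blocks so that a tag split across a block boundary is still found, and it accepts `>`, `/` or whitespace after the tag name rather than only a space.

```python
import os


def _contains_tag(buf, needle):
    """True if `needle` occurs in `buf` followed by whitespace, '>' or '/'."""
    i = buf.find(needle)
    while i != -1:
        nxt = buf[i + len(needle): i + len(needle) + 1]
        if nxt and nxt in b" \t\r\n>/":
            return True
        i = buf.find(needle, i + 1)
    return False


def tag_near_end(path, tag, block_size=1 << 20, max_blocks=None):
    """Scan `path` backwards in blocks and report whether `<tag` occurs.

    `max_blocks` optionally limits how far back from the end to look.
    """
    needle = b"<" + tag.encode("utf-8")
    overlap = len(needle)  # bytes kept so a match split across blocks is seen
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""  # first bytes of the block read in the previous step
        blocks = 0
        while pos > 0 and (max_blocks is None or blocks < max_blocks):
            start = max(0, pos - block_size)
            f.seek(start)
            chunk = f.read(pos - start) + tail
            if _contains_tag(chunk, needle):
                return True
            tail = chunk[:overlap]
            pos = start
            blocks += 1
    return False
```

Because this is a plain byte search, it also matches text inside comments or CDATA sections, so it only tells you that the string is present, not that a real element exists.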

Elkan