
I have some very big XML files (~100-150 MB each).

One element in my XML is M (member), which is a child of HH (household) - i.e. each household contains one or more members.

What I need to do is select all the members that satisfy some conditions, then run them through a rather complicated processing function. The conditions can change, and can apply both to the household and to the members: for example, only members whose age is between 18 and 49 (a constraint on the member) from households with high income (a constraint on the household).
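To make the setup concrete, here is a minimal sketch of what such predicates might look like. The element and attribute names (`income`, `age`) and the threshold are made up for illustration; the same element API works with `lxml.etree`:

```python
import xml.etree.ElementTree as ET

# Hypothetical predicates -- attribute names and thresholds are
# assumptions, not the OP's real schema.
def is_valid_hh(hh, income_threshold=100_000):
    """Household-level constraint: high income."""
    income = hh.get('income')
    return income is not None and int(income) >= income_threshold

def is_valid_member(m):
    """Member-level constraint: age between 18 and 49."""
    age = m.get('age')
    return age is not None and 18 <= int(age) <= 49

root = ET.fromstring(
    '<pop><HH income="120000"><M age="30"/><M age="70"/></HH>'
    '<HH income="20000"><M age="25"/></HH></pop>'
)
selected = [m for hh in root.iter('HH') if is_valid_hh(hh)
            for m in hh.iter('M') if is_valid_member(m)]
print(len(selected))  # 1 -- only the 30-year-old in the high-income household
```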

This is what I'm doing:

import lxml.etree as ET

all_members = []
tree = ET.parse(whole_path)
root = tree.getroot()
HH = tree.xpath('//HH')  # get all the households
for H in HH:
    # check if the household satisfies the condition
    if is_valid_hh(H):
        for m in H.xpath('.//M'):
            if is_valid_member(m):
                all_members.append(m)

for member in all_members:
    # do something complicated

The problem is that this consumes all my memory (and I have 32 GB)! How can I iterate over XML elements more efficiently?

any help will be appreciated...

Binyamin Even
  • Possible duplicate of [using lxml and iterparse() to parse a big (+- 1Gb) XML file](https://stackoverflow.com/questions/9856163/using-lxml-and-iterparse-to-parse-a-big-1gb-xml-file) – Tai Dec 24 '17 at 18:25
  • @Tai - I tried to use `iterparse()` and couldn't figure out how. can you help me with that? – Binyamin Even Dec 24 '17 at 18:30
  • Can I have a sample of your data? – Tai Dec 24 '17 at 18:36
  • unfortunately no, it's confidential. but I wrote the question in a rather general form... – Binyamin Even Dec 24 '17 at 18:42
  • @BinyaminEven anonymize a chunk of your data, or make up some similar piece of data that has the same structure with your own data and share here. That way folks can solve your problem in a heartbeat. – FatihAkici Dec 24 '17 at 18:49
  • No problem. Study the post and the official document. I think you can figure it out after some experiments. – Tai Dec 24 '17 at 18:51

1 Answer


etree is going to consume a lot of memory (yes, even with iterparse()), and SAX is really clunky. However, pulldom to the rescue!

from xml.dom import pulldom

doc = pulldom.parse('large.xml')
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'HH':
        # Node is 'empty' here
        doc.expandNode(node)
        # Now we've got the whole household subtree
        if is_valid_hh(node):
            ...  # do things

It's one of those libraries that nobody seems to know about unless they've had to use it. Docs at e.g. https://docs.python.org/3.7/library/xml.dom.pulldom.html
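One caveat worth spelling out: pulldom yields minidom nodes, so inside the expanded household you select with the DOM API (`getElementsByTagName`, `getAttribute`) rather than XPath. A self-contained sketch, with made-up attribute names and thresholds:

```python
from xml.dom import pulldom

xml_data = ('<pop><HH income="120000"><M age="30"/><M age="70"/></HH>'
            '<HH income="20000"><M age="25"/></HH></pop>')

members = []
doc = pulldom.parseString(xml_data)  # use pulldom.parse(path) for a file
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'HH':
        doc.expandNode(node)  # pull the whole household subtree into memory
        if int(node.getAttribute('income')) >= 100_000:   # household constraint
            for m in node.getElementsByTagName('M'):
                if 18 <= int(m.getAttribute('age')) <= 49:  # member constraint
                    members.append(m)

print(len(members))  # 1
```

Only one household subtree is ever expanded at a time, which is what keeps the memory footprint small.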

Fredrik Håård
  • `iterparse()` should be fine with the OP's case; the OP's files are not that big. But good to know another library. – Tai Dec 25 '17 at 12:01
  • Quick testing (on Windows) shows even a 300MB file can take over 2GB to parse with iterparse - maybe depending on contents? Either way, if running on anything but a dev station, that is a _lot_ of unexpected memory usage. – Fredrik Håård Jan 03 '18 at 21:53
  • Not sure though. I previously parsed a 1 GB file on my 3 GB memory laptop and it was fine. Did you clear the root whenever you didn't need it? – Tai Jan 03 '18 at 21:56
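For reference, the clearing pattern Tai describes looks like this. A sketch with the stdlib `xml.etree.ElementTree` (lxml's `iterparse` works the same way); element and attribute names are made up:

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = (b'<pop><HH income="120000"><M age="30"/><M age="70"/></HH>'
             b'<HH income="20000"><M age="25"/></HH></pop>')

ages = []
# Process each household at its closing tag, then clear it so its
# children can be freed and memory stays roughly bounded.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=('end',)):
    if elem.tag == 'HH':
        if int(elem.get('income', 0)) >= 100_000:
            ages.extend(m.get('age') for m in elem.iter('M'))
        elem.clear()  # drop the household's children after processing

print(ages)  # ['30', '70']
```

Without the `elem.clear()` call, the whole tree is still built incrementally, which is the usual cause of `iterparse` "leaking" memory.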