I have some very big XML files (roughly 100-150 MB each).
One element in my XML is M (for member), which is a child of HH (household), i.e. each household contains one or more members.
What I need to do is take all the members that satisfy some conditions and then process them further in a rather complicated function. The conditions can change, and can apply to both the household and the member: for example, only members from high-income households (a constraint on the household) whose age is between 18 and 49 (a constraint on the member).
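To make the two kinds of condition concrete, here is a minimal sketch of what such checks could look like. The attribute names `income` and `age` and the `HIGH_INCOME` threshold are placeholders for illustration only, not the actual schema; the functions rely only on the element `.get()` method, which lxml and the standard library both provide.

```python
try:
    import lxml.etree as ET
except ImportError:  # stdlib fallback; elements expose the same .get()/.iter()
    import xml.etree.ElementTree as ET

HIGH_INCOME = 100_000  # hypothetical threshold


def is_valid_hh(hh):
    """Household-level constraint: keep only high-income households."""
    return float(hh.get("income", 0)) >= HIGH_INCOME


def is_valid_member(m):
    """Member-level constraint: keep only ages 18-49 (inclusive)."""
    age = m.get("age")
    return age is not None and 18 <= int(age) <= 49


# Tiny demo household with two members (made-up data).
hh = ET.fromstring('<H income="120000"><M age="30"/><M age="70"/></H>')
kept = [m.get("age") for m in hh.iter("M")
        if is_valid_hh(hh) and is_valid_member(m)]
print(kept)  # ['30']
```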
This is what I'm doing:
import lxml.etree as ET

all_members = []
tree = ET.parse(whole_path)
root = tree.getroot()
HH_str = '//H'  # get all the households
HH = tree.xpath(HH_str)
for H in HH:
    # check if the household satisfies the condition
    if is_valid_hh(H):
        M_str = './/M'
        M = H.xpath(M_str)
        for m in M:
            if is_valid_member(m):
                all_members.append(m)
for member in all_members:
    pass  # do something complicated
The problem is that this uses up all my memory (and I have 32 GB)! How can I iterate over the XML elements more efficiently?
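One possible direction is to stream the file with `iterparse` instead of building the whole tree, handling one household at a time and clearing it afterwards. The sketch below is only an illustration under assumptions: the helper name `iter_valid_members`, the `income`/`age` attributes, and the sample data are all made up, and it has not been measured on files of the size described above.

```python
import io

try:
    import lxml.etree as ET
except ImportError:  # stdlib fallback; iterparse is used the same way here
    import xml.etree.ElementTree as ET


def iter_valid_members(source, is_valid_hh, is_valid_member,
                       hh_tag="H", member_tag="M"):
    """Yield matching members, one household at a time.

    Each yielded element must be used before requesting the next one,
    because its household is cleared once its members have been yielded.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != hh_tag:
            continue  # only act once a whole household has been parsed
        if is_valid_hh(elem):
            for m in elem.iter(member_tag):
                if is_valid_member(m):
                    yield m
        # Drop the household's children, text and attributes.
        elem.clear()


# Made-up sample data; the attribute names are hypothetical.
sample = b"""<root>
  <H income="high"><M age="30"/><M age="70"/></H>
  <H income="low"><M age="25"/></H>
</root>"""

ages = [
    int(m.get("age"))  # "do something complicated" would go here
    for m in iter_valid_members(
        io.BytesIO(sample),
        is_valid_hh=lambda hh: hh.get("income") == "high",
        is_valid_member=lambda m: 18 <= int(m.get("age")) <= 49,
    )
]
print(ages)  # [30]
```

Note that this processes each member as it is produced rather than collecting elements in a list like `all_members`; keeping references to the elements would keep their data alive. Cleared households remain in the tree as empty shells, which is usually small compared with their original content.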
Any help will be appreciated.