The high-level problem I'm trying to solve is that I have a 1.5 GB SMS data dump, and I want to filter the file to keep only the messages to and from a single contact.
I am using lxml in Python to parse the file, but let me know if there are better options.
The structure of the XML file is like this:
SMSES (root node)
    'count': 'xxxx',
    (Children):
        MMS
            'address': 'xxxx',
            'foo': 'bar',
            ... : ...,
            (Children)
                'other fields': 'that _do not_ specify address',
        MMS
            'address': 'xxxx',
            'foo': 'bar',
            ... : ...,
            (Children)
                'other fields': 'that _do not_ specify address'
That is, I want to traverse the children of the root node, and for every MMS whose 'address' does not match a specific value, remove that MMS and all its descendants (the children tend to hold items like images, etc.).
What I've tried:
I have found question/answers like this: how to remove an element in lxml
But these threads tend to have simple examples without nested elements.
- It's not clear to me how to use tree.xpath() to find elements that do not match a value.
- It's not clear to me whether calling remove(item) removes the item's descendants (which I want in this case).
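To make these two points concrete, here is a small experiment on a toy document. It is only a sketch: the tag names smses, mms and part and all attribute values are made up for illustration, and I am assuming that the XPath predicate not(@address='0000') is a reasonable way to express "does not match".

```python
from lxml import etree

# Toy stand-in for the real backup; tag names and values are assumptions.
SAMPLE = b"""<smses count="2">
  <mms address="0000" foo="bar"><part ct="text/plain"/></mms>
  <mms address="1111" foo="bar"><part ct="image/jpeg"/></mms>
</smses>"""

root = etree.fromstring(SAMPLE)

# Direct children of the root whose address is not '0000'.
# This predicate also matches children with no address attribute at all.
unwanted = root.xpath("./*[not(@address='0000')]")

for elem in unwanted:
    # Detach elem from the root; its subtree goes with it.
    root.remove(elem)

remaining = [child.get("address") for child in root]
parts_left = len(root.findall(".//part"))
print(remaining, parts_left)
```

On this toy input the nested part element disappears together with its parent, but I would like confirmation that this holds in general and scales to the full file.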
I've tried a very naive approach, in which I obtain an iterator, and then walk through the tree, removing elements as I go:
from lxml.etree import XMLParser, parse

p = XMLParser(huge_tree=True)
tree = parse('backup.xml', parser=p)
it = tree.iter()
item = next(it)  # consume root node
for item in it:
    if item.attrib['address'] != '0000':
        item.getparent().remove(item)
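A minimal reproduction of how this loop fails, using a toy in-memory document instead of backup.xml. The tag names smses, mms and part and the attribute values are assumptions made for illustration only.

```python
from lxml import etree

# Toy document: one matching message with one child that has no address.
SAMPLE = b'<smses count="1"><mms address="0000"><part ct="text/plain"/></mms></smses>'

root = etree.fromstring(SAMPLE)
it = root.iter()
next(it)  # consume root node

error = None
try:
    for item in it:
        # Same check as in my script; iter() also visits nested elements.
        if item.attrib["address"] != "0000":
            item.getparent().remove(item)
except KeyError as exc:
    # Raised when the loop reaches the nested element without 'address'.
    error = exc

print(type(error).__name__)
```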
The problem with this script is that tree.iter() walks the tree depth-first, so it also visits the children of each MMS element, and those children do not have the address attribute, so item.attrib['address'] fails on them. So, I am looking for:
- What is the most efficient + reasonably easy way to accomplish my task?
- Otherwise, how can I force tree.iter() to give me a BFS iterator over only the first-degree neighbors of the root?
- Does remove(item) indeed remove all descendants, or does it attach the children to the parent?
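To show the behaviour I am after, here is a rough sketch on a toy document. It relies on iterating an element directly (rather than calling iter()), which yields only its direct children. The helper name keep_only, the tag names and the attribute values are hypothetical, and I have not tried this on the full 1.5 GB file.

```python
from lxml import etree


def keep_only(root, address):
    """Remove every direct child of root whose 'address' differs from address.

    Hypothetical helper for illustration; returns the number of children removed.
    """
    removed = 0
    # list(root) snapshots the direct children, so removing inside the loop
    # does not disturb the iteration.
    for child in list(root):
        # .get() returns None instead of raising when the attribute is absent.
        if child.get("address") != address:
            root.remove(child)
            removed += 1
    return removed


# Toy stand-in for the real backup; tag names and values are assumptions.
SAMPLE = b"""<smses count="3">
  <mms address="0000"><part ct="text/plain"/></mms>
  <mms address="1111"><part ct="image/jpeg"/></mms>
  <mms address="0000"><part ct="image/png"/></mms>
</smses>"""

root = etree.fromstring(SAMPLE)
removed = keep_only(root, "0000")
kept = [child.get("address") for child in root]
print(removed, kept)
```

Note that this sketch leaves the root's count attribute untouched, so it would no longer match the number of remaining messages.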
Thank you for taking the time to read. Sorry if this is a naive question -- parsing XML files isn't really my bread and butter, and the lxml documentation was difficult for me to read as a novice.
Thanks!