
The high-level problem I'm trying to solve is that I have a 1.5 GB SMS data dump, and I want to filter the file so that it keeps only the messages to and from a single contact.

I am using lxml in Python to parse the file, but let me know if there are better options.

The structure of the XML file is like this:

SMSES (root node)
  'count': 'xxxx',
  (Children):
      MMS
          'address': 'xxxx',
          'foo':     'bar',
           ... : ...,
           (Children)
               'other fields': 'that _do not_ specify address',
      MMS
          'address': 'xxxx',
          'foo':     'bar',
           ... : ...,
           (Children)
               'other fields': 'that _do not_ specify address'

That is, I want to traverse the children of the root node and, for every MMS whose 'address' does not match a specific value, remove that MMS and all its descendants (the children tend to hold items like images, etc.).

What I've tried:

I have found questions and answers like this one: "how to remove an element in lxml".

But these threads tend to have simple examples without nested elements.

  • It's not clear to me how to use tree.xpath() to find elements that do not match a value
  • It's not clear to me whether calling remove(item) removes the item's descendants (which I want in this case).
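Both points can be checked on a toy document. The following is only a minimal sketch: the sample XML, the `part` child elements and the address values are made up for illustration and are not taken from the real backup file. It assumes that 'address' is an attribute. The XPath predicate `not(@address = $addr)` selects the direct MMS children whose address differs from the wanted value (or is missing), and `remove()` detaches each selected element together with its whole subtree.

```python
from lxml import etree

# Hypothetical sample data, shaped like the structure described above.
xml = b"""<SMSES count="3">
  <MMS address="0000"><part text="keep me"/></MMS>
  <MMS address="1111"><part text="drop me"/></MMS>
  <MMS address="0000"><part text="keep me too"/></MMS>
</SMSES>"""

root = etree.fromstring(xml)

# Direct MMS children of the root whose address is not the wanted one.
# $addr is an XPath variable, filled in from the keyword argument.
unwanted = root.xpath("MMS[not(@address = $addr)]", addr="0000")

for mms in unwanted:
    # Detaches the element and everything below it from the tree.
    root.remove(mms)

print(etree.tostring(root).decode())
```

After the loop, the `part` child of the removed MMS is no longer reachable from `root`, which suggests that the children are not re-attached to the parent.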

I've tried a very naive approach, in which I obtain an iterator, and then walk through the tree, removing elements as I go:

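from lxml.etree import XMLParser, parse

p = XMLParser(huge_tree=True)
tree = parse('backup.xml', parser=p)

it = tree.iter()
item = next(it)  # consume root node

for item in it:
    # tree.iter() also yields the descendants of each MMS; those have no
    # 'address' attribute, so this lookup raises KeyError on them.
    if item.attrib['address'] != '0000':
        item.getparent().remove(item)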

The problem with this script is that the iterator performs DFS, and the children of MMS elements do not have the address field. So, I am looking for:

  • What is the most efficient + reasonably easy way to accomplish my task?
  • Otherwise, how can I force tree.iter() to give me a BFS iterator over only the first-degree neighbors of the root?
  • Does remove(item) indeed remove all descendants, or does it attach the children to the parent?
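For what it's worth, here is one possible sketch of a lower-memory variant, under the same assumption that 'address' is an attribute on the direct children of the root. It uses `lxml.etree.iterparse` and looks only at elements whose parent is the root, which sidesteps the depth-first issue. The function name `filter_by_address` and the sample data are hypothetical. Because the lxml documentation advises against discarding the current element while the parser is still running, unwanted children are only emptied with `clear()` during parsing and are removed from the root afterwards.

```python
import io
from lxml import etree

def filter_by_address(source, wanted):
    """Parse source and keep only root children whose address == wanted."""
    root = None
    rejected = []
    events = etree.iterparse(source, events=("start", "end"), huge_tree=True)
    for event, elem in events:
        if root is None:
            root = elem  # the first "start" event is the root element
        elif event == "end" and elem.getparent() is root:
            if elem.get("address") != wanted:
                # Free the subtree (images etc.) now; unlink the shell later.
                elem.clear()
                rejected.append(elem)
    for elem in rejected:
        root.remove(elem)
    return root

# Hypothetical sample data; a file path such as 'backup.xml' works as well.
sample = (b'<SMSES count="3">'
          b'<MMS address="0000"><part text="a"/></MMS>'
          b'<MMS address="1111"><part text="b"/></MMS>'
          b'<MMS address="0000"/>'
          b'</SMSES>')
root = filter_by_address(io.BytesIO(sample), "0000")
print(etree.tostring(root).decode())
```

The kept messages still live in memory until the result is written out (for example with `root.getroottree().write(...)`), so this only helps when the messages of the single contact are a small part of the file.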

Thank you for taking the time to read. Sorry if this is a naive question -- parsing XML files isn't really my bread and butter, and the LXML documentation was difficult for me to read as a novice.

Thanks!

Addison
  • It would be easier to help if you asked one distinct question at a time. Seeing a trimmed-down sample of actual XML instead of just "the structure" would also help us understand. Apparently, your "naive approach" does not cause severe memory problems even though the file is large. Is that true? – mzjn Oct 28 '19 at 07:50
  • Can you please add a valid XML snippet to the question? – balderman Oct 28 '19 at 08:34

1 Answer


There's a new release of Saxon/C out last week with a Python language binding, incorporating XSLT 3.0 streaming capability: it's very new software but you could give it a try (with a Saxon-EE evaluation license available from saxonica.com). The stylesheet is very simple:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0">

<xsl:mode streamable="yes"/>

<xsl:template match="/">
<SMSES>
   <xsl:copy-of select="SMSES/MMS[@address='specific value']"/>
</SMSES>
</xsl:template>

</xsl:transform>

Unfortunately you've abstracted your XML so I can't tell whether "address" is actually an element or an attribute, and it makes a considerable difference when streaming. I've assumed here that it's an attribute, but if you provide a real XML sample then I can help you produce some real working XSLT code.

You could equally well run this directly from the command line using the established Saxon/Java product if there's no real constraint that it has to be run from Python. But either way, streaming requires the enterprise edition of Saxon.

Michael Kay