
I have the following xml output:

<?xml version='1.0' encoding='ISO-8859-1'?>
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<images>
  <image file='VideoExtract/testset/10224.jpg'>
    <box top='436' left='266' width='106' height='61'>
      <label>1</label>
    </box>
  </image>
  <image file='VideoExtract/testset/1044.jpg'>
    <box top='507' left='330' width='52' height='27'>
      <label>2</label>
    </box>
  </image>
  <image file='VideoExtract/testset/10675.jpg'>
  </image>
</images>
</dataset>

From this, I want to delete all nodes that don't have any child nodes. For example, the third image node within images has no child node. How can I delete such nodes? The desired output would be:

<?xml version='1.0' encoding='ISO-8859-1'?>
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<images>
  <image file='VideoExtract/testset/10224.jpg'>
    <box top='436' left='266' width='106' height='61'>
      <label>1</label>
    </box>
  </image>
  <image file='VideoExtract/testset/1044.jpg'>
    <box top='507' left='330' width='52' height='27'>
      <label>2</label>
    </box>
  </image>
</images>
</dataset>

I have tried the following, but it doesn't help.

from lxml import etree as ET
root = ET.parse('testxml.xml')
for child in root.iterfind('targetElement'):
    if(len(child.attrib) < 1 and len(child) < 1):
        child.getparent().remove(child)
Apricot

2 Answers

Since you use the lxml module, consider XSLT, the special-purpose language designed to transform XML files. With this approach, no for loops or if logic is required.

In fact, your XML already references an XSLT stylesheet in its processing instruction, so you might be able to add the script below to that stylesheet. The script runs the identity transform, which copies everything as is, plus an empty template that matches any <image> element with no child elements. Because the template is empty, matching nodes are dropped from the output.

XSLT (save as .xsl file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="image[count(*)=0]"/>

</xsl:stylesheet>

Python

import lxml.etree as et

doc = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')

transform = et.XSLT(xsl)    
result = transform(doc)

# OUTPUT TO SCREEN
print(result)

# OUTPUT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Output

<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?><dataset>
  <images>
    <image file="VideoExtract/testset/10224.jpg">
      <box top="436" left="266" width="106" height="61">
        <label>1</label>
      </box>
    </image>
    <image file="VideoExtract/testset/1044.jpg">
      <box top="507" left="330" width="52" height="27">
        <label>2</label>
      </box>
    </image>
  </images>
</dataset>
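
For a quick check without separate files, the same transform can be tried from in-memory strings. This is only a sketch: the shortened input and the file names a.jpg and b.jpg are made up for illustration, and the stylesheet is the one shown above.

```python
import lxml.etree as et

# Hypothetical, shortened input with the same structure as the question's XML.
xml_src = b"""<dataset>
<images>
  <image file='a.jpg'>
    <box top='436' left='266' width='106' height='61'>
      <label>1</label>
    </box>
  </image>
  <image file='b.jpg'>
  </image>
</images>
</dataset>"""

# Identity transform plus an empty template for <image> elements without children.
xsl_src = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="image[count(*)=0]"/>
</xsl:stylesheet>"""

doc = et.fromstring(xml_src)
transform = et.XSLT(et.fromstring(xsl_src))
result = transform(doc)

# Collect the file attribute of every <image> that survived the transform.
remaining = [img.get('file') for img in result.getroot().iter('image')]
print(remaining)
```

Only the image element that has a box child should remain, and its label text is left untouched.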
Parfait

This code might do exactly what you have asked for in your question. I doubt it's exactly what you want.

>>> from lxml import etree
>>> tree = etree.parse('testxml.xml')
>>> for el in tree.iter():
...     el.tag, len(list(el.iterchildren()))
...     if not len(list(el.iterchildren())):
...         parent = el.getparent()
...         if parent is not None:
...             parent.remove(el)
...             
('dataset', 1)
('images', 3)
('image', 1)
('box', 1)
('label', 0)
('image', 1)
('box', 1)
('label', 0)
('image', 0)
>>> tree.write('temp.xml', pretty_print=True)

Here's the resulting xml file.

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<images>
  <image file="VideoExtract/testset/10224.jpg">
    <box top="436" left="266" width="106" height="61">
      </box>
  </image>
  <image file="VideoExtract/testset/1044.jpg">
    <box top="507" left="330" width="52" height="27">
      </box>
  </image>
  </images>
</dataset>

Note that the label elements contain no child elements (although they do contain text!), so they are missing from the output. Is this what you really want?

In contrast, this version of the code preserves the label elements.

>>> tree = etree.parse('testxml.xml')
>>> for el in list(tree.iter()):  # snapshot first, so removals cannot disturb the iteration
...     has_children = len(el) > 0
...     has_text = bool(''.join(el.itertext()).strip())
...     if not (has_children or has_text):
...         parent = el.getparent()
...         if parent is not None:
...             parent.remove(el)

Here's the resulting file in this case.

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<images>
  <image file="VideoExtract/testset/10224.jpg">
    <box top="436" left="266" width="106" height="61">
      <label>1</label>
    </box>
  </image>
  <image file="VideoExtract/testset/1044.jpg">
    <box top="507" left="330" width="52" height="27">
      <label>2</label>
    </box>
  </image>
  </images>
</dataset>
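
If only the empty image elements need to go, a narrower option might be to select them with XPath instead of testing every element. The following is a sketch under that assumption; the inline input and the file names a.jpg and b.jpg are hypothetical.

```python
from lxml import etree

# Hypothetical, shortened input with the same structure as the question's XML.
xml_src = b"""<dataset>
<images>
  <image file='a.jpg'>
    <box top='436' left='266' width='106' height='61'>
      <label>1</label>
    </box>
  </image>
  <image file='b.jpg'>
  </image>
</images>
</dataset>"""

root = etree.fromstring(xml_src)

# not(*) holds for elements with no child elements; text and attributes are ignored.
# xpath() returns a plain list, so removing elements inside the loop is safe.
for el in root.xpath('//image[not(*)]'):
    el.getparent().remove(el)

files = [img.get('file') for img in root.iter('image')]
print(files)
```

Because the expression names image explicitly, the label elements are never candidates for removal, so no text check is needed.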
Bill Bell