How to find and remove elements in an XML file (with namespaces) by condition with Python

Question

I have an XML file that I want to remove elements from based on conditions. However, the file uses namespaces, and for some reason that prevents the approaches described in other answers (1, 2, 3, 4 and 5) from working.

My XML looks like this:

    <?xml version='1.0' encoding='UTF-8'?>
        <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
            <Page imageFilename="1.png">
                <TextRegion custom="a">
                    <TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
                        <TextEquiv>
                            <Unicode> abc </Unicode>
                        </TextEquiv>
                    </TextLine>
                    <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
                        <TextEquiv>
                            <Unicode />
                        </TextEquiv>
                </TextRegion>
            </Page>
        </PcGts>

My goal is to remove every TextLine node whose "Unicode" tag contains no text. The output should be:

    <?xml version='1.0' encoding='UTF-8'?>
        <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
            <Page imageFilename="1.png">
                <TextRegion custom="a">
                    <TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
                        <TextEquiv>
                            <Unicode> abc </Unicode>
                        </TextEquiv>
                    </TextLine>
                </TextRegion>
            </Page>
        </PcGts>

I tried some of the suggestions in the links above, but this:

import lxml.etree as ET
data = ET.parse(file)
root = data.getroot()
for x in root.xpath("//Unicode"):
    print(x.text)

finds no tags at all. Another attempt:

for x in root.xpath("//{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Unicode"):
    print(x.text)

throws "XPathEvalError: Invalid expression".

So, what is the simplest way to remove all nodes whose Unicode tag is empty from this XML file (and how can I find them in the first place)?

Thanks.

Jack Fleeting · Answer 1 · 2019-10-20T12:59:49.310

First, your XML is missing a closing tag for <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">, but if you insert it in the appropriate place, the following should get you there:

my_xml = """[your xml above, corrected]"""
data = ET.XML(my_xml.encode('ascii'))
for target in data.xpath("//*[local-name() = 'Unicode'][not(text())]"):
    target.getparent().remove(target)     

print(etree.tostring(data,  xml_declaration=True))

Output:

    <?xml version='1.0' encoding='ASCII'?>
    <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
        <Page imageFilename="1.png">
            <TextRegion custom="a">
                <TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
                    <TextEquiv>
                        <Unicode> abc </Unicode>
                    </TextEquiv>
                </TextLine>
                <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
                    <TextEquiv/>
                </TextLine>
            </TextRegion>
        </Page>
    </PcGts>
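
As for why the two attempts in the question find nothing or fail: in XPath 1.0 an unprefixed name such as Unicode refers to an element in no namespace, and lxml's xpath() does not accept the {uri}tag (Clark) notation. The sketch below shows two namespace-aware lookups as alternatives to local-name(). It assumes lxml is installed; the SAMPLE string is a trimmed-down, well-formed stand-in for the question's document, and the "pc" prefix is an arbitrary name chosen here for illustration.

```python
import lxml.etree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# Trimmed-down, well-formed stand-in for the document in the question (assumption).
SAMPLE = (
    f'<PcGts xmlns="{NS}"><Page imageFilename="1.png"><TextRegion custom="a">'
    '<TextLine id="Ar0010001l1"><TextEquiv><Unicode> abc </Unicode></TextEquiv></TextLine>'
    '<TextLine id="Ad0010100l2"><TextEquiv><Unicode /></TextEquiv></TextLine>'
    '</TextRegion></Page></PcGts>'
)
root = ET.fromstring(SAMPLE)

# 1. An unprefixed name in XPath 1.0 means "no namespace", so nothing matches.
no_ns = root.xpath("//Unicode")

# 2. Bind a prefix of your choice to the namespace URI and use it in the expression.
with_prefix = root.xpath("//pc:Unicode", namespaces={"pc": NS})

# 3. Clark notation ({uri}tag) is understood by iter()/find()/findall(), not by xpath().
with_clark = list(root.iter(f"{{{NS}}}Unicode"))

print(len(no_ns), len(with_prefix), len(with_clark))  # prints: 0 2 2
```

The prefix only has to match between the expression and the namespaces mapping; it does not need to appear in the XML file itself.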

Hi, thanks for your answer. But - note that your output is not the desired result for me. I want to remove the node in XML at a higher point in the tree - starting from the "TextLine". What should be done to remove the entire cluster? By the way, "etree" in the last line should be "ET", and the desired encoding is: "UTF-8". thanks again @jack. — Yanirmr, Oct 22 '19 at 05:24

score 0 · Accepted Answer · answered Oct 24 '19 at 05:46

Well, I finally found a solution to the problem.

import lxml.etree as ET

my_xml = """...xml content..."""
data = ET.XML(my_xml.encode('UTF-8'))

# This loop removes the empty "<Unicode />" tags.
for target in data.xpath("//*[local-name() = 'Unicode'][not(text())]"):
    target.getparent().remove(target)

# This loop removes nodes that have neither children nor text, such as the
# "<TextEquiv>" left behind once its "<Unicode />" child is gone.
# list() takes a snapshot so the tree is not modified while being iterated.
for el in list(data.iter()):
    has_children = len(list(el.iterchildren())) > 0
    has_text = bool(''.join(t.strip() for t in el.itertext()))
    parent = el.getparent()
    if not has_children and not has_text and parent is not None:
        parent.remove(el)

# The same pass again: this time it removes the now-empty "<TextLine>" tag.
for el in list(data.iter()):
    has_children = len(list(el.iterchildren())) > 0
    has_text = bool(''.join(t.strip() for t in el.itertext()))
    parent = el.getparent()
    if not has_children and not has_text and parent is not None:
        parent.remove(el)

# Serialize as UTF-8 (the encoding requested above) and decode for printing.
print(ET.tostring(data, xml_declaration=True, encoding='UTF-8').decode('UTF-8'))

The idea came from the question "Remove xml nodes without child nodes using python".
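
A possibly shorter route is to select the TextLine elements directly with one XPath expression and remove them in a single pass. This is only a sketch under stated assumptions, not a tested replacement for the code above: it assumes lxml, a well-formed input, and that a TextLine should go whenever none of its Unicode descendants holds non-whitespace text (which also covers a TextLine with no Unicode element at all). The SAMPLE string, the "pc" prefix, and the helper name remove_empty_text_lines are made up for illustration.

```python
import lxml.etree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Trimmed-down, well-formed stand-in for the document in the question (assumption).
SAMPLE = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">'
    '<Page imageFilename="1.png"><TextRegion custom="a">'
    '<TextLine id="Ar0010001l1"><TextEquiv><Unicode> abc </Unicode></TextEquiv></TextLine>'
    '<TextLine id="Ad0010100l2"><TextEquiv><Unicode /></TextEquiv></TextLine>'
    '</TextRegion></Page></PcGts>'
)


def remove_empty_text_lines(root):
    """Remove each TextLine with no Unicode descendant holding non-whitespace text."""
    empty_lines = root.xpath(
        "//pc:TextLine[not(.//pc:Unicode[normalize-space()])]", namespaces=NS
    )
    # xpath() returns a plain list, so removing while looping over it is safe.
    for line in empty_lines:
        line.getparent().remove(line)
    return len(empty_lines)


root = ET.fromstring(SAMPLE)
removed = remove_empty_text_lines(root)
remaining_ids = [line.get("id") for line in root.xpath("//pc:TextLine", namespaces=NS)]
print(removed, remaining_ids)  # prints: 1 ['Ar0010001l1']

# Serialize with a UTF-8 declaration, as requested in the comments.
print(ET.tostring(root, xml_declaration=True, encoding="UTF-8").decode("UTF-8"))
```

Because the predicate uses normalize-space(), a Unicode element that contains only whitespace is treated as empty, which differs slightly from the not(text()) test used above.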
