
I have an XML file from which I want to remove elements based on a condition. However, the file uses namespaces, and for some reason that prevents the procedures described in these answers from working for me: 1, 2, 3, 4 and 5.

My XML looks like this:

    <?xml version='1.0' encoding='UTF-8'?>
        <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
            <Page imageFilename="1.png">
                <TextRegion custom="a">
                    <TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
                        <TextEquiv>
                            <Unicode> abc </Unicode>
                        </TextEquiv>
                    </TextLine>
                    <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
                        <TextEquiv>
                            <Unicode />
                        </TextEquiv>
                </TextRegion>
            </Page>
        </PcGts>

My goal is to remove every TextLine node whose "Unicode" tag contains no text. The output should be:

    <?xml version='1.0' encoding='UTF-8'?>
        <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
            <Page imageFilename="1.png">
                <TextRegion custom="a">
                    <TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
                        <TextEquiv>
                            <Unicode> abc </Unicode>
                        </TextEquiv>
                    </TextLine>
                </TextRegion>
            </Page>
        </PcGts>

I tried some of the suggestions in the links above, but this:

import lxml.etree as ET
data = ET.parse(file)
root = data.getroot()
for x in root.xpath("//Unicode"):
    print(x.text)

didn't find any tags. Another attempt:

for x in root.xpath("//{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Unicode"):
   print(x.text)

raises "XPathEvalError: Invalid expression".

So, what is the simplest way to remove all nodes whose Unicode tag is empty from this XML file, and how can I find them in the first place?

Thanks.

Yanirmr

2 Answers


First, your XML is missing a closing tag for <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">, but if you insert it in the appropriate place, the following should get you there:

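import lxml.etree as ET

my_xml = """[your xml above, corrected]"""
data = ET.XML(my_xml.encode('ascii'))
# Match Unicode elements in any namespace that have no text node,
# then detach each one from its parent.
for target in data.xpath("//*[local-name() = 'Unicode'][not(text())]"):
    target.getparent().remove(target)

print(ET.tostring(data, xml_declaration=True))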

Output:

    <?xml version='1.0' encoding='ASCII'?>
<PcGts
    xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
    <Page imageFilename="1.png">
        <TextRegion custom="a">
            <TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
                <TextEquiv>
                    <Unicode> abc </Unicode>
                </TextEquiv>
            </TextLine>
            <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
                <TextEquiv/>
            </TextLine>
        </TextRegion>
    </Page>
</PcGts>  
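
As a side note, here is a minimal sketch of why the two attempts in the question find nothing or fail. The root element declares a default namespace, so every element in the file belongs to it. In XPath 1.0 an unprefixed name such as `Unicode` only matches elements in no namespace, and the `{uri}tag` (Clark) notation is not XPath syntax at all, which is why `xpath()` rejects it, while `iter()`, `find()` and `findall()` accept it. The sample below is a trimmed-down stand-in for the real file (the `custom` attributes are dropped), and the prefix `pc` is an arbitrary name chosen for illustration.

```python
import lxml.etree as ET

# Trimmed-down stand-in for the question's file, with the closing tags corrected.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
my_xml = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="1.png">
    <TextRegion custom="a">
      <TextLine id="Ar0010001l1"><TextEquiv><Unicode> abc </Unicode></TextEquiv></TextLine>
      <TextLine id="Ad0010100l2"><TextEquiv><Unicode /></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""
root = ET.XML(my_xml.encode("UTF-8"))

# An unprefixed name in XPath 1.0 means "no namespace", so nothing matches.
unprefixed = root.xpath("//Unicode")

# Binding a prefix ("pc" is an arbitrary choice) to the namespace URI works.
prefixed = root.xpath("//pc:Unicode", namespaces={"pc": NS})

# Clark notation ({uri}tag) is not valid XPath, but iter() accepts it.
clark = list(root.iter(f"{{{NS}}}Unicode"))

print(len(unprefixed), len(prefixed), len(clark))  # 0 2 2
```

The `local-name()` test used above sidesteps the namespace entirely, whereas the prefix mapping keeps the match tied to this specific schema. Which one fits better depends on whether other namespaces can appear in the file.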
Jack Fleeting
  • Hi, thanks for your answer. Note, however, that your output is not the result I want. I want to remove the node higher up in the tree, starting from "TextLine". What should be done to remove the entire cluster? By the way, "etree" in the last line should be "ET", and the desired encoding is "UTF-8". Thanks again @jack. – Yanirmr Oct 22 '19 at 05:24

Well, I finally found a solution to the problem.

import lxml.etree as ET
my_xml = """...xml content..."""
data = ET.XML(my_xml.encode('UTF-8'))

# This loop removes the empty "<Unicode />" elements.
for target in data.xpath("//*[local-name() = 'Unicode'][not(text())]"):
    target.getparent().remove(target)

# This loop removes elements that have neither children nor text, such as the
# "<TextEquiv>" that is left empty once its "<Unicode />" child is gone.
# list() takes a snapshot so the tree is not modified while it is being walked.
for el in list(data.iter()):
    if len(el) == 0 and not ''.join(t.strip() for t in el.itertext()):
        parent = el.getparent()
        if parent is not None:
            parent.remove(el)

# The same pass again: this time it removes the now-empty "<TextLine>" element.
for el in list(data.iter()):
    if len(el) == 0 and not ''.join(t.strip() for t in el.itertext()):
        parent = el.getparent()
        if parent is not None:
            parent.remove(el)

print(ET.tostring(data,  xml_declaration=True))

The idea came from "Remove xml nodes without child nodes using python".
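
For comparison, the same result can probably be reached in one pass by selecting the TextLine elements directly, rather than pruning from the bottom up. This is only a sketch under a few assumptions: the sample is a trimmed-down stand-in for the real file, the prefix `pc` is an arbitrary name, and `normalize-space()` is swapped in for `not(text())` so that a whitespace-only Unicode element also counts as empty. If that is not wanted, `not(text())` can be put back.

```python
import lxml.etree as ET

# Trimmed-down stand-in for the real file, with the closing tags corrected.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
my_xml = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="1.png">
    <TextRegion custom="a">
      <TextLine id="Ar0010001l1"><TextEquiv><Unicode> abc </Unicode></TextEquiv></TextLine>
      <TextLine id="Ad0010100l2"><TextEquiv><Unicode /></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""
root = ET.XML(my_xml.encode("UTF-8"))
ns = {"pc": NS}  # "pc" is an arbitrary prefix bound to the default namespace

# Select each TextLine whose own TextEquiv/Unicode is empty or whitespace-only.
empty_lines = root.xpath(
    "//pc:TextLine[pc:TextEquiv/pc:Unicode[not(normalize-space())]]",
    namespaces=ns,
)

# Detach the whole TextLine subtree from its parent.
for line in empty_lines:
    line.getparent().remove(line)

remaining_ids = [line.get("id") for line in root.xpath("//pc:TextLine", namespaces=ns)]
print(remaining_ids)  # ['Ar0010001l1']

# Passing encoding="UTF-8" gives the UTF-8 declaration asked for in the comments.
print(ET.tostring(root, xml_declaration=True, encoding="UTF-8").decode("UTF-8"))
```

The XPath result is a plain list, so removing elements inside the loop does not disturb the iteration.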

Yanirmr