I have an XML file from which I want to remove elements based on a condition. However, the file uses namespaces, and for reasons I do not understand, the approaches described in 1, 2, 3, 4 and 5 do not work on it.
My XML looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
<TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
<TextEquiv>
<Unicode />
</TextEquiv>
</TextLine>
</TextRegion>
</Page>
</PcGts>
My goal is to remove every TextLine node whose "Unicode" tag contains no text. The expected output is:
<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
</TextRegion>
</Page>
</PcGts>
I tried some of the suggestions from the links above, but this:
import lxml.etree as ET
data = ET.parse(file)
root = data.getroot()
for x in root.xpath("//Unicode"):
    print(x.text)
did not find any elements. Another attempt:
for x in root.xpath("//{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Unicode"):
    print(x.text)
raises "XPathEvalError: Invalid expression".
So, what is the simplest way to find the TextLine nodes whose Unicode tag is empty, and to remove them from this XML file?
Thanks.
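
For reference, here is a minimal sketch of one way this could be approached with lxml. It rests on two facts about the attempts above: a plain "//Unicode" only matches elements in no namespace, while every element here is in the default namespace declared on PcGts, and xpath() does not accept the "{uri}tag" notation (that form works with find(), findall() and iter()). The prefix "pc", the function name remove_empty_text_lines and the shortened SAMPLE document are illustrative choices, not part of the original file. Treating whitespace-only text as empty is also an assumption.

```python
import lxml.etree as ET

# Namespace URI copied from the PcGts root element.
# The prefix "pc" is arbitrary; it only has to match the XPath expressions below.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Shortened copy of the document from the question (hypothetical test input).
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
<TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
<TextEquiv>
<Unicode />
</TextEquiv>
</TextLine>
</TextRegion>
</Page>
</PcGts>
"""


def remove_empty_text_lines(root):
    """Remove each TextLine whose Unicode element has no non-whitespace text.

    Returns the number of TextLine elements removed.
    """
    removed = 0
    # xpath() returns a plain list, so removing elements while looping is safe.
    for line in root.xpath("//pc:TextLine", namespaces=NS):
        texts = line.xpath("pc:TextEquiv/pc:Unicode/text()", namespaces=NS)
        if not "".join(texts).strip():
            line.getparent().remove(line)
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = remove_empty_text_lines(root)
remaining = [line.get("id") for line in root.xpath("//pc:TextLine", namespaces=NS)]
print(removed, remaining)
```

With a file on disk, the same function could be applied to ET.parse(file).getroot(), and the tree written back with data.write(out_path, encoding="UTF-8", xml_declaration=True), where out_path is a placeholder for the destination.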