I am trying to get all the "points" attribute values from the "TextRegion --> Coords" tags, but I keep getting errors. Note: both "TextRegion" and "ImageRegion" tags contain "Coords", but I only want the Coords points from "TextRegion".
Please help! Thank you!!
Here is my XML file:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
<Metadata>
<Creator/>
<Created>2021-01-24T17:11:35</Created>
<LastChange>1969-12-31T19:00:00</LastChange>
<Comments/>
</Metadata>
<Page imageFilename="0004.png" imageHeight="3655" imageWidth="2493">
<TextRegion id="r1" type="paragraph">
<Coords points="1653,146 1651,148"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextRegion>
<TextRegion id="r2" type="paragraph">
<Coords points="2071,326 2069,328 2058,328 2055"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextRegion>
<ImageRegion id="r3">
<Coords points="443,621 443,2802 2302,2802 2302,621"/>
</ImageRegion>
<TextRegion id="r4" type="paragraph">
<Coords points="2247,2825 2247,2857 2266,2857 2268,2860 2268"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextRegion>
<TextRegion id="r5" type="paragraph">
<Coords points="731,2828 731,2839 728,2841"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextRegion>
</Page>
</PcGts>
Here is my code:
from lxml import etree as ET
tree = ET.parse('0004.xml')
root = tree.getroot()
print(root.tag)
for tag in root.find_all('Page/TextRegion/Coords'):
    value = tag.get('points')
    print(value)
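For comparison, here is a minimal sketch of one way the lookup could work. It rests on two observations: lxml elements provide findall() rather than find_all() (the latter is BeautifulSoup's method name), and the root element declares a default namespace, so every tag in the path has to be namespace-qualified. The prefix "pc", the helper name text_region_points, and the trimmed inline sample are assumptions made for illustration; the stdlib fallback import is only there so the sketch also runs without lxml installed.

```python
try:
    from lxml import etree as ET  # same import as in the question
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with a compatible findall()

# Namespace URI copied from the xmlns attribute on <PcGts>.
# The prefix "pc" is an arbitrary label chosen for this sketch.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Trimmed copy of the sample file, passed as bytes so that
# both lxml and the stdlib parser accept it.
SAMPLE = b"""<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0004.png" imageHeight="3655" imageWidth="2493">
    <TextRegion id="r1" type="paragraph">
      <Coords points="1653,146 1651,148"/>
    </TextRegion>
    <ImageRegion id="r3">
      <Coords points="443,621 443,2802 2302,2802 2302,621"/>
    </ImageRegion>
    <TextRegion id="r5" type="paragraph">
      <Coords points="731,2828 731,2839 728,2841"/>
    </TextRegion>
  </Page>
</PcGts>"""


def text_region_points(root):
    """Return the 'points' value of every Coords directly under a TextRegion."""
    # Each step of the path carries the prefix, so ImageRegion/Coords is never matched.
    coords = root.findall("pc:Page/pc:TextRegion/pc:Coords", NS)
    return [c.get("points") for c in coords]


root = ET.fromstring(SAMPLE)
points = text_region_points(root)
for value in points:
    print(value)
```

With the real file, the same helper should apply to ET.parse('0004.xml').getroot() instead of the inline sample, provided the file itself is well-formed.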