I am trying to get all the "points" attribute values from the "Coords" tags nested inside "TextRegion" tags, but I keep getting errors. Note: both "TextRegion" and "ImageRegion" tags contain "Coords". However, I only want the Coords points from "TextRegion".

Please help! Thank you!!

Here is my xml file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
    <Metadata>
        <Creator/>
        <Created>2021-01-24T17:11:35</Created>
        <LastChange>1969-12-31T19:00:00</LastChange>
        <Comments/>
    </Metadata>
    <Page imageFilename="0004.png" imageHeight="3655" imageWidth="2493">
        <TextRegion id="r1" type="paragraph">
            <Coords points="1653,146 1651,148"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
        <TextRegion id="r2" type="paragraph">
            <Coords points="2071,326 2069,328 2058,328 2055"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
        <ImageRegion id="r3">
            <Coords points="443,621 443,2802 2302,2802 2302,621"/>
        </ImageRegion>
        <TextRegion id="r4" type="paragraph">
            <Coords points="2247,2825 2247,2857 2266,2857 2268,2860 2268"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
        <TextRegion id="r5" type="paragraph">
            <Coords points="731,2828 731,2839 728,2841"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
    </Page>
</PcGts>

Here is my code:

from lxml import etree as ET

tree = ET.parse('0004.xml')
root = tree.getroot()
print(root.tag)

for tag in root.find_all('Page/TextRegion/Coords'):
    value = tag.get('points')
    print(value)
pickle san
  • Your XML is not well-formed. The opening root does not have a closing bracket `>`. This will raise an error on `parse`. – Parfait Jan 26 '21 at 17:09

1 Answer

Assuming the missing closing bracket on the root element's opening tag is just a copy/paste issue in your post, your other main issue is the classic XML parsing problem of nodes under a default namespace. A default namespace is declared with an xmlns attribute that has no colon-separated prefix (as opposed to a prefixed declaration like xmlns:doc="..."), and it applies to the root and every unprefixed element beneath it.

As a result, you need to define a temporary namespace prefix in Python and use it on each named element in the path, which you can do with a dictionary passed into findall (not find_all, which is not an lxml method).

from lxml import etree as ET

tree = ET.parse('0004.xml')
nsmp = {'doc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

root = tree.getroot()
print(root.tag)

# SPECIFY NAMESPACE AND PREFIX ALL NAMED ELEMENTS
for tag in root.findall('doc:Page/doc:TextRegion/doc:Coords', namespaces=nsmp):
    value = tag.get('points')
    print(value)

# 1653,146 1651,148
# 2071,326 2069,328 2058,328 2055
# 2247,2825 2247,2857 2266,2857 2268,2860 2268
# 731,2828 731,2839 728,2841
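
To see why the unprefixed path finds nothing, here is a minimal sketch. It assumes a trimmed-down, well-formed stand-in for the posted file (the `xml_text` string and the `NS_URI` name are my own, only for illustration) and uses Python's built-in `xml.etree.ElementTree` so it runs without installing anything:

```python
import xml.etree.ElementTree as ET

NS_URI = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical, shortened sample modeled on the posted XML (root tag closed properly).
xml_text = f"""<PcGts xmlns="{NS_URI}">
    <Page imageFilename="0004.png">
        <TextRegion id="r1"><Coords points="1653,146 1651,148"/></TextRegion>
        <ImageRegion id="r3"><Coords points="443,621 443,2802"/></ImageRegion>
        <TextRegion id="r5"><Coords points="731,2828 731,2839 728,2841"/></TextRegion>
    </Page>
</PcGts>"""

root = ET.fromstring(xml_text)

# The default namespace is folded into every tag name as {uri}localname.
root_tag = root.tag

# An unprefixed path only matches elements in no namespace, so nothing is found.
unprefixed = root.findall("Page/TextRegion/Coords")

# Mapping a temporary prefix to the namespace URI makes the same path match.
nsmp = {"doc": NS_URI}
prefixed = [
    coords.get("points")
    for coords in root.findall("doc:Page/doc:TextRegion/doc:Coords", nsmp)
]

print(root_tag)    # {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts
print(unprefixed)  # []
print(prefixed)    # ['1653,146 1651,148', '731,2828 731,2839 728,2841']
```

The prefix name `doc` is arbitrary; only the URI it maps to has to match the one declared in the file.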

By the way, lxml is a feature-rich XML library (one that requires a third-party installation) which, among other powerful tools, supports full XPath 1.0. The above code also works with Python's built-in etree if you simply change the import line to from xml.etree import ElementTree as ET.

However, lxml extends the built-in library, for example by letting you select attribute values directly with its xpath method:

tree = ET.parse('0004.xml')

# SPECIFY NAMESPACE AND PREFIX ALL NAMED ELEMENTS
# (THE doc:TextRegion STEP KEEPS ImageRegion COORDS OUT OF THE RESULT)
for pts in tree.xpath('//doc:TextRegion/doc:Coords/@points', namespaces=nsmp):
    print(pts)

# 1653,146 1651,148
# 2071,326 2069,328 2058,328 2055
# 2247,2825 2247,2857 2266,2857 2268,2860 2268
# 731,2828 731,2839 728,2841
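
For comparison, the built-in `xml.etree.ElementTree` supports only a limited XPath subset and cannot return attribute values from a path, so the usual equivalent is to select the elements and read the attribute yourself. A rough sketch, again assuming a shortened, well-formed stand-in for the posted file (`xml_text` and `NS_URI` are illustrative names, not part of the original code):

```python
import xml.etree.ElementTree as ET

NS_URI = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
nsmp = {"doc": NS_URI}

# Hypothetical, shortened sample modeled on the posted XML.
xml_text = f"""<PcGts xmlns="{NS_URI}">
    <Page imageFilename="0004.png">
        <TextRegion id="r1"><Coords points="1653,146 1651,148"/></TextRegion>
        <ImageRegion id="r3"><Coords points="443,621 443,2802"/></ImageRegion>
        <TextRegion id="r5"><Coords points="731,2828 731,2839 728,2841"/></TextRegion>
    </Page>
</PcGts>"""

root = ET.fromstring(xml_text)

# ".//" searches at any depth below the root; requiring a doc:TextRegion
# parent skips the Coords that belongs to ImageRegion.
text_points = [
    coords.get("points")
    for coords in root.findall(".//doc:TextRegion/doc:Coords", nsmp)
]

for pts in text_points:
    print(pts)

# 1653,146 1651,148
# 731,2828 731,2839 728,2841
```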
Parfait