I am trying to use Python to parse an XML file. I would like to extract the text that occurs between specified XML tags.

The code I am running is


import xml.etree.ElementTree as ET
tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.iter('w'):
    print(w.text)

The XML file is as follows. It is a complex file with quite a loose structure, combining elements of sequence and hierarchy (I have simplified it for the purposes of this question), but it clearly contains a "w" tag, which the code should be picking up and is not.

Thanks.

<?xml version="1.0" encoding="UTF-8"?>

<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://www.talkbank.org/ns/talkbank"
      xsi:schemaLocation="http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd"
      Media="020012" Mediatypes="audio"
            DesignType="long"
            ActivityType="toyplay"
            GroupType="TD"
      PID="11312/c-00018213-1"
      Version="2.20.0"
      Lang="eng"
      Options="bullets"
      Corpus="xxxx"
      Date="xxxx-xx-xx"
      >
  <Participants>
    <participant
      id="MOT"
    name="Mother"
      role="Mother"
      language="eng"
      sex="female"
    />
  </Participants>
  <comment type="Date">15-APR-1999</comment>
  <u who="INV" uID="u0">
    <w untranscribed="untranscribed">www</w>
    <t type="p"></t>
    <media
      start="7.639"
      end="9.648"
      unit="s"
    />
    <a type="addressee">MOT</a>
  </u>
  <u who="MOT" uID="u1">
    <w untranscribed="untranscribed">www</w>
    <t type="p"></t>
    <media
      start="7.640"
      end="9.455"
      unit="s"
    />
    <a type="addressee">INV</a>
  </u>
  <u who="CHI" uID="u2">
    <w untranscribed="unintelligible">xxx</w>
    <w formType="family-specific">choo_choos<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw><gra type="gra" index="1" head="0" relation="INCROOT"/></mor></w>
    <t type="p"><mor type="mor"><mt type="p"/><gra type="gra" index="2" head="1" relation="PUNCT"/></mor></t>
    <postcode>I</postcode>
    <media
      start="10.987"
      end="12.973"
      unit="s"
    />
    <a type="comments">looking at pictures of trains</a>
  </u>

  </CHAT>

Nick Riches
  • It's a common gotcha. There are many, many similar questions. The document declares a default namespace: `http://www.talkbank.org/ns/talkbank`. Therefore you need to search for `w` elements using `root.iter('{http://www.talkbank.org/ns/talkbank}w')`. – mzjn Jul 31 '23 at 15:42
  • Yes, I saw some discussions of namespaces, but I found it difficult to wrap my head around it. – Nick Riches Jul 31 '23 at 16:03

3 Answers

I think you have to prepend the namespace:

for w in root.iter("{http://www.talkbank.org/ns/talkbank}w"):
    print(w.text)

You might want to check out this question for more on similar problems with namespaces.
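
To see why the bare tag name matches nothing, it can help to print the tag that ElementTree actually stores. This is only a minimal sketch: `SAMPLE` is a cut-down, illustrative stand-in for the file in the question, not the real data.

```python
import xml.etree.ElementTree as ET

# Illustrative, cut-down stand-in for the question's file (an assumption).
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="INV" uID="u0">
    <w untranscribed="untranscribed">www</w>
  </u>
</CHAT>"""

root = ET.fromstring(SAMPLE)

# ElementTree expands the default namespace into every element's tag,
# so the stored name is '{namespace-uri}localname', not just 'CHAT' or 'w'.
print(root.tag)

# A bare 'w' therefore finds nothing; the qualified name finds the element.
bare = [w.text for w in root.iter("w")]
qualified = [w.text for w in root.iter("{http://www.talkbank.org/ns/talkbank}w")]
print(bare, qualified)
```

The same check (`print(root.tag)`) on the real file is a quick way to confirm which namespace URI needs to go inside the braces.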

Runinho

You can also define the namespace for further usage and use iterfind:

NS = { 'ww' : 'http://www.talkbank.org/ns/talkbank' }
for w in root.iterfind('.//ww:w',NS):
    print(w.text)

Result would be

www
www
xxx
choo_choos
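
The prefix in the mapping (`ww` here) is arbitrary; only the URI has to match the document. The same mapping can be reused in longer paths. As a hedged sketch, assuming you only want the words of one speaker, an attribute predicate on `u` narrows the search (`SAMPLE` is an illustrative, shortened copy of the question's data):

```python
import xml.etree.ElementTree as ET

# Illustrative, shortened stand-in for the question's file (an assumption).
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="MOT" uID="u1"><w untranscribed="untranscribed">www</w></u>
  <u who="CHI" uID="u2">
    <w untranscribed="unintelligible">xxx</w>
    <w formType="family-specific">choo_choos<mor type="mor"><mw><stem>choo_choos</stem></mw></mor></w>
  </u>
</CHAT>"""

# Any prefix works as long as it maps to the document's namespace URI.
NS = {"ww": "http://www.talkbank.org/ns/talkbank"}

root = ET.fromstring(SAMPLE)

# Select only the <w> children of utterances whose 'who' attribute is 'CHI'.
chi_words = [w.text for w in root.iterfind(".//ww:u[@who='CHI']/ww:w", NS)]
print(chi_words)
```

Note that `w.text` returns only the text before the first child element, which is why the nested `mor` content does not appear.
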
zx485

Your XML uses a namespace, and some `w` elements contain further nested elements. I changed your code a little:

import xml.etree.ElementTree as ET

tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.findall('.//{*}w'):
    print("".join(w.itertext()))

Output:

www
www
xxx
choo_choosfamchoo_choos
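
The last line differs from the other answers because `itertext()` also collects the text of nested elements, whereas `.text` stops at the first child. A small sketch of the difference, using an illustrative one-element sample rather than the real file; note that the `{*}` wildcard in paths needs Python 3.8 or later:

```python
import xml.etree.ElementTree as ET

# Illustrative one-utterance stand-in for the question's file (an assumption).
SAMPLE = (
    '<CHAT xmlns="http://www.talkbank.org/ns/talkbank"><u who="CHI" uID="u2">'
    '<w formType="family-specific">choo_choos<mor type="mor"><mw>'
    '<pos><c>fam</c></pos><stem>choo_choos</stem></mw></mor></w>'
    '</u></CHAT>'
)

root = ET.fromstring(SAMPLE)

# '{*}w' matches <w> in any namespace (supported from Python 3.8).
w = root.find(".//{*}w")

own_text = w.text                 # text before the first child element only
all_text = "".join(w.itertext())  # own text plus text of all descendants
print(own_text)
print(all_text)
```

If only the transcribed word is wanted, `w.text` is probably the better choice; `itertext()` is useful when the nested annotations should be included.
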
Hermann12
  • Cheers. That's neat. Still trying to understand xquery. I think the asterisk overcomes the namespace issue? – Nick Riches Aug 01 '23 at 07:02
  • @NickRiches: This is not about XQuery; it is about XPath (specifically the flavour of XPath supported by ElementTree: https://docs.python.org/3/library/xml.etree.elementtree.html#elementtree-xpath) – mzjn Aug 02 '23 at 16:25