How to parse XML with Python ElementTree when there is a colon in the namespace

Question

I have the same question as this one, but I struggle to get it working.

I want to get all the values of <itunes:subtitle>. Not to be confused with the self-closing tag <itunes:subtitle/>.

This is my XML data in sample.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>A title</title>
      <itunes:subtitle/>
      <itunes:subtitle>A subtitle</itunes:subtitle>
    </item>
    <item>
      <title><![CDATA[Another title]]></title>
      <itunes:subtitle/>
      <itunes:subtitle>Yet another subtitlen</itunes:subtitle>
    </item>
  </channel>
</rss>

import xml.etree.ElementTree as ET

with open('sample.xml', 'r', encoding='utf8') as f:
    tree = ET.parse(f)
    root = tree.getroot()

for xml_item in root.iter('item'):
    namespaces = {'itunes': 'subtitle'}
    print(root.findall('itunes:subtitle', namespaces))

However, this returns empty lists.

[]
[]

I could not find any meaningful help in the other 9-year-old question or elsewhere on Stack Overflow. Please help me out.

score 1 · Answer 1 · answered Oct 22 '22 at 15:42

First, look at the namespaces declared in the <rss> element:

<rss version="2.0"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">

The alias itunes refers to the namespace http://www.itunes.com/dtds/podcast-1.0.dtd, and the alias content refers to the namespace http://purl.org/rss/1.0/modules/content/.

In your Python code, the namespaces dictionary needs to reflect those mappings, so:

namespaces = {'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'}
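As a side note, the dictionary key is only a local alias for the namespace URI, and it does not have to match the prefix used in the file. Here is a minimal sketch of that, using a trimmed-down copy of the sample document; the alias `pod` and the constant name `ITUNES_URI` are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample document, inlined so the snippet runs on its own.
XML = """<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item>
      <itunes:subtitle>A subtitle</itunes:subtitle>
    </item>
  </channel>
</rss>"""

ITUNES_URI = 'http://www.itunes.com/dtds/podcast-1.0.dtd'  # hypothetical constant name

root = ET.fromstring(XML)
item = root.find('channel/item')

# The key 'pod' is an arbitrary alias; only the URI has to match the document.
via_prefix = item.find('pod:subtitle', {'pod': ITUNES_URI})

# Same lookup with the fully qualified tag name, no dictionary needed.
via_uri = item.find('{' + ITUNES_URI + '}subtitle')

print(via_prefix.tag)
print(via_prefix is via_uri)
```

Both lookups resolve to the same element because ElementTree stores the tag internally as `{uri}localname`, which is also why the original `{'itunes': 'subtitle'}` mapping matched nothing.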

Second, from the documentation:

Element.findall() finds only elements with a tag which are direct children of the current element.

In your loop...

for xml_item in root.iter('item'):
    print(root.findall('itunes:subtitle', namespaces))

...your loop variable is xml_item, but you're calling findall on root -- and your <itunes:subtitle> elements are not direct children of the rss tag. You need:

for xml_item in root.iter('item'):
    print(xml_item.findall('itunes:subtitle', namespaces))

Giving us:

import xml.etree.ElementTree as ET

with open('sample.xml', 'r', encoding='utf8') as f:
    tree = ET.parse(f)
    root = tree.getroot()

namespaces = {'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'}

for xml_item in root.iter('item'):
    print(xml_item.findall('itunes:subtitle', namespaces))

Running the above code produces:

[<Element '{http://www.itunes.com/dtds/podcast-1.0.dtd}subtitle' at 0x7fb46ab48770>, <Element '{http://www.itunes.com/dtds/podcast-1.0.dtd}subtitle' at 0x7fb46ab487c0>]
[<Element '{http://www.itunes.com/dtds/podcast-1.0.dtd}subtitle' at 0x7fb46ab488b0>, <Element '{http://www.itunes.com/dtds/podcast-1.0.dtd}subtitle' at 0x7fb46ab48900>]
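If you do not need the per-item grouping, a path starting with `.//` makes `findall` search all descendants instead of only direct children, so it can be called on the root. A rough sketch, with the sample XML inlined as a string (an assumption made so the snippet is self-contained):

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample data so the snippet does not depend on sample.xml.
XML = """<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item>
      <title>A title</title>
      <itunes:subtitle/>
      <itunes:subtitle>A subtitle</itunes:subtitle>
    </item>
    <item>
      <title>Another title</title>
      <itunes:subtitle/>
      <itunes:subtitle>Yet another subtitlen</itunes:subtitle>
    </item>
  </channel>
</rss>"""

namespaces = {'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'}
root = ET.fromstring(XML)

# './/' means "any depth below the current element", so this works from the root.
all_subtitles = root.findall('.//itunes:subtitle', namespaces)

print(len(all_subtitles))
```

This returns one flat list covering both items, including the empty elements.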

This is a much more elaborate answer than what I could find. Thanks. Although you seem to have missed out on how to exclude the self-closing tags. Perhaps another question, that is. — Arete, Oct 22 '22 at 16:33
"Although you seem to have missed out on how to exclude the self-closing tags." No, I intentionally left that out. Don't think of them as "self-closing tags"; think of them as "empty tags". You can exclude them by checking to see if they have any content (e.g., check for `element.text`). — larsks, Oct 22 '22 at 16:47
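One possible way to apply the `element.text` check suggested in the comment above is sketched below. It assumes that "empty" means no text at all or only whitespace, and it inlines the sample XML so it runs without sample.xml:

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample data from the question.
XML = """<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item>
      <title>A title</title>
      <itunes:subtitle/>
      <itunes:subtitle>A subtitle</itunes:subtitle>
    </item>
    <item>
      <title>Another title</title>
      <itunes:subtitle/>
      <itunes:subtitle>Yet another subtitlen</itunes:subtitle>
    </item>
  </channel>
</rss>"""

namespaces = {'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'}
root = ET.fromstring(XML)

subtitles = [
    el.text.strip()
    for xml_item in root.iter('item')
    for el in xml_item.findall('itunes:subtitle', namespaces)
    # el.text is None for <itunes:subtitle/>; also skip whitespace-only text.
    if el.text and el.text.strip()
]

print(subtitles)
```

The parser does not distinguish `<itunes:subtitle/>` from `<itunes:subtitle></itunes:subtitle>`; both end up with `text` set to `None`, so the filter treats them the same way.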
