
I am trying to parse XML where the URI for the same namespace does not always use the same case (some XML owners decided to lower-case their URIs). If I parse data with one form of the URI followed by data with the other form, the parser fails to find my data, even though I update the ns dictionary to match the document's URI. Here is an example:

from cStringIO import StringIO
import xml.etree.ElementTree as ET

DATA_lc = '''<?xml version="1.0" encoding="utf-8"?>
<container xmlns:roktatar="http://www.example.com/lower/case/bug">
<item>
   <roktatar:author>Boby Mac Gallinger</roktatar:author>
</item>
</container>'''

DATA_UC = '''<?xml version="1.0" encoding="utf-8"?>
<container xmlns:roktatar="http://www.example.com/Lower/Case/Bug">
<item>
   <roktatar:author>John-John Le Grandiosant</roktatar:author>
</item>
</container>'''

tree = ET.parse(StringIO(DATA_lc))
root = tree.getroot()
ns = {'roktatar': 'http://www.example.com/lower/case/bug'}
for item in root.iter('item'):
    print item.find('roktatar:author', namespaces=ns).text.strip()

tree = ET.parse(StringIO(DATA_UC))
root = tree.getroot()
ns = {'roktatar': 'http://www.example.com/Lower/Case/Bug'}
for item in root.iter('item'):
    print item.find('roktatar:author', namespaces=ns).text.strip()

If each parsing block is run on its own, the data is collected properly, but if one follows the other, the second always fails. Am I missing some reset or cleanup of the parser between documents? Is this a bug?

Thanks

Fafaman
  • Paths in URLs are **not** case insensitive. You actually have **different** namespaces here. See [Should url be case sensitive?](http://stackoverflow.com/q/7996919) – Martijn Pieters Jul 21 '14 at 18:35
  • I agree with you @Martijn, the question should be narrowed to: why does ElementTree not support changing the URI for the same namespace prefix? – Fafaman Jul 21 '14 at 18:49
  • Ah, I see what you mean here; there is a cache that needs clearing. I'll write an answer. – Martijn Pieters Jul 21 '14 at 18:56
  • As written, it sounds like you're asking a question about the design of `ElementTree`, which was answered in Martijn's first comment. If you're actually asking how to change URLs for the same namespace, ask that. – abarnert Jul 21 '14 at 19:29

1 Answer

The ElementTree search code parses arguments to find() and related functions for XPath expressions, and caches the resulting closed-over functions for reuse.

When you search for roktatar:author, that expression is cached as a search for '{http://www.example.com/lower/case/bug}author'. In your second document the prefix is bound to a different URI, but the cached search for the first URI is reused, so nothing matches.

In other words, ElementTree assumes that the same namespace prefix will always map to the same namespace URI.
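
To illustrate the expansion described above: ElementTree stores each namespaced element under a fully qualified tag of the form {uri}local, so the two documents produce different tags for what looks like the same roktatar:author element. Here is a minimal sketch, assuming the two sample documents from the question (inlined without the XML declaration); DOC_LC, DOC_UC and qualified_tags are names made up for this example:

```python
import xml.etree.ElementTree as ET

# Sample documents from the question, differing only in the case of the URI.
DOC_LC = (
    '<container xmlns:roktatar="http://www.example.com/lower/case/bug">'
    '<item><roktatar:author>Boby Mac Gallinger</roktatar:author></item>'
    '</container>'
)
DOC_UC = (
    '<container xmlns:roktatar="http://www.example.com/Lower/Case/Bug">'
    '<item><roktatar:author>John-John Le Grandiosant</roktatar:author></item>'
    '</container>'
)


def qualified_tags(xml_text):
    """Return the tags of the children of every <item>, as ElementTree stores them."""
    root = ET.fromstring(xml_text)
    return [child.tag for item in root.iter('item') for child in item]


print(qualified_tags(DOC_LC))  # ['{http://www.example.com/lower/case/bug}author']
print(qualified_tags(DOC_UC))  # ['{http://www.example.com/Lower/Case/Bug}author']
```

The prefix itself never appears in the parsed tree; only the URI does, which is why a search compiled against one URI cannot match elements from the other document.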

The better solution to this problem is to use a different prefix here, like roktatar_uc for the title-case version of the URL:

ns = {'roktatar_uc': 'http://www.example.com/Lower/Case/Bug'}
for item in root.iter('item'):
    print item.find('roktatar_uc:author', namespaces=ns).text.strip()

but if that is not an option, you'll have to clear the cache instead:

from xml.etree import ElementPath

ElementPath._cache.clear()
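
If clearing the cache is the route taken, it could be wrapped in a small helper so that it happens before every document. This is only a sketch: _cache is a private, undocumented attribute of ElementPath and may change between Python versions, and find_authors with its parameters is a hypothetical helper, not part of ElementTree. The sample documents are those from the question, inlined without the XML declaration.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementPath

DOC_LC = (
    '<container xmlns:roktatar="http://www.example.com/lower/case/bug">'
    '<item><roktatar:author>Boby Mac Gallinger</roktatar:author></item>'
    '</container>'
)
DOC_UC = (
    '<container xmlns:roktatar="http://www.example.com/Lower/Case/Bug">'
    '<item><roktatar:author>John-John Le Grandiosant</roktatar:author></item>'
    '</container>'
)


def find_authors(xml_text, ns_uri, prefix='roktatar'):
    """Return the author names of one document, resetting the path cache first."""
    # Private implementation detail: this only discards compiled path selectors.
    ElementPath._cache.clear()
    root = ET.fromstring(xml_text)
    ns = {prefix: ns_uri}
    authors = []
    for item in root.iter('item'):
        # findtext returns None when the element is absent, so no AttributeError.
        text = item.findtext(prefix + ':author', namespaces=ns)
        if text is not None:
            authors.append(text.strip())
    return authors


print(find_authors(DOC_LC, 'http://www.example.com/lower/case/bug'))
print(find_authors(DOC_UC, 'http://www.example.com/Lower/Case/Bug'))
```

Clearing the cache costs only the recompilation of the path expressions used afterwards, so doing it once per document should be cheap for searches like these.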
Martijn Pieters
  • I have to use the _cache.clear() method: I do not have control over the XMLs. Thanks, I guess this is documented in ElementPath but I was far from looking there! – Fafaman Jul 21 '14 at 19:14
  • @Fafaman: You may not be able to alter the namespaces in the XML, but you *can* alter your `namespaces` dictionary and prefix. – Martijn Pieters Jul 21 '14 at 19:16