case insensitive xml and python

Question

I have this piece of code and I am trying to read the 'href' attribute of every 'ref' tag. I am not sure how to make the lookup case-insensitive, as some of my XML files use REF or Ref instead of ref. Any suggestions?

    f = urllib.urlopen(url)
    tree = ET.parse(f)
    root = tree.getroot()

    for child in root.iter('ref'):
      t = child.get('href')
      if t not in self.href:
        self.href.append(t)
        print self.href[-1]

Sounds like you need to fix up the XML as it is parsed, so that once it has been loaded into root the case is normalized. See for example nonagon's answer here for how to get at the tags as they are parsed; I'm sure you can figure out how to lowercase them: http://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma/33997423#33997423 — DisappointedByUnaccountableMod, Mar 01 '16 at 11:50
Also see answer here - not the most subtle of techniques but gets everything the same case - http://stackoverflow.com/questions/9440896/case-insensitive-findall-in-python-elementtree — DisappointedByUnaccountableMod, Mar 01 '16 at 11:53
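
A minimal sketch of what the comments above suggest: lowercasing names while the document is being parsed. It is written for Python 3 and uses an in-memory sample instead of a URL; the function name `parse_lowercased` and the sample document are made up for illustration, not taken from the linked answers.

```python
import io
import xml.etree.ElementTree as ET

def parse_lowercased(source):
    """Parse XML from a binary file-like object, lowercasing tag and attribute names."""
    it = ET.iterparse(source)  # yields ("end", element) pairs by default
    for _, elem in it:
        # Note: for namespaced tags ("{uri}Tag") this lowercases the URI part too.
        elem.tag = elem.tag.lower()
        # Copy the keys first so the dict can be modified while looping.
        for name in list(elem.attrib):
            elem.attrib[name.lower()] = elem.attrib.pop(name)
    return it.root

# Hypothetical sample standing in for the downloaded file.
sample = io.BytesIO(
    b"<Root><REF HREF='a.html'/><Ref Href='b.html'/><ref href='c.html'/></Root>"
)
root = parse_lowercased(sample)
hrefs = [ref.get("href") for ref in root.iter("ref")]
print(hrefs)  # ['a.html', 'b.html', 'c.html']
```

After this step the loop from the question can stay as it is, since every tag and attribute name in the tree is lowercase.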

Mini Fridge · Answer 1 · 2016-03-01T13:14:43.687

You can normalize tags and attributes by converting them to lowercase as a preprocessing step, using the functions below:

import urllib
import xml.etree.ElementTree as ET
f = urllib.urlopen(url)
tree = ET.parse(f)
root = tree.getroot()

def normalize_tags(root):
    root.tag = root.tag.lower()
    for child in root:
        normalize_tags(child)

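def normalize_attr(root):
    # Iterate over a copy: the dict is modified inside the loop,
    # which raises RuntimeError on Python 3 if done on the live view.
    for attr,value in list(root.attrib.items()):
        norm_attr = attr.lower()
        if norm_attr != attr:
            root.set(norm_attr,value)
            root.attrib.pop(attr)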

    for child in root:
        normalize_attr(child)


normalize_tags(root)    
normalize_attr(root)
print(ET.tostring(root))

thanks mate, it works fine for most of them but for some i get an error: xml.etree.ElementTree.ParseError: mismatched tag: line 4, column 5 — Adam, Mar 04 '16 at 16:36
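
One plausible cause of that error (an assumption, since the failing files are not shown): XML names are case-sensitive, so an element opened as `<Ref>` and closed as `</ref>` is not well-formed. In that case `ET.parse` fails before `normalize_tags` ever runs. The sketch below reproduces this with a made-up document, written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element is opened as <Ref> but closed as </ref>.
broken = "<root>\n<Ref href='a.html'>\ntext\n</ref>\n</root>"

try:
    ET.fromstring(broken)
    error = None
except ET.ParseError as exc:
    # The parser rejects the document; no tree is built at all.
    error = str(exc)

print(error)
```

If the files really look like this, normalizing after parsing cannot help; the text would have to be repaired before parsing, or read with a more lenient parser.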

score 0 · Answer 2 · answered Mar 01 '16 at 11:54

The following should help

f = urllib.urlopen(url)
tree = ET.parse(f)
root = tree.getroot()

# root.iter() walks the whole tree, like root.iter('ref') in the question
for child in root.iter():
  if child.tag.lower() == 'ref':
    t = child.get('href')
    if t not in self.href:
      self.href.append(t)
      print self.href[-1]
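
A self-contained sketch of the same idea for Python 3, extended so the attribute name is matched case-insensitively as well. The helper names `iter_tag_ci` and `get_attr_ci` and the sample document are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

def iter_tag_ci(root, tag):
    """Yield every element in the tree whose tag equals `tag`, ignoring case."""
    wanted = tag.lower()
    for elem in root.iter():
        # Guard against non-string tags (e.g. comments kept by some parsers).
        if isinstance(elem.tag, str) and elem.tag.lower() == wanted:
            yield elem

def get_attr_ci(elem, name, default=None):
    """Return the value of the first attribute whose name equals `name`, ignoring case."""
    wanted = name.lower()
    for key, value in elem.attrib.items():
        if key.lower() == wanted:
            return value
    return default

# Hypothetical sample with mixed-case tags and attributes.
root = ET.fromstring(
    "<Root><group><REF HREF='a.html'/></group>"
    "<Ref href='b.html'/><other href='x.html'/><ref/></Root>"
)

hrefs = []
for ref in iter_tag_ci(root, "ref"):
    href = get_attr_ci(ref, "href")
    if href is not None and href not in hrefs:
        hrefs.append(href)
print(hrefs)  # ['a.html', 'b.html']
```

This leaves the tree untouched, which may be preferable when the document has to be written back out with its original casing.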

score 0 · Answer 3 · edited May 23 '17 at 12:16

If you are using lxml, then one option is to use XPath with regular expressions through the EXSLT extensions (https://stackoverflow.com/a/2756994/2997179):

root.xpath("./*[re:test(local-name(), '(?i)href')]",
    namespaces={"re": "http://exslt.org/regular-expressions"})


answered Mar 01 '16 at 12:16

Martin Valgur
