Structure of XML file is preventing me from reading it with python

Question

I'm setting up a Python script that asks for a list of input XML files, all in the same format, and reads a specific value from each one.

Everything else works as I want, but I get an error when reading the XML file because of the content of the file itself.

I got the script to work by editing the XML file, but that is not a solution for me because the script needs to run on thousands of files.

Here is the code I'm using:

import os
import tkinter as tk
from tkinter import filedialog
import xml.etree.ElementTree as ET


root = tk.Tk()
root.withdraw()

file_path = filedialog.askopenfilenames()

tup=0

count = len(file_path)

for i in range(len(file_path)):
    filename = os.path.basename(file_path[tup])
    print('file =',os.path.basename(' '.join(file_path)))
    tree = ET.parse(file_path[tup])
    root = tree.getroot()
    for child in root:
        data = child.tag
        print(data)
    for data in root.findall(data):
        name = data.find('subdata2').text
        print('ID =', name)
    tup +=1

And here is an example of the XML:

<?xml version="1.0"?>
<Data xmlns="link">
    <subdata1 id = "something">
        <subdata2>data
            <subdata3>data</subdata3>
        </subdata2>
    </subdata1>
</Data>

The problem comes from the namespace attached to the root element (xmlns="link"). It changes the tag of subdata1 from

subdata1

to

 {link}subdata1

and this then changes the output from:

ID = data

to:

Traceback (most recent call last):
  File "debug.py", line 25, in <module>
    name = data.find('subdata2').text
AttributeError: 'NoneType' object has no attribute 'text'

Is there another way of extracting the data from this XML file that doesn't involve modifying the file itself?

Matt M · Accepted Answer · 2019-05-10T11:30:01.837

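You can strip the namespaces from the parsed XML tree instead of editing the XML files themselves.

import xml.etree.ElementTree as ET

# file_path must be a single path here (one entry of the askopenfilenames() tuple)
tree = ET.iterparse(file_path)
for _, el in tree:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # drop the '{namespace}' prefix from the tag
root = tree.root  # only available once the iterator has been fully consumed
for child in root:
    # ... (REST OF CODE)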

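To make the stripping approach easier to try out, here is a minimal, self-contained sketch. It assumes the sample XML from the question, fed in from memory through io.BytesIO instead of a file on disk. The helper name parse_without_namespaces is made up for this illustration, and the .strip() call is an addition to drop the whitespace that follows "data" inside subdata2.

```python
import io
import xml.etree.ElementTree as ET

# The sample document from the question, held in memory for the demo.
SAMPLE = b"""<?xml version="1.0"?>
<Data xmlns="link">
    <subdata1 id = "something">
        <subdata2>data
            <subdata3>data</subdata3>
        </subdata2>
    </subdata1>
</Data>
"""


def parse_without_namespaces(source):
    """Parse XML from a path or file object and drop '{uri}' prefixes from tags.

    Hypothetical helper name, not part of ElementTree.
    """
    events = ET.iterparse(source)
    for _, el in events:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    # The root attribute is set once the iterator has been exhausted.
    return events.root


root = parse_without_namespaces(io.BytesIO(SAMPLE))

# After stripping, plain tag names work in find()/findall().
ids = [sub1.find('subdata2').text.strip() for sub1 in root.findall('subdata1')]
print('ID =', ids[0])
```

In the original script, each path in the tuple returned by askopenfilenames() could be passed to the helper in place of the BytesIO object.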

Another option, if you don't mind slower parsing but want maximum simplicity, is untangle. Since your XML files apparently all share the same structure, this might be convenient for you.

import untangle

root = untangle.parse(file_path)
print(root.Data.subdata1['id'])
print(root.Data.subdata1.subdata2.cdata)

I also forgot my favorite option: xmltodict, which converts XML into nested Python dict objects (OrderedDict in older releases).

import xmltodict

with open(xmlPath, 'rb') as fd:
    xmlDict = xmltodict.parse(fd)
print(xmlDict['Data']['subdata1']['@id'])
print(xmlDict['Data']['subdata1']['subdata2']['#text'])

As you can see, namespaces won't be an issue. And if you are familiar with Python dicts then it will be very simple to iterate through and find what you want.
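If adding a dependency is not desirable, ElementTree can also match namespaced tags directly when given a prefix map, which avoids stripping anything. This is a different technique from the ones above, sketched here against the sample XML from the question. The prefix 'ns' is an arbitrary choice for this example, and 'link' is the namespace URI taken from the sample.

```python
import xml.etree.ElementTree as ET

# The sample document from the question (declaration omitted for brevity).
SAMPLE = """<Data xmlns="link">
    <subdata1 id="something">
        <subdata2>data
            <subdata3>data</subdata3>
        </subdata2>
    </subdata1>
</Data>"""

# Map an arbitrary prefix to the namespace URI declared on the root element.
NS = {'ns': 'link'}

root = ET.fromstring(SAMPLE)

# Tags keep their '{link}' prefix; the prefix map lets the path refer to them.
names = [sub2.text.strip() for sub2 in root.findall('ns:subdata1/ns:subdata2', NS)]
print('ID =', names[0])
```

This only works if the namespace URI is known in advance and is the same in every file, which seems to be the case here since the files share one format.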
