Programmatically clean/ignore namespaces in XML - python

Question

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.

The XML looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
     xmlns:gnc="http://www.gnucash.org/XML/gnc"
     xmlns:act="http://www.gnucash.org/XML/act"
     xmlns:book="http://www.gnucash.org/XML/book"
     {...}
     xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
  <cmdty:space>ISO4217</cmdty:space>
  <cmdty:id>BRL</cmdty:id>
  <cmdty:get_quotes/>
  <cmdty:quote_source>currency</cmdty:quote_source>
  <cmdty:quote_tz/>
</gnc:commodity>

Right now, I'm able to iterate and get results using

import xml.etree.ElementTree as ET 
r = ET.parse("file.xml").findall('.//')

after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.

Note that I'm a complete noob in python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with ElementTree docs and I'm still lost...

I've come up with this solution:

def strip_namespaces(self, tree):

    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)

    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)          
        end = re.sub(nspOpen, '<\/', tree.tag)

    # pprint(finaltree)
    return

But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.

it is not clear from your question what is your expected output or what kind of data you are trying to extract. — elyase, May 20 '13 at 00:35
I want to either be able to parse the file removing prefixes and namespaces `(eg.: becomes )` or reference the elements ignoring the prefix `(eg.: element.findall('book/transaction') for )` — moraleida, May 20 '13 at 00:43
Try lxml. It's a different XML library for python and understands namespaces. — tdelaney, May 20 '13 at 04:31
This answer might help: http://stackoverflow.com/a/11227304/407651. — mzjn, May 23 '13 at 05:58
If you want to use python for gnucash, I would recommend exploring my package piecash http://piecash.readthedocs.io/en/latest/. It works with gnucash books saved in one of the SQL formats — sdementen, Nov 26 '17 at 05:53
The XML is not well formed; the XML structure itself has an issue. @moraleida — Ahito, Dec 21 '18 at 10:34

score 0 · Answer 1 · answered Dec 21 '18 at 10:57

I think the Python code below will be helpful to you.

sample.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
     xmlns:act="http://www.gnucash.org/XML/act"
     xmlns:book="http://www.gnucash.org/XML/book"
     xmlns:vendor="http://www.gnucash.org/XML/vendor">
    <gnc:change>
        <gnc:lastUpdate>2018-12-21
        </gnc:lastUpdate>
    </gnc:change>
    <gnc:bill>
        <gnc:billAccountNumber>1234</gnc:billAccountNumber>
        <gnc:roles>
            <gnc:id>111111</gnc:id>
            <gnc:pos>2</gnc:pos>
            <gnc:genid>15</gnc:genid>
        </gnc:roles>
    </gnc:bill>
    <gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>

Python code to remove the xmlns (namespace) prefix from the root tag:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9


def xmlns(tag):
    """Strip every '{namespace-uri}' part from a tag name."""
    parts = []
    for chunk in tag.split('{'):
        if '}' in chunk:
            # Keep only the text after the closing brace.
            parts.append(chunk.split('}')[1])
        else:
            parts.append(chunk)
    return ''.join(parts)


tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the namespace URI as a prefix
print(xmlns(root.tag))  # root tag without the namespace URI

Output:

{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
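
The code above only cleans the root tag. To look up child elements without prefixes, the same idea could be applied to every element in the tree. What follows is a minimal sketch, not a definitive implementation: `local_name`, `strip_namespaces` and the inline `SAMPLE` string (including its `act:kind` attribute) are hypothetical names and data made up for illustration, loosely modelled on sample.xml. It assumes a tree parsed with the standard library's ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modelled on sample.xml above.
SAMPLE = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
     xmlns:act="http://www.gnucash.org/XML/act">
    <gnc:bill act:kind="demo">
        <gnc:billAccountNumber>1234</gnc:billAccountNumber>
        <gnc:roles>
            <gnc:id>111111</gnc:id>
        </gnc:roles>
    </gnc:bill>
    <gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
"""


def local_name(qualified):
    """Return 'tag' for '{uri}tag'; names without a namespace pass through."""
    if isinstance(qualified, str) and qualified.startswith("{"):
        return qualified.split("}", 1)[1]
    return qualified


def strip_namespaces(root):
    """Rewrite tags and attribute names in place, dropping the '{uri}' part."""
    for elem in root.iter():
        elem.tag = local_name(elem.tag)
        # Iterate over a copy of the keys because the dict is modified.
        for key in list(elem.attrib):
            elem.attrib[local_name(key)] = elem.attrib.pop(key)
    return root


root = strip_namespaces(ET.fromstring(SAMPLE))
print(root.tag)                                 # prodinfo
print(root.findtext("bill/billAccountNumber"))  # 1234

# Alternative (Python 3.8+): leave the tree untouched and use the
# '{*}' wildcard to match a tag in any namespace.
untouched = ET.fromstring(SAMPLE)
ids = [e.text for e in untouched.findall(".//{*}id")]
print(ids)                                      # ['111111']
```

Stripping namespaces discards information: if two namespaces define a tag with the same local name (for example an `id` in both `book` and `act`), they can no longer be told apart afterwards. The wildcard form avoids modifying the tree, at the cost of requiring Python 3.8 or newer.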
