
I'm trying to write a simple program to read my financial XML files from GnuCash, and learn Python in the process.

The XML looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
     xmlns:gnc="http://www.gnucash.org/XML/gnc"
     xmlns:act="http://www.gnucash.org/XML/act"
     xmlns:book="http://www.gnucash.org/XML/book"
     {...}
     xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
  <cmdty:space>ISO4217</cmdty:space>
  <cmdty:id>BRL</cmdty:id>
  <cmdty:get_quotes/>
  <cmdty:quote_source>currency</cmdty:quote_source>
  <cmdty:quote_tz/>
</gnc:commodity>

Right now, I'm able to iterate and get results using

import xml.etree.ElementTree as ET 
r = ET.parse("file.xml").findall('.//') 

after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.

Note that I'm a complete noob in Python. I've read "Python and GnuCash: Extract data from GnuCash files", "Cleaning an XML file in Python before parsing" and 'python: xml.etree.ElementTree, removing "namespaces"', along with the ElementTree docs, and I'm still lost...

I've come up with this solution:

def strip_namespaces(self, tree):

    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)

    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)          
        end = re.sub(nspOpen, '<\/', tree.tag)

    # pprint(finaltree)
    return

But I'm failing to apply it. I can't seem to retrieve the tag names as they appear in the file.

Cœur
moraleida
  • It is not clear from your question what your expected output is or what kind of data you are trying to extract. – elyase May 20 '13 at 00:35
  • I want to either be able to parse the file removing prefixes and namespaces `(eg.: becomes )` or reference the elements ignoring the prefix `(eg.: element.findall('book/transaction') for )` – moraleida May 20 '13 at 00:43
  • Try lxml. It's a different XML library for Python and understands namespaces. – tdelaney May 20 '13 at 04:31
  • This answer might help: http://stackoverflow.com/a/11227304/407651. – mzjn May 23 '13 at 05:58
  • If you want to use python for gnucash, I would recommend exploring my package piecash http://piecash.readthedocs.io/en/latest/. It works with gnucash books saved in one of the SQL formats – sdementen Nov 26 '17 at 05:53
  • @moraleida The XML is not well-formed; the XML structure itself has an issue. – Ahito Dec 21 '18 at 10:34

1 Answer


I think the Python code below will be helpful to you.

sample.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
     xmlns:act="http://www.gnucash.org/XML/act"
     xmlns:book="http://www.gnucash.org/XML/book"
     xmlns:vendor="http://www.gnucash.org/XML/vendor">
    <gnc:change>
        <gnc:lastUpdate>2018-12-21
        </gnc:lastUpdate>
    </gnc:change>
    <gnc:bill>
        <gnc:billAccountNumber>1234</gnc:billAccountNumber>
        <gnc:roles>
            <gnc:id>111111</gnc:id>
            <gnc:pos>2</gnc:pos>
            <gnc:genid>15</gnc:genid>
        </gnc:roles>
    </gnc:bill>
    <gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>

Python code to strip the namespace from the root tag:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

def xmlns(tag):
    """Return the tag name without its '{namespace-uri}' prefix."""
    # ElementTree stores qualified names as '{uri}local'; keep only 'local'.
    if '}' in tag:
        return tag.split('}', 1)[1]
    return tag


tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the namespace URI as prefix
print(xmlns(root.tag))  # root tag without the namespace prefix

Output:

{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
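
The snippet above only cleans the root tag. A minimal sketch of applying the same idea to every element and attribute, so that plain paths such as 'book/id' work, might look like this. The `SAMPLE` document is a cut-down, hypothetical stand-in for a real GnuCash file (the `cd` namespace URI is an assumption), and `local_name` and `strip_namespaces` are illustrative helper names, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down GnuCash-style document, used only for illustration.
# The "cd" namespace URI is an assumption; the question elides it.
SAMPLE = """<gnc-v2 xmlns:gnc="http://www.gnucash.org/XML/gnc"
        xmlns:book="http://www.gnucash.org/XML/book"
        xmlns:cd="http://www.gnucash.org/XML/cd">
  <gnc:count-data cd:type="book">1</gnc:count-data>
  <gnc:book version="2.0.0">
    <book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
  </gnc:book>
</gnc-v2>"""


def local_name(name):
    """Drop a leading '{namespace-uri}' from a tag or attribute name."""
    return name.split('}', 1)[1] if name.startswith('{') else name


def strip_namespaces(root):
    """Rewrite every tag and attribute name in the tree, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str):  # comments and PIs have non-string tags
            elem.tag = local_name(elem.tag)
        for key in list(elem.attrib):  # copy the keys; the dict is mutated below
            elem.attrib[local_name(key)] = elem.attrib.pop(key)
    return root


root = strip_namespaces(ET.fromstring(SAMPLE))
print([elem.tag for elem in root.iter()])  # ['gnc-v2', 'count-data', 'book', 'id']
print(root.find('book/id').text)           # 91314601aa6afd17727c44657419974a
print(root.find('count-data').get('type')) # book
```

One caveat with this approach: names that differ only by namespace (for example an `id` under two different prefixes) become indistinguishable after stripping, so it fits best when you only read the data and know the structure.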

Ahito