
I'm attempting to parse XML and store the element values in an object. The issue I'm running into is that the child elements are repeated, so I'm not quite sure what the best practice is for iterating over them and storing the values.

What I was considering doing is looking at a child element and adding a counter. The counter would be used to create an undetermined amount of object containers to store the values. Would this work or is there a better way to do this?

Here is an example of my class:

class SODOCUMENTITEMS:

    def __init__(self):
        self.recordno = ''
        self.dochdrno = ''
        self.docid = ''

and here is an example of my XML:

`<sotransitems>
  <sotransitem>
    <recordno>40562</recordno>
    <dochdrno>16987</dochdrno>
    <docid/>
    <bundlenumber/>
    <itemid>13</itemid>
    <itemdesc>Winter Lager</itemdesc>
    <line_no>0</line_no>
    <warehouseid>Main</warehouseid>
    <quantity>1</quantity>
    <unit>Each</unit>
    <price>4.99</price>
    <retailprice>4.99</retailprice>
    <totalamount>4.99</totalamount>
    <taxrate/>
    <tax/>
    <grossamount/>
    <locationid/>
    <departmentid/>
    <memo/>
    <discsurchargememo/>
    <revrectemplate/>
    <revrecstartdate>
      <year></year>
      <month></month>
      <day></day>
    </revrecstartdate>
    <revrecenddate>
      <year></year>
      <month></month>
      <day></day>
    </revrecenddate>
    <renewalmacro/>
    <currency>USD</currency>
    <exchratedate>
      <year></year>
      <month></month>
      <day></day>
    </exchratedate>
    <exchratetype/>
    <exchrate>1</exchrate>
    <trx_price>4.99</trx_price>
    <trx_value>4.99</trx_value>
    <projectid/>
    <customerid>2--2</customerid>
    <vendorid/>
    <employeeid/>
    <classid/>
    <contractid/>
    <taskno/>
    <billingtemplate/>
    <sourcedocumentid/>
    <sourcedocumentkey/>
    <sourcedocumententrytkey/>
    <discountpercent/>
    <linesubtotals/>
    <customfields>
      <customfield>
        <customfieldname>TESTCUSTOM</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_NUMBER</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>NUMBER1</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_DATE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>MYTESTFIELD</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TESTBOX1</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMUSERDEFINEDDEMTSS</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA777</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMAA5678</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMSITE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
    </customfields>
  </sotransitem>
  <sotransitem>
    <recordno>40563</recordno>
    <dochdrno>16987</dochdrno>
    <docid/>
    <bundlenumber/>
    <itemid>12</itemid>
    <itemdesc>Loktar</itemdesc>
    <line_no>1</line_no>
    <warehouseid>Main</warehouseid>
    <quantity>1</quantity>
    <unit>Each</unit>
    <price>90</price>
    <retailprice>90</retailprice>
    <totalamount>90</totalamount>
    <taxrate/>
    <tax/>
    <grossamount/>
    <locationid/>
    <departmentid>fail</departmentid>
    <memo/>
    <discsurchargememo/>
    <revrectemplate/>
    <revrecstartdate>
      <year></year>
      <month></month>
      <day></day>
    </revrecstartdate>
    <revrecenddate>
      <year></year>
      <month></month>
      <day></day>
    </revrecenddate>
    <renewalmacro/>
    <currency>USD</currency>
    <exchratedate>
      <year></year>
      <month></month>
      <day></day>
    </exchratedate>
    <exchratetype/>
    <exchrate>1</exchrate>
    <trx_price>90</trx_price>
    <trx_value>90</trx_value>
    <projectid/>
    <customerid>2--2</customerid>
    <vendorid/>
    <employeeid/>
    <classid/>
    <contractid/>
    <taskno/>
    <billingtemplate/>
    <sourcedocumentid/>
    <sourcedocumentkey/>
    <sourcedocumententrytkey/>
    <discountpercent/>
    <linesubtotals/>
    <customfields>
      <customfield>
        <customfieldname>TESTCUSTOM</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_NUMBER</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>NUMBER1</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_DATE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>MYTESTFIELD</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TESTBOX1</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMUSERDEFINEDDEMTSS</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA777</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMAA5678</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMSITE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
    </customfields>
  </sotransitem>
</sotransitems>`

I'm just looking for a small sample or suggestion on how best to handle parsing and storing each set of values in an object. Any information will help, and I'm fine with doing additional research based on your feedback.

Thanks!

deftek

1 Answer


There are two main ways to parse XML data:

  1. DOM parsers.

    They load the whole XML file into memory and build a DOM (Document Object Model). This lets the programmer use many convenient technologies to navigate the document or retrieve data from it (e.g. XPath, XSLT transformations, XML-schema-to-class transformation). The downside of this technique is that it may require a LOT of memory and may be slow (depending on the parser, the DOM model, indexes in the DOM, ...).

In the example I removed some fields from sotransitem and customfields for simplicity.

Example:

class definition:

class Sotransitem:

    def __init__( self ):
        # instance attributes, filled in by the parser
        self.recordno = None
        self.unit = None
        self.customfields = {}

    def __repr__( self ):
        return "Item( rec_no: {rec}, fields: {fields} )".format( rec=self.recordno,
                                                                 fields=str( self.customfields ) )

Here I will use the standard Python library, but you should also look at other libraries. The most popular, as far as I know, are lxml and BeautifulSoup.

actual parser:

import xml.etree.ElementTree as ET

tree = ET.parse( 'test.xml' )
root = tree.getroot()

all_items = []

for node in root.findall( 'sotransitem' ):
    item = Sotransitem()
    item.recordno = int( node.find( 'recordno' ).text )
    item.unit = node.find('unit').text

    for custom_node in node.findall('./customfields/customfield'):
        value = custom_node.find('customfieldvalue').text
        name = custom_node.find('customfieldname').text
        item.customfields[ name ] = value

    all_items.append( item )

print( all_items )
# [Item( rec_no: 40562, fields: {'TEST_NUMBER': None, 'TESTCUSTOM': 'true'} ), Item( rec_no: 40563, fields: {'TESTCUSTOM': 'true', 'NUMBER1': None} )]

This covers most of my needs, but with an XML schema it can be even simpler. See the "Asserting a Schema" example in the lxml documentation.
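Many elements in the XML from the question are empty (for example `<docid/>`), and ElementTree reports their `.text` as `None`, so a call like `int( node.find( 'taxrate' ).text )` would raise. A minimal sketch of one way to guard against that, assuming a hypothetical helper `child_text` (it is not part of ElementTree) and an inline, trimmed-down sample instead of `test.xml`:

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample modelled on the XML in the question.
SAMPLE = """
<sotransitems>
  <sotransitem>
    <recordno>40562</recordno>
    <docid/>
    <unit>Each</unit>
    <customfields>
      <customfield>
        <customfieldname>TESTCUSTOM</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_NUMBER</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
    </customfields>
  </sotransitem>
</sotransitems>
"""

def child_text(node, tag, default=None):
    """Hypothetical helper: text of a child element, or `default` if it is missing or empty."""
    child = node.find(tag)
    if child is None or child.text is None:
        return default
    return child.text

root = ET.fromstring(SAMPLE)
records = []
for node in root.findall('sotransitem'):
    records.append({
        'recordno': int(child_text(node, 'recordno', '0')),
        'docid': child_text(node, 'docid', ''),
        'unit': child_text(node, 'unit', ''),
        # Empty custom field values stay None here; pass a default to change that.
        'customfields': {
            child_text(c, 'customfieldname'): child_text(c, 'customfieldvalue')
            for c in node.findall('./customfields/customfield')
        },
    })

print(records)
```

Plain dicts are used here only to keep the sketch short; the same helper could fill the attributes of the `Sotransitem` class instead.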

  2. SAX parsers. They read the XML in small parts; when the parser finds a tag (opening or closing), it fires an event with the tag found and its data (if it was a closing tag). SAX parsers normally discard almost all of that information once it is reported (they do, however, keep some things, for example a list of all elements that have not been closed yet).

    Pros: a SAX parser requires a constant amount of RAM, much less than DOM.

    Cons: it's impossible to use most XML technologies.

Example:

import xml.etree.ElementTree as ET

all_items = []

# iterparse yields (event, element) pairs; the first "start" event carries the root element
nodes_parser = ET.iterparse( 'test.xml', events=( "start", "end" ) )
event, root = next( nodes_parser )

item = None
sotrans_node = None

for event, node in nodes_parser:
    if event == "start" and node.tag == "sotransitem":
        # a new item begins: store the previous one, if any
        if item is not None:
            all_items.append( item )
        item = Sotransitem()
        sotrans_node = node

    elif event == "end" and item is not None:
        # on "end" the element's text and children are fully parsed
        tag = node.tag
        if tag == "recordno":
            item.recordno = int( node.text )
        elif tag == "unit":
            item.unit = node.text
        elif tag == 'customfield':
            value = node.find('customfieldvalue').text
            name = node.find('customfieldname').text
            item.customfields[ name ] = value

        sotrans_node.clear() # otherwise the children are kept in memory until the "end" event of "sotransitem"
    elif sotrans_node is not None:
        sotrans_node.clear()
    root.clear() # same as above, but for the root element

# store the last item
if item is not None:
    all_items.append( item )

print( all_items )
# same result as before
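The example above relies on `ET.iterparse`, which hands out partially built elements. For comparison, a classic event-handler version using the standard `xml.sax` module might look like the sketch below. The handler class `ItemHandler`, its attributes, and the inline sample are assumptions made for illustration; only `recordno` and `unit` are collected to keep it short.

```python
import xml.sax

# Trimmed-down inline sample instead of test.xml.
SAMPLE = b"""<sotransitems>
  <sotransitem><recordno>40562</recordno><unit>Each</unit></sotransitem>
  <sotransitem><recordno>40563</recordno><unit>Each</unit></sotransitem>
</sotransitems>"""

class ItemHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects recordno/unit of each sotransitem into a dict."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None   # dict for the sotransitem being read
        self._chars = []       # text pieces of the element being read

    def startElement(self, name, attrs):
        if name == 'sotransitem':
            self._current = {}
        self._chars = []

    def characters(self, content):
        # May be called several times for one text node, so collect the pieces.
        self._chars.append(content)

    def endElement(self, name):
        text = ''.join(self._chars).strip()
        if name == 'sotransitem':
            self.items.append(self._current)
            self._current = None
        elif self._current is not None and name in ('recordno', 'unit'):
            self._current[name] = text
        self._chars = []

handler = ItemHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.items)
```

Here no tree is built at all, so the handler has to track by itself which element it is currently inside; that bookkeeping is the price of the low memory use.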

Which way to choose depends on the amount of data stored in your XML file.

If it's just a simple script (written once and soon forgotten) that retrieves some data from a small file, just use DOM.

If it is a config file or a small message between servers, a few megabytes long, DOM with automatic XML-to-class transformation will probably be the best.

If your data is too big to fit in server memory (e.g. the OpenStreetMap world.xml) or too many messages are being parsed at once, then you should choose SAX.

Arnial