
I'm facing a very weird bug in my Python code that I've been trying to track down, to no avail. I'd be really grateful if someone could point out the source of the error.

Before getting to the code, I'll first explain what I'm trying to do. I have a nested XML file, and I'm trying to (1) get all attribute names and their values; and (2) get all node names and their text values; for all subelements, nested or otherwise, of a specific node in the file. Once I get the above data as key:value pairs in a dictionary, I'll write the dictionary as one row to a delimited file, using csv.DictWriter.

For this, I defined a recursive function traverse, which takes an xml.etree.ElementTree.Element, does the above recursively for that element, creates key:value pairs (either attribute:value or nodename:text) in a dictionary, and finally returns it. Here, is_nested(element) returns True if element has subelements and False otherwise, and no_junk is a function that removes the junk words listed in junkwords from a name:

def traverse(element, junkwords=[]):
    # Relies on a module-level data_dict that the caller resets before each top-level call.
    # Record every attribute of this element; repeated keys are joined with '|'.
    for attribute in element.attrib:
        # Note: the membership test uses the raw name, while the stored key is the cleaned one.
        if attribute not in data_dict:
            data_dict[no_junk(attribute, junkwords)] = element.attrib[attribute]
        else:
            data_dict[no_junk(attribute, junkwords)] = data_dict[no_junk(attribute, junkwords)] + '|' + element.attrib[attribute]
    for subelement in element:
        if is_nested(subelement):
            traverse(subelement, junkwords)
        else:
            # Leaf node: store its text, or '' when it has none.
            text = subelement.text if subelement.text is not None else ''
            if subelement.tag not in data_dict:
                data_dict[no_junk(subelement.tag, junkwords)] = text
            else:
                data_dict[no_junk(subelement.tag, junkwords)] = data_dict[no_junk(subelement.tag, junkwords)] + '|' + text
    return data_dict
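The two helpers are not shown above. A minimal sketch of what they might look like, assuming no_junk simply deletes each junk word from the name (the real implementations may differ, and the sample document here is made up):

```python
import xml.etree.ElementTree as ET

def is_nested(element):
    # An Element behaves like a sequence of its direct children,
    # so a non-zero length means it has subelements.
    return len(element) > 0

def no_junk(name, junkwords):
    # Assumption: "removing junk words" means deleting each junk
    # substring from the attribute or tag name.
    for word in junkwords:
        name = name.replace(word, '')
    return name

# Quick check on a tiny, made-up document.
sample = ET.fromstring('<a><b><c>1</c></b><d>2</d></a>')
nested_flags = [is_nested(child) for child in sample]   # [True, False]
cleaned = no_junk('ns_ProductCode', ['ns_'])            # 'ProductCode'
```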

Now, there are many such XML files and multiple such target elements which I'm trying to traverse in a given XML file. So this is how I actually use the function:

for xmlfile in xmlfiles:
    tree = ET.ElementTree(file=xmlfile)
    root = tree.getroot()
    target_elements = root.findall('.//tag')

    for element in target_elements:
        data_dict = {}
        data_dict = traverse(element)              
        with open('FINAL.tsv','a+') as f:
            writer = csv.DictWriter(f,delimiter='\t',fieldnames=headers,lineterminator='\n')
            writer.writerow(data_dict)

But the delimited file is being written very weirdly: every row does get written, but each row is written multiple times! In each iteration, the data dictionary is supposed to change, but that doesn't seem to be happening here. I've checked and rechecked the XML file, and I've made sure that the data in it is different for every iteration. I'm positive the issue isn't with the XML file itself or its parsing, so my program logic must be going wrong somewhere. What could be the source of the error?
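One way to test whether state is leaking between iterations would be to stop relying on the module-level data_dict and let each top-level call build its own dictionary. The sketch below is a hypothetical variant (flatten and add_value are made-up names, the XML is made up, and no_junk is omitted for brevity), not the code that produced the problem:

```python
import xml.etree.ElementTree as ET

def add_value(result, key, value):
    # Join repeated keys with '|', as traverse does.
    if key in result:
        result[key] = result[key] + '|' + value
    else:
        result[key] = value

def flatten(element, result=None):
    # A fresh dict is created for every top-level call, so nothing
    # can carry over from a previous target element.
    if result is None:
        result = {}
    for name, value in element.attrib.items():
        add_value(result, name, value)
    for sub in element:
        if len(sub) > 0:            # nested: recurse into it
            flatten(sub, result)
        else:                       # leaf: keep its text ('' if missing)
            add_value(result, sub.tag, sub.text or '')
    return result

doc = ET.fromstring(
    '<root>'
    '<body id="1"><Code>A</Code><Img><Url>u1</Url></Img></body>'
    '<body id="2"><Code>B</Code><Img><Url>u2</Url></Img></body>'
    '</root>'
)
rows = [flatten(body) for body in doc.findall('.//body')]
```

If the rows still repeat with a variant like this, that would point away from the shared dictionary as the cause.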

EDIT:

A sample XML file (stripped to its bare bones) named 'test.xml' looks like this (there are a lot more similar subelements in the <body> tag, there may be multiple <body> tags, and there may be multiple, different nested elements like <PropertyImage>):

<?xml version='1.0' encoding='UTF-8'?>
<Envelope>
    <Body>
        <Response>
            <response>
                <body>
                    <ProductCode>ABC123</ProductCode>
                    <ProductType>Type1</ProductType>
                    <ProductName>XYZ</ProductName>
                    <PropertyImage>
                        <VendorID>9145</VendorID>
                        <Caption nil="true"/>
                        <Thumbnail>http://www.someurl1.com/image.jpg</Thumbnail>
                        <ActualSize>http://www.someurl2.com/image.jpg</ActualSize>
                    </PropertyImage>
                    <ProductDetails>Some Random details</ProductDetails>
                    <ResortFee>0.0</ResortFee>
                    <NonRefundable>0</NonRefundable>
                    <VendorCountryISO>USA</VendorCountryISO>
                    <VendorZip>30601</VendorZip>
                </body>
            </response>
        </Response>
    </Body>
</Envelope>

Correspondingly, my code would be:

tree = ET.ElementTree(file='test.xml')
root = tree.getroot()
target_elements = root.findall('.//body')

for element in target_elements:
    data_dict = {}
    data_dict = traverse(element)              
    with open('FINAL.tsv','a+') as f:
        writer = csv.DictWriter(f,delimiter='\t',fieldnames=headers,lineterminator='\n')
        writer.writerow(data_dict)

...following which my expected output is a delimited file which writes

data_dict = {'ProductCode':'ABC123','ProductType':'Type1','ProductName':'XYZ','VendorID':'9145','Caption':'','Thumbnail':'http://www.someurl1.com/image.jpg','ActualSize':'http://www.someurl2.com/image.jpg','ProductDetails':'Some Random details','ResortFee':'0.0','NonRefundable':'0','VendorCountryISO':'USA','VendorZip':'30601'}

as a row. A single XML file may contain multiple <body> tags, and the data_dict for each of them gets appended to the above delimited file. There may also be multiple XML files, and the data_dicts from all of them get appended to the same delimited file.
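For reference, a minimal sketch of the writing step on its own, with the output opened once and the header written once. The headers list and the two rows are made-up values, and an in-memory buffer stands in for 'FINAL.tsv':

```python
import csv
import io

headers = ['ProductCode', 'ProductType', 'VendorID']   # hypothetical subset
rows = [
    {'ProductCode': 'ABC123', 'ProductType': 'Type1', 'VendorID': '9145'},
    {'ProductCode': 'DEF456', 'ProductType': 'Type2'},  # VendorID missing
]

buffer = io.StringIO()   # stands in for the opened 'FINAL.tsv'
writer = csv.DictWriter(buffer, delimiter='\t', fieldnames=headers,
                        lineterminator='\n', restval='')
writer.writeheader()
for row in rows:
    writer.writerow(row)  # one dict becomes one line

output = buffer.getvalue()
```

Keys missing from a row are filled with restval. A key that is not in fieldnames makes DictWriter raise ValueError unless extrasaction='ignore' is passed.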

Train Heartnet
  • Let me know if the linked duplicate doesn't answer your question. – vaultah Dec 18 '16 at 09:50
  • @vaultah: Thank you for the quick response! I'm not sure if it answers the question. It does seem to fit the bill of being the source of the error, but the common workaround (of giving a default value of `None`; edited above code) doesn't seem to solve the issue. – Train Heartnet Dec 18 '16 at 10:34
  • @vaultah: I have confirmed that it is not the source of error. I redefined the `traverse` function without using a mutable default dictionary (see edited code), and the error still persists. – Train Heartnet Dec 18 '16 at 11:12
  • I apologise for misdiagnosing the problem... Can you post the contents of one of the XML files and the desired output? – vaultah Dec 18 '16 at 11:46
  • That's alright! I wasn't aware of the dangers of mutable default arguments until today. A new piece of knowledge, thanks to you. :) And I've edited the question, adding details of a sample XML file and the desired output. Thank you for your help! :) – Train Heartnet Dec 18 '16 at 13:41
  • Does this input/output actually produce the multiple rows written? – Wayne Werner Dec 19 '16 at 21:15
  • Would love to help, but can't get anything you posted to run. Please provide [MCVE](http://stackoverflow.com/help/mcve). We should be able to cut-n-paste the xml sample and save as `test.xml`, then cut-n-paste the code sample and run it with no modifications whatsoever and reproduce the issue. – Mark Tolonen Dec 21 '16 at 03:14

0 Answers