I'm facing a very weird bug in my Python code, which I've been trying to figure out, to no avail. I'd be really grateful if someone could point out the source of my error.
Before getting to the code, I'll first explain what I'm trying to do. I have a nested XML file, and I'm trying to (1) get all attribute names and their values; and (2) get all node names and their text values; for all subelements, nested or otherwise, of a specific node in the file. Once I get the above data as key:value pairs in a dictionary, I'll write the dictionary as one row to a delimited file, using csv.DictWriter.
For this, I defined a recursive function traverse which takes an xml.etree.ElementTree.Element, does the above recursively for that element, creates key:value pairs (either attribute:value or nodename:text pairs) in a dictionary, and finally returns it (is_nested(element) returns True if element has subelements, and False otherwise; no_junk is a function for removing the junk words listed in junkwords):
    def traverse(element,junkwords=[]):
        if element.attrib == {}:
            pass
        else:
            for attribute in element.attrib:
                if attribute not in data_dict:
                    data_dict[no_junk(attribute,junkwords)] = element.attrib[attribute]
                else:
                    data_dict[no_junk(attribute,junkwords)] = data_dict[no_junk(attribute,junkwords)] + '|' + element.attrib[attribute]
        for subelement in element:
            if is_nested(subelement):
                traverse(subelement,junkwords)
            else:
                if subelement.text != None:
                    if subelement.tag not in data_dict:
                        data_dict[no_junk(subelement.tag,junkwords)] = subelement.text
                    else:
                        data_dict[no_junk(subelement.tag,junkwords)] = data_dict[no_junk(subelement.tag,junkwords)] + '|' + subelement.text
                else:
                    if subelement.tag not in data_dict:
                        data_dict[no_junk(subelement.tag,junkwords)] = ''
                    else:
                        data_dict[no_junk(subelement.tag,junkwords)] = data_dict[no_junk(subelement.tag,junkwords)] + '|' + ''
        return data_dict
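The two helpers are not shown above, so here is a minimal sketch of what they might look like. Both bodies are assumptions based only on the descriptions given: is_nested is assumed to check for child elements, and no_junk is assumed to strip each junk word out of a tag or attribute name. The 'ns_' junk word and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_nested(element):
    """Return True if the element has at least one child element."""
    return len(element) > 0


def no_junk(name, junkwords=()):
    """Remove every junk word from a tag or attribute name (assumed behaviour)."""
    for word in junkwords:
        name = name.replace(word, '')
    return name


# Quick check on a small in-memory document (hypothetical data)
sample = ET.fromstring('<a><b><c>1</c></b><d>2</d></a>')
nested_flags = [is_nested(child) for child in sample]  # <b> has a child, <d> does not
cleaned = no_junk('ns_ProductCode', ['ns_'])
print(nested_flags, cleaned)
```

Note that with helpers like these, the key that is stored (the cleaned name) can differ from the key that is tested with `in data_dict` (the raw name) whenever junkwords is non-empty.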
Now, there are many such XML files and multiple such target elements which I'm trying to traverse in a given XML file. So this is how I actually use the function:
    for xmlfile in xmlfiles:
        tree = ET.ElementTree(file=xmlfile)
        root = tree.getroot()
        target_elements = root.findall('.//tag')
        for element in target_elements:
            data_dict = {}
            data_dict = traverse(element)
            with open('FINAL.tsv','a+') as f:
                writer = csv.DictWriter(f,delimiter='\t',fieldnames=headers,lineterminator='\n')
                writer.writerow(data_dict)
But the delimited file is being written very weirdly: every row does get written, but each row is written multiple times! In each iteration the data dictionary is supposed to change, but that doesn't seem to be happening. I've checked and rechecked the XML file and made sure that the data in it is different on every iteration, so I'm positive the issue isn't with the XML file itself or its parsing. My program logic must be going wrong somewhere. What could be the source of the error?
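One way to check whether shared state is involved is to run a variant of the traversal that receives its dictionary explicitly, so each target element starts from a fresh dict and nothing depends on a module-level name. The sketch below is only an isolation test under stated assumptions: traverse_into and add_value are hypothetical names, junk-word handling is left out, and the second `<body>` (DEF456, 7001) is made-up data.

```python
import xml.etree.ElementTree as ET


def add_value(data, key, value):
    """Store value under key, joining repeated keys with '|'."""
    if key in data:
        data[key] = data[key] + '|' + value
    else:
        data[key] = value


def traverse_into(element, data):
    """Collect attributes and leaf texts of element into the dict passed in."""
    for name, value in element.attrib.items():
        add_value(data, name, value)
    for child in element:
        if len(child) > 0:          # child has subelements: recurse into it
            traverse_into(child, data)
        else:                       # leaf: record its text ('' if it has none)
            add_value(data, child.tag, child.text or '')
    return data


xml_text = """
<response>
  <body><ProductCode>ABC123</ProductCode><PropertyImage><VendorID>9145</VendorID></PropertyImage></body>
  <body><ProductCode>DEF456</ProductCode><PropertyImage><VendorID>7001</VendorID></PropertyImage></body>
</response>
"""
root = ET.fromstring(xml_text)
# A new, empty dict is created for every <body>, so rows cannot share state
rows = [traverse_into(body, {}) for body in root.findall('.//body')]
for row in rows:
    print(row)
```

If this variant produces distinct rows for the real files while the original does not, the difference is likely in how data_dict is shared between the loop and the function rather than in the XML.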
EDIT:
A sample XML file (stripped to its bare bones) named 'test.xml' looks like this (there are a lot more similar subelements in the <body> tag, there may be multiple <body> tags, and there may be multiple, different nested elements like <PropertyImage>):
    <?xml version='1.0' encoding='UTF-8'?>
    <Envelope>
      <Body>
        <Response>
          <response>
            <body>
              <ProductCode>ABC123</ProductCode>
              <ProductType>Type1</ProductType>
              <ProductName>XYZ</ProductName>
              <PropertyImage>
                <VendorID>9145</VendorID>
                <Caption nil="true"/>
                <Thumbnail>http://www.someurl1.com/image.jpg</Thumbnail>
                <ActualSize>http://www.someurl2.com/image.jpg</ActualSize>
              </PropertyImage>
              <ProductDetails>Some Random details</ProductDetails>
              <ResortFee>0.0</ResortFee>
              <NonRefundable>0</NonRefundable>
              <VendorCountryISO>USA</VendorCountryISO>
              <VendorZip>30601</VendorZip>
            </body>
          </response>
        </Response>
      </Body>
    </Envelope>
Correspondingly, my code would be:
    tree = ET.ElementTree(file='test.xml')
    root = tree.getroot()
    target_elements = root.findall('.//body')
    for element in target_elements:
        data_dict = {}
        data_dict = traverse(element)
        with open('FINAL.tsv','a+') as f:
            writer = csv.DictWriter(f,delimiter='\t',fieldnames=headers,lineterminator='\n')
            writer.writerow(data_dict)
...following which my expected output is a delimited file which writes

    data_dict = {'ProductCode':'ABC123','ProductType':'Type1','ProductName':'XYZ','VendorID':'9145','Caption':'','Thumbnail':'http://www.someurl1.com/image.jpg','ActualSize':'http://www.someurl2.com/image.jpg','ProductDetails':'Some Random details','ResortFee':'0.0','NonRefundable':'0','VendorCountryISO':'USA','VendorZip':'30601'}

as a row. A single XML file may contain multiple <body> tags, and the data_dict of each one gets appended to the delimited file above. There may also be multiple XML files, and the data_dicts from all of them get appended to that same file.
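The append step described above can be sketched on its own, independent of the XML parsing. This is a minimal illustration, not the actual script: the three-column headers list and the second row are made up, and an in-memory buffer stands in for FINAL.tsv so the result can be inspected directly. restval='' is a real csv.DictWriter parameter that fills in columns missing from a row.

```python
import csv
import io

headers = ['ProductCode', 'ProductType', 'VendorID']  # assumed column order
rows = [
    {'ProductCode': 'ABC123', 'ProductType': 'Type1', 'VendorID': '9145'},
    {'ProductCode': 'DEF456', 'VendorID': '7001'},  # made-up row with no ProductType
]

buffer = io.StringIO()  # stands in for open('FINAL.tsv', 'a', newline='')
writer = csv.DictWriter(buffer, delimiter='\t', fieldnames=headers,
                        lineterminator='\n', restval='')
writer.writeheader()
for row in rows:
    writer.writerow(row)  # exactly one output line per dictionary

output = buffer.getvalue()
print(output)
```

Writing every row through one writer on one open file makes it easy to count output lines against the number of dictionaries, which helps tell a duplicated write apart from a duplicated dictionary.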