Given the sample XML below:
<_Document>
    <_Data1> 'foo'
        <_SubData1> 'bar1' </_SubData1>
        <_SubData2> 'bar2' </_SubData2>
        <_SubData3> 'bar3' </_SubData3>
    </_Data1>
</_Document>
I want to pair each SubData value with the Data1 value in its own dictionary and append each of those dictionaries to a list, so that the output looks something like:
[{Data1: 'foo', SubData1: 'bar1'}, {Data1: 'foo', SubData2: 'bar2'}, {Data1: 'foo', SubData3: 'bar3'}]
My code is:
from lxml import etree
import re

# root is the root element of the already-parsed XML (parsing omitted here)
new_records = []
for child in root.iter('_Document'):  # find every '_Document' element
    for top_data in child.iter():  # walk every element inside each '_Document' section
        if "Data" in top_data.tag:
            for data in top_data:
                rec = {}
                if data.text is not None and data.text.isspace() is False:  # skip NoneTypes and empty data
                    g = data.tag.strip("_")  # clean up the tag
                    rec[g] = data.text.replace("\n", " ")  # clean up the value
                    for b in re.finditer(r'^_SubData', data.tag):  # look for 'SubData' in the given tag
                        for subdata in data:
                            subdict = {}
                            if subdata.text is not None:  # again, skip NoneTypes
                                z = subdata.tag.strip("_")  # tag cleaning
                                subdict[z] = subdata.text.replace("\n", " ")  # text cleaning
                                rec.update(subdict)  # update the data record dictionary with the subdata
                new_records.append(rec)  # append to the list
This, unfortunately, outputs:
[{Data1: 'foo', SubData3: 'bar3'}]
Only the final update of the dictionary ends up being appended.
I've tried several variations of this. One was initializing a list after the first 'if' statement in the second for loop and appending to it on each pass, but the extra nesting that produced required quite a bit of cleanup at the end. I've also tried initializing empty dictionaries outside the loops, so that earlier updates would be preserved, and appending those instead.
I'm curious whether there is some lxml functionality that I've missed, or a more Pythonic approach to get the desired output.
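
For reference, here is a minimal sketch of one way the desired output could be built: create a fresh dictionary for every SubData element instead of updating a single shared one. It assumes the structure of the sample above (each _Data element sits directly under _Document, with its _SubData elements as direct children) and that the quotes in the sample text are not wanted in the values. The helper names `clean` and `build_records` are hypothetical, and the fallback import is only there so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library offers the same API subset used below.
    import xml.etree.ElementTree as etree

SAMPLE = """
<_Document>
    <_Data1> 'foo'
        <_SubData1> 'bar1' </_SubData1>
        <_SubData2> 'bar2' </_SubData2>
        <_SubData3> 'bar3' </_SubData3>
    </_Data1>
</_Document>
"""


def clean(text):
    """Collapse whitespace and drop the surrounding quotes used in the sample."""
    return " ".join(text.split()).strip("'")


def has_text(element):
    """True for real elements (not comments) that carry non-blank text."""
    return (
        isinstance(element.tag, str)
        and element.text is not None
        and element.text.strip() != ""
    )


def build_records(root):
    records = []
    for document in root.iter("_Document"):
        for data in document:
            if not (has_text(data) and data.tag.startswith("_Data")):
                continue
            # Shared part of every record for this _Data element.
            base = {data.tag.lstrip("_"): clean(data.text)}
            for sub in data:
                if has_text(sub) and sub.tag.startswith("_SubData"):
                    # A new dict per SubData, so earlier records are never overwritten.
                    records.append({**base, sub.tag.lstrip("_"): clean(sub.text)})
    return records


root = etree.fromstring(SAMPLE.strip())
records = build_records(root)
print(records)
```

The key design choice is that `{**base, ...}` creates a new dictionary on every pass, so each appended record is independent of the others. If the real documents nest the _Data elements more deeply than the sample, the inner loops would need to be adjusted accordingly.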