I've been looking at other questions here on SO about zip and the magic *, which have helped me a lot in understanding how it works. For example:
- Why does x,y = zip(*zip(a,b)) work in Python?
- How does zip(*[iter(s)]*n) work in Python?
- Zip as a list comprehension
- XML to csv(-like) format
Even though I still have to think a little about what's actually happening, I have a better understanding now. What I'm trying to achieve is to convert an XML document into CSV. The last link above gets really close to what I want to do, but my source XML doesn't have a consistent structure, and that's where I'm hitting a wall. Here's an example of my source XML (simplified for the sake of this example):
<?xml version="1.0" encoding="utf-8"?>
<root>
<child>
<Name>John</Name>
<Surname>Doe</Surname>
<Phone>123456</Phone>
<Phone>654321</Phone>
<Fax>111111</Fax>
</child>
<child>
<Name>Tom</Name>
<Surname>Cat</Surname>
<Phone>98765</Phone>
<Phone>56789</Phone>
<Phone>00000</Phone>
</child>
</root>
As you can see, I can have 2 or more of the same element under <child>. Also, if a certain element has no value, it won't exist at all (like in the second <child>, where there's no <Fax>).
This is the code I currently have:
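# Assumption: standard-library ElementTree; lxml's etree offers the same calls used here
import xml.etree.ElementTree as etree

data = etree.parse('test.xml').findall(".//child")
tags = ('Name', 'Surname', 'Phone', 'Fax')
for child in data:
    # One list of elements per tag; zip stops at the shortest list
    for a in zip(*[child.findall(x) for x in tags]):
        print([x.text for x in a])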
>> Result:
['John', 'Doe', '123456', '111111']
Although this gives me a format I can use to write a csv, it has two problems:
- It skips the 2nd child because it doesn't have the <Fax> element (I suppose). If I only search for elements that exist in both children by setting tags = ('Name', 'Surname'), then I get 2 lists back (great!)
- The first child actually has 2 phone numbers, but only one is returned
From what I could test, values start to disappear as soon as zip(*...) comes into play. How could I set a default value so that I keep the empty values?
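To illustrate what I think is going on, here's a small sketch with plain lists standing in for the findall results (the variable names are made up for this example). As far as I can tell, zip stops at its shortest input, so one empty list wipes out the whole row. itertools.zip_longest with fillvalue seems to pad instead of truncating, although it yields one row per phone number rather than a single row with the numbers joined:

```python
from itertools import zip_longest

# Plain lists standing in for child.findall(tag) on the second <child>
names = ["Tom"]
surnames = ["Cat"]
phones = ["98765", "56789", "00000"]
faxes = []  # no <Fax> element in this <child>

# zip stops at the shortest input, so the empty list yields no rows at all
rows_zip = list(zip(names, surnames, phones, faxes))

# zip_longest pads the shorter inputs with fillvalue instead of truncating
rows_longest = list(zip_longest(names, surnames, phones, faxes, fillvalue=""))

print(rows_zip)
print(rows_longest)
```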
Update: to make it clearer what I intend to do, here's the expected output format (CSV with a semicolon separator, where multiple values in a field are separated by commas):
John;Doe;123456,654321;111111;
Tom;Cat;98765,56789,00000;;
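To pin down that format, here's a rough sketch of how I imagine each row being built without zip: collect all values per tag, join repeats with commas, and leave a missing tag as an empty field. It assumes the standard-library ElementTree and the simplified XML from above; child_to_row is just a hypothetical helper name, and this is not necessarily the best approach:

```python
import xml.etree.ElementTree as ET

# Simplified sample document from the question (XML declaration omitted)
XML = """<root>
  <child>
    <Name>John</Name><Surname>Doe</Surname>
    <Phone>123456</Phone><Phone>654321</Phone>
    <Fax>111111</Fax>
  </child>
  <child>
    <Name>Tom</Name><Surname>Cat</Surname>
    <Phone>98765</Phone><Phone>56789</Phone><Phone>00000</Phone>
  </child>
</root>"""

TAGS = ("Name", "Surname", "Phone", "Fax")


def child_to_row(child, tags=TAGS):
    """Hypothetical helper: one semicolon-terminated field per tag."""
    fields = []
    for tag in tags:
        # findall returns [] for a missing tag, which joins to an empty field
        values = [el.text or "" for el in child.findall(tag)]
        fields.append(",".join(values))
    return ";".join(fields) + ";"


rows = [child_to_row(c) for c in ET.fromstring(XML).findall(".//child")]
for row in rows:
    print(row)
```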
Thanks!