I've been looking at other questions here on SO about zip and the magic *, which have helped me a lot in understanding how it works. For example:

Even though I still have to think a little about what's actually happening, I have a better understanding now. What I'm trying to achieve is to convert an XML document into CSV. That last link above gets really close to what I want to do; however, my source XML doesn't have the most consistent structure, and that's where I'm hitting a wall. Here's an example of my source XML (simplified for the sake of this example):

<?xml version="1.0" encoding="utf-8"?>
<root>
    <child>
        <Name>John</Name>
        <Surname>Doe</Surname>
        <Phone>123456</Phone>
        <Phone>654321</Phone>
        <Fax>111111</Fax>
    </child>
    <child>
        <Name>Tom</Name>
        <Surname>Cat</Surname>
        <Phone>98765</Phone>
        <Phone>56789</Phone>
        <Phone>00000</Phone>
    </child>
</root>

As you can see I can have 2 or more of the same elements under <child>. Also, if a certain element has no value, it won't even exist (like on the second <child> where there's no <Fax>).

This is the code I currently have:

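# Assumed import; the original post did not show where etree comes from.
from xml.etree import ElementTree as etree

data = etree.parse('test.xml').findall(".//child")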
tags = ('Name', 'Surname', 'Phone', 'Fax')

for child in data:
    for a in zip(*[child.findall(x) for x in tags]):
        print([x.text for x in a])

>> Result:

['John', 'Doe', '123456', '111111']

Although this gives me a format I can use to write a csv, it has two problems:

  1. It skips the 2nd child because it doesn't have the <Fax> element (I suppose). If I only search for elements that exist in both children by setting tags = ('Name', 'Surname'), then I get 2 lists back (great!)

  2. That first child actually has 2 phone numbers but only one is returned

From what I could test, stuff starts to disappear when zip(*...) comes into play... How could I set a default value so I can keep empty values?

Update: to make it more clear what I intend to do, here's the expected output format (CSV with semicolon separator, where multiple values in each field are split by a comma):

John;Doe;123456,654321;111111;
Tom;Cat;98765,56789;00000;;

Thanks!

bergonzzi

2 Answers


You say, in regards to your first problem, that "[i]f I only search for elements that exist in both children ... I have 2 lists back," implying that the lack of output for the second child has something to do with interaction between the two child nodes. That's not the case. The aspect of the behavior of zip that you appear to be overlooking is that zip stops processing its arguments after it's exhausted the shortest one.

Consider the output of the following simplification of your code:

for child in data:
    print([child.findall(x) for x in tags])

The output will be (omitting memory addresses):

[[<Element 'Name'>], [<Element 'Surname'>], [<Element 'Phone'>, <Element 'Phone'>], [<Element 'Fax'>]]
[[<Element 'Name'>], [<Element 'Surname'>], [<Element 'Phone'>, <Element 'Phone'>, <Element 'Phone'>], []]

Notice that the second list has an empty sublist (because the second child has no Fax node). This means that when you zip those sublists together, the process stops immediately and produces no tuples at all: one of the sublists is already exhausted on the first pass. That's why your second child is omitted from the output; it has nothing to do with elements being shared between children.

The same principle of zip's behavior explains your second problem. Notice that the first output list above consists of four elements: a list of length one for three of your tags and a list of length two with the two phone elements. When you zip those together, the process again stops after exhausting any of the sublists. In this case, the shortest sublist has length one, so the result only draws one element from the phone sublist.
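
A minimal sketch of that truncation, using plain lists of strings in place of the Element objects (the values come from the sample XML). `itertools.zip_longest` is shown as one possible way to keep going past the shortest input with a fill value; it is not part of the original code, and in Python 2 it is spelled `izip_longest`.

```python
from itertools import zip_longest

# Stand-ins for [child.findall(tag) for tag in tags] on the first <child>...
first = [["John"], ["Doe"], ["123456", "654321"], ["111111"]]
# ...and on the second <child>, which has no <Fax>.
second = [["Tom"], ["Cat"], ["98765", "56789", "00000"], []]

# zip stops at the shortest input, so only one tuple comes out here.
truncated_first = list(zip(*first))
# One input is empty, so nothing comes out at all.
truncated_second = list(zip(*second))

# zip_longest pads the shorter inputs with fillvalue instead of stopping.
padded_second = list(zip_longest(*second, fillvalue=""))

print(truncated_first)   # [('John', 'Doe', '123456', '111111')]
print(truncated_second)  # []
print(padded_second)
```

Padding keeps the data, but it still spreads the phone numbers over several tuples, so on its own it does not give one row per child.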

I'm not sure exactly what you want your output to look like, but if you're simply trying to construct, for each child node, a list containing the text of each element in that node, you can do something like:

for child in data:
    print([x.text for x in child])

That will produce:

['John', 'Doe', '123456', '654321', '111111']
['Tom', 'Cat', '98765', '56789', '00000']
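
If the goal is instead the semicolon-separated layout from the question's update, one possible approach is to skip zip entirely and join each tag's values per child. The following is only a rough sketch, assuming the sample XML from the question; `child_to_row` is a hypothetical helper name, and the sketch does not emit the trailing semicolon shown in the question.

```python
import csv
import io
from xml.etree import ElementTree as etree

# Sample data copied from the question (XML declaration omitted).
XML = """<root>
    <child>
        <Name>John</Name>
        <Surname>Doe</Surname>
        <Phone>123456</Phone>
        <Phone>654321</Phone>
        <Fax>111111</Fax>
    </child>
    <child>
        <Name>Tom</Name>
        <Surname>Cat</Surname>
        <Phone>98765</Phone>
        <Phone>56789</Phone>
        <Phone>00000</Phone>
    </child>
</root>"""

TAGS = ("Name", "Surname", "Phone", "Fax")


def child_to_row(child, tags=TAGS):
    """Hypothetical helper: one field per tag, in header order.

    Repeated tags are comma-joined; a missing tag gives an empty field.
    """
    return [",".join(el.text or "" for el in child.findall(tag)) for tag in tags]


root = etree.fromstring(XML)
rows = [child_to_row(child) for child in root.findall("child")]

# Write to an in-memory buffer; a real file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue(), end="")
```

With the XML as posted, all three of Tom's `<Phone>` values share one field and his Fax field is empty, because `findall` returns an empty list for a missing tag and joining an empty list gives an empty string.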
Alp
  • Hi Alp, thanks for your reply. However, it's missing a few points. I will edit my question to add the expected output to make it clearer. My purpose is to convert the result into CSV format, so your output would not be OK in that context. This is what I want in the end: `John;Doe;123456,654321;111111;` `Tom;Cat;98765,56789;00000;;` So if I have 2 `<Phone>` elements, they need to end up together in one CSV "field". Additionally, the order is important: being a CSV, each field has to match the corresponding header. – bergonzzi Jul 09 '13 at 11:26

I hacked this together. Read the csv module's documentation and change accordingly if you want a more specific format.

import sys
from collections import Counter
from csv import DictWriter
from io import StringIO
from xml.etree import ElementTree

xml_str = \
'''
<?xml version="1.0" encoding="utf-8"?>
<root>
    <child>
        <Name>John</Name>
        <Surname>Doe</Surname>
        <Phone>123456</Phone>
        <Phone>654321</Phone>
        <Fax>111111</Fax>
    </child>
    <child>
        <Name>Tom</Name>
        <Surname>Cat</Surname>
        <Phone>98765</Phone>
        <Phone>56789</Phone>
        <Phone>00000</Phone>
    </child>
</root>
'''

root = ElementTree.parse(StringIO(xml_str.strip()))
entry_list = []
for child_tag in root.iterfind("child"):
    child_tags = list(child_tag)

    # How many times each tag occurs under this <child>.
    tag_count = Counter(tag.tag for tag in child_tags)

    # Running index for tags that occur more than once.
    m_count = Counter()

    # Repeated tags become "Phone 1", "Phone 2", ...; single tags keep their name.
    tmp_dict = {}
    for tag in child_tags:
        if tag_count[tag.tag] > 1:
            m_count[tag.tag] += 1
            key = "%s %d" % (tag.tag, m_count[tag.tag])
        else:
            key = tag.tag
        tmp_dict[key] = tag.text

    entry_list.append(tmp_dict)

field_order = ["Name", "Surname", "Phone 1", "Phone 2", "Phone 3", "Fax"]

def field_check(q):
    # Known fields sort by their position in field_order; unknown ones go last.
    return field_order.index(q) if q in field_order else sys.maxsize

# Union of every key seen in any entry, sorted into the preferred column order.
all_fields = sorted(set().union(*entry_list), key=field_check)

with open("test.csv", "w", newline="") as file_h:
    writer = DictWriter(file_h, all_fields, restval="", extrasaction="ignore", dialect="excel", lineterminator="\n")
    writer.writeheader()
    writer.writerows(entry_list)
dilbert
  • Wowww... that works perfectly but I'll have to spend quite some hours trying to understand everything you just did there! Thanks! – bergonzzi Jul 09 '13 at 14:09