0

I have this xml file layout and I want to extract all children names (Amy, Max, and Derek):

  <data>
    <dataentry>
        <Name>John</Name>
        <Birthday>3/3/93</BirthDay>
        <Children>
            <Child> Amy </Child>
            <Child> Max </Child>
            <Child> Derek </Child>
         </Children>
    </dataentry>

    <dataentry>
          ....
    </dataentry>
  </data>

Python code:

root = tree.getroot()
for dataentry in root.findall('dataentry'):
   for children in dataentry.findall('Children'):
      for child in children.findall('Child'):
          print child.text

I have this nested for loop but is there a faster or more elegant way?

alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
user1224478
  • 345
  • 2
  • 5
  • 15
  • regexp could be faster – Iłya Bursov May 08 '15 at 21:16
  • @Lashane it could be faster, but I'm not sure about whether this is a right thing to do ([reference](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). – alecxe May 08 '15 at 21:22
  • @alecxe regexp are not right way to parse xml or html for sure, but for some simple and well defined files they work just fine – Iłya Bursov May 08 '15 at 21:23
  • @Lashane, regexps could also (would also) be *wrong*. Add CDATA sections or comments, and a naive regex will no have idea that contents inside them shouldn't be matched against. – Charles Duffy May 08 '15 at 21:26
  • Is `` your root? Is your root somewhere else? Nailing down the root to reduce the amount of recursive searching needed is going to be one of the biggest improvements possible. – Charles Duffy May 08 '15 at 21:27
  • 1
    Also, if you're doing this for a really huge amount of data, or performing this kind of operation repeatedly, you might want to consider using an XQuery database, which will index your XML content for (extremely) fast searches of this type. – Charles Duffy May 08 '15 at 21:28

2 Answers2

1

You can do it in a single loop using xpath():

for child in root.xpath("./dataentry/Children/Child"):
    print child.text

taking into account that data is your root.

alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
  • 1
    its definitely more elegant, but I'm not sure about faster – Iłya Bursov May 08 '15 at 21:18
  • 1
    Using `./data/dataentry/Children/Child` will be more efficient than `//data/dataentry/Children/Child`, since you're preventing the need to search recursively for starting points. Granted, this means we need the OP to clarify their starting point. – Charles Duffy May 08 '15 at 21:25
  • 1
    @CharlesDuffy Starting point is definitely `data`. Otherwise, the presented by the OP code would not work since `findall()` is not recursive. I've edited the expression. Thanks! – alecxe May 08 '15 at 21:28
0

You can use a SAX parser to do this. The idea is that the parser will perform actions while it is traversing, instead of reading everything into a tree and searching for children later. This saves both memory and time. However, this will print all child nodes regardless of path, so it may or may not be what you want.

from xml import sax


class SAXParser(sax.ContentHandler):
    def __init__(self):
        self.current_string = None

    def characters(self, content):
        self.current_string = content

    def endElement(self, name):
        if name == 'Child':
            print self.current_string

sax.parseString(<string_to_parse>, SAXParser())
Lawrence H
  • 467
  • 4
  • 10
  • You could adopt this to filter based on the path rather easily -- track whether we're on a false track (track nesting depth; any first-level entry other than `dataentry`, any second-level entry other than `Children`, or any third-level entry other than `Child`, and set a flag indicating that we can ignore/discard any data under it). – Charles Duffy May 08 '15 at 21:49
  • That said, I'm actually curious as to whether or not it _is_ more CPU-efficient to go this route in practice (no question that it's more memory-efficient). Sure, building a DOM structure is a lot of expense/overhead, but it's expense/overhead done only in C-library code, vs here where we're doing a bunch of calls in and out of the Python interpreter. – Charles Duffy May 08 '15 at 21:50
  • Don't forget actual performance will also vary depending on the XML in question. For example, if the probability that a node is a `Child` is low, the DOM approach would waste a lot of time building and traversing. – Lawrence H May 08 '15 at 23:49