17

The XML:

<?xml version="1.0"?>
<pages>
    <page>
        <url>http://example.com/Labs</url>
        <title>Labs</title>
        <subpages>
            <page>
                <url>http://example.com/Labs/Email</url>
                <title>Email</title>
                <subpages>
                    <page>
                        <url>http://example.com/Labs/Email/How_to</url>
                        <title>How-To</title>
                    </page>
                </subpages>
            </page>
            <page>
                <url>http://example.com/Labs/Social</url>
                <title>Social</title>
            </page>
        </subpages>
    </page>
    <page>
        <url>http://example.com/Tests</url>
        <title>Tests</title>
        <subpages>
            <page>
                <url>http://example.com/Tests/Email</url>
                <title>Email</title>
                <subpages>
                    <page>
                        <url>http://example.com/Tests/Email/How_to</url>
                        <title>How-To</title>
                    </page>
                </subpages>
            </page>
            <page>
                <url>http://example.com/Tests/Social</url>
                <title>Social</title>
            </page>
        </subpages>
    </page>
</pages>

The code:

# rexml is the XML string read from a URL
from xml.etree import ElementTree as ET

tree = ET.fromstring(rexml)
for node in tree.iter('page'):
    for url in node.iterfind('url'):
        print(url.text)
    for title in node.iterfind('title'):
        print(title.text)
    print('-' * 30)

The output:

http://example.com/Labs
Labs
------------------------------
http://example.com/Labs/Email
Email
------------------------------
http://example.com/Labs/Email/How_to
How-To
------------------------------
http://example.com/Labs/Social
Social
------------------------------
http://example.com/Tests
Tests
------------------------------
http://example.com/Tests/Email
Email
------------------------------
http://example.com/Tests/Email/How_to
How-To
------------------------------
http://example.com/Tests/Social
Social
------------------------------

The XML represents a tree-like structure of a sitemap.

I have been up and down the docs and Google all day and can't figure out how to get the node depth of entries.

I tried counting the children of each container, but that only works for the first parent; after that it breaks, because I can't figure out how to reset the count. It was probably a hackish idea anyway.

The desired output:

0
http://example.com/Labs
Labs
------------------------------
1
http://example.com/Labs/Email
Email
------------------------------
2
http://example.com/Labs/Email/How_to
How-To
------------------------------
1
http://example.com/Labs/Social
Social
------------------------------
0
http://example.com/Tests
Tests
------------------------------
1
http://example.com/Tests/Email
Email
------------------------------
2
http://example.com/Tests/Email/How_to
How-To
------------------------------
1
http://example.com/Tests/Social
Social
------------------------------
maxschlepzig
transilvlad
  • Could you please provide an example xml? Plus, are you required to count node depth - I mean, is it ok to parse the url itself and count how many items are there after the domain name? – alecxe Jun 24 '13 at 12:41
  • Added XML example. The ones I use have a depth of up to 7 and total no pf pages of over 500. – transilvlad Jun 24 '13 at 12:51
  • 1
    @tntu, Example seems broken. – falsetru Jun 24 '13 at 13:06
  • 1
    ElementTree Elements have no handle on the parent node, so walking back up the tree to count the depth of a particular node isn't possible unless you create some sort of node mapping. you can [construct that mapping yourself](http://stackoverflow.com/a/2170994/748858) however. – mgilson Jun 24 '13 at 13:09
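As a quick illustration of the parent-mapping workaround mentioned in the last comment, here is a minimal sketch (the sitemap snippet and the `depth_of` helper are illustrative, not from the original thread):

```python
# Sketch of the parent-map idea: ElementTree elements carry no parent
# pointer, so build a child -> parent dict once and count hops to the root.
import xml.etree.ElementTree as ET

rexml = ("<pages><page><title>Labs</title><subpages>"
         "<page><title>Email</title></page>"
         "</subpages></page></pages>")

root = ET.fromstring(rexml)
parent_map = {child: parent for parent in root.iter() for child in parent}

def depth_of(node):
    # Number of ancestors between node and the root element.
    d = 0
    while node in parent_map:
        node = parent_map[node]
        d += 1
    return d

for page in root.iter('page'):
    print(depth_of(page), page.findtext('title'))
```

Note that this counts element levels, so a page nested inside `<subpages>` sits two levels below its parent page.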

6 Answers

12

The Python ElementTree API provides iterators for depth-first traversal of an XML tree - unfortunately, those iterators don't provide any depth information to the caller.

But you can write a depth-first iterator that also returns the depth information for each element:

import xml.etree.ElementTree as ET

def depth_iter(element, tag=None):
    # Stack of child iterators; its length tracks the current depth.
    stack = [iter([element])]
    while stack:
        e = next(stack[-1], None)
        if e is None:
            stack.pop()
        else:
            stack.append(iter(e))
            if tag is None or e.tag == tag:
                yield (e, len(stack) - 1)

Note that this is more efficient than determining the depth by following parent links (as with lxml) - i.e. it is O(n) vs. O(n log n).
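For reference, a self-contained sketch of how the iterator can be applied to a reduced sitemap like the one in the question. The yielded depth counts element levels from the root, so the mapping onto page-nesting depth below is my own addition, not part of the original answer:

```python
# depth_iter as defined above, applied to a reduced sitemap.
import xml.etree.ElementTree as ET

def depth_iter(element, tag=None):
    stack = [iter([element])]
    while stack:
        e = next(stack[-1], None)
        if e is None:
            stack.pop()
        else:
            stack.append(iter(e))
            if tag is None or e.tag == tag:
                yield (e, len(stack) - 1)

rexml = ("<pages><page><url>http://example.com/Labs</url><title>Labs</title>"
         "<subpages><page><url>http://example.com/Labs/Email</url>"
         "<title>Email</title></page></subpages></page></pages>")

root = ET.fromstring(rexml)
for page, depth in depth_iter(root, 'page'):
    # depth is in element levels (the root <pages> is level 1); each page
    # nesting step adds <subpages><page>, i.e. two element levels.
    nesting = (depth - 2) // 2
    print(nesting, page.findtext('url'))
```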

maxschlepzig
  • 35,645
  • 14
  • 145
  • 182
  • That code is not working with Python 2. `iter(e)` is deprecated and should be replaced by `e.iter()` but still it is not working as expected. Could you please check again? – Christian K. Aug 25 '18 at 11:54
  • @ChristianK. Where do you read the `iter()` is deprecated? I initially wrote the code for Python 3. But I just tested it under Python 2.7.15 and it works for me without any modifications: `root = ET.fromstring('1')` such that `list(depth_iter(root))` returns a list with all elements and proper depth information, as expected. – maxschlepzig Aug 25 '18 at 18:29
  • Sorry, completely my fault. Code works as expected. But the xml above is malformed. – Christian K. Sep 01 '18 at 12:31
  • @ChristianK. what do you mean with malformed? For example, `xmllint` happily processes it. – maxschlepzig Sep 01 '18 at 15:01
  • There is no closing tag for the `` tag of line 10 and I think that `` is not a correct closing tag. – Christian K. Sep 02 '18 at 00:33
  • @ChristianK., I thought you meant my small XML example in my previous comment. Yes, the OP posted some invalid XML in his question. Thus, I've edited the question and made his example valid. – maxschlepzig Sep 02 '18 at 10:36
  • And how does one then use `depth_iter` to actually get the output? – Confounded Aug 17 '20 at 11:18
  • @Confounded - you can play around with the snippet I posted in my [other comment](https://stackoverflow.com/questions/17275524/xml-etree-elementtree-get-node-depth/41536162?noredirect=1#comment90990582_41536162). Does this help? – maxschlepzig Aug 17 '20 at 17:43
6
import xml.etree.ElementTree as etree

tree = etree.ElementTree(etree.fromstring(rexml))
maxdepth = 0

def depth(elem, level):
    """Recursively walk the tree, tracking the maximum depth seen."""
    global maxdepth
    if level == maxdepth:
        maxdepth += 1
    # recursive call for each child, one level deeper
    for child in elem:
        depth(child, level + 1)

depth(tree.getroot(), -1)
print(maxdepth)
Rachit
    It is always recommended to add at least a minimal functional description of the code provided, explaining how it answers the question. – Roberto Caboni Feb 08 '20 at 21:37
4

Using lxml.html:

import lxml.html

rexml = ...

def depth(node):
    # Walk parent links up to the root, counting the hops.
    d = 0
    while node is not None:
        d += 1
        node = node.getparent()
    return d

tree = lxml.html.fromstring(rexml)
for node in tree.iter('page'):
    print(depth(node))
    for url in node.iterfind('url'):
        print(url.text)
    for title in node.iterfind('title'):
        print(title.text)
    print('-' * 30)
falsetru
  • Much faster than 'node.xpath("count(ancestor::*)")' or 'len(tree.getelementpath(node).split("/"))'. Thanks :). – ephes Jun 02 '16 at 12:17
    This doesn't really answer the question since the OP asked how to achieve it with `from xml.etree import ElementTree as ET`, not `lxml.html`. – Confounded Aug 17 '20 at 11:15
1

My approach: a recursive function that lists nodes with their level. You must first set the initial depth of the node you are passing:

# Definition of recursive function
def listchildrens(node, depth):
    # Print node, indented by depth
    print(" " * depth, "Type", node.tag, "Attributes", node.attrib, "Depth", depth)
    # If the node has children, recurse into each of them with increased depth
    if len(node) > 0:
        depth += 1
        for child in node:
            listchildrens(child, depth)

# Define starting depth
startdepth = 1
# Call the function with the XML body and starting depth
listchildrens(xmlBody, startdepth)
0

lxml is best for this, but if you have to use the standard library, do not use iter; walk the tree yourself, so you always know where you are.

from xml.etree import ElementTree as ET

tree = ET.fromstring(rexml)

def sub(node, tag):
    return node.findall(tag) or []

def print_page(node, depth):
    print(depth)
    url = node.find("url")
    if url is not None:
        print(url.text)
    title = node.find("title")
    if title is not None:
        print(title.text)
    print('-' * 30)

def find_pages(node, depth=0):
    for page in sub(node, "page"):
        print_page(page, depth)
        subpage = page.find("subpages")
        if subpage is not None:
            find_pages(subpage, depth + 1)

find_pages(tree)
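A quick self-contained check of this recursive walk; the reduced sitemap snippet and the `seen` list are illustrative additions, collecting (depth, url) pairs instead of printing:

```python
# Sanity check: recurse into <subpages>, recording depth per page.
from xml.etree import ElementTree as ET

rexml = ("<pages><page><url>http://example.com/Labs</url><title>Labs</title>"
         "<subpages><page><url>http://example.com/Labs/Email</url>"
         "<title>Email</title></page></subpages></page></pages>")

tree = ET.fromstring(rexml)
seen = []

def find_pages(node, depth=0):
    for page in node.findall("page"):
        seen.append((depth, page.findtext("url")))
        subpages = page.find("subpages")
        if subpages is not None:
            find_pages(subpages, depth + 1)

find_pages(tree)
print(seen)
```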
Txema
-2

Another easy way to do this, without using an XML library, is to count indentation (assuming the XML arrives on standard input, one line at a time, preceded by a line count, and is indented with four spaces per level):

depth = 0
# first input line: the number of XML lines that follow
for i in range(int(input())):
    # count runs of four spaces as a rough proxy for nesting level
    tab = input().count('    ')
    if tab > depth:
        depth = tab
print(depth)
Adrian Mole
Darkstar Dream