1

I have an XML document from which I want to extract the absolute path to a specific node (mynode) for later use. I retrieve the node like this:

from StringIO import StringIO
from lxml import etree

xml = """
<a1>
    <b1>
        <c1>content1</c1>
    </b1>
    <b1>
        <c1>content2</c1>
    </b1>
</a1>"""
root = etree.fromstring(xml)

i = 0
mynode = root.xpath('//c1')[i]

In order to get the path I currently use

ancestors = mynode.xpath('./ancestor::*')
p = ''.join( map( lambda x: '/' + x.tag , ancestors ) + [ '/' , mynode.tag ] )

p has now the value

/a1/b1/c1

However to store the path for later use I have to store the index i from the first code snippet aswell in order to retrieve the right node because an xpath query for p will contain both nodes c1. I do not want to store that index.

What would be better is a path for xquery which has the index included. For the first c1 node it could look like this:

/a1/b1[1]/c1

or this for the second c1 node

/a1/b1[2]/c1

Anyone an idea how this can be achieved? Is there another method to specify a node and access it later on?

Waschbaer
  • 447
  • 6
  • 18

1 Answers1

1
from lxml import etree
from io import StringIO, BytesIO

# ----------------------------------------------

def node_location(node):
    position = len(node.xpath('./preceding-sibling::' + node.tag)) + 1
    return '/' + node.tag + '[' + str(position) + ']'

def node_path(node):
    nodes = mynode.xpath('./ancestor-or-self::*')
    return ''.join( map(node_location, nodes) )

# ----------------------------------------------

xml = """
<a1>
    <b1>
        <c1>content1</c1>
    </b1>
    <b1>
        <c1>content2</c1>
    </b1>
</a1>"""

root = etree.fromstring(xml)

for mynode in root.xpath('//c1'):
    print node_path(mynode)

prints

/a1[1]/b1[1]/c1[1]
/a1[1]/b1[2]/c1[1]

Is there another method to specify a node and access it later on?

If you mean "persist across separate invocations of the program", then no, not really.

Tomalak
  • 332,285
  • 67
  • 532
  • 628
  • "persist across separate invocations of the program" is what I meant. Thanks a lot for your effort – Waschbaer Oct 18 '15 at 17:57
  • 1
    If namespaces are used one hase to either remove them before using this function like [this](http://stackoverflow.com/questions/9316095/drop-all-namespaces-in-lxml) or alter the code accordingly because the tag attribute returns {mynamespace}mytag and in order to access the node via xquery one must divide namespace and tag with an colon like {mynamespace}:mytag – Waschbaer Oct 18 '15 at 18:36
  • True, namespaces need to be considered. Removing namespaces is not recommendable at all, though. Making the code above namespace-aware is only a minimal change anyway. – Tomalak Oct 18 '15 at 19:04