
I am wondering if it is possible to create an XSLT stylesheet that would extract XPaths for all leaf elements in a given XML file. E.g. for

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <item1>value1</item1>
    <subitem>
        <item2>value2</item2>
    </subitem>
</root>

The output would be

/root/item1
/root/subitem/item2
svick
the_joric

4 Answers

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text" indent="no" />

    <xsl:template match="*[not(*)]">
        <xsl:for-each select="ancestor-or-self::*">
            <xsl:value-of select="concat('/', name())"/>

            <xsl:if test="count(preceding-sibling::*[name() = name(current())]) != 0">
                <xsl:value-of select="concat('[', count(preceding-sibling::*[name() = name(current())]) + 1, ']')"/>
            </xsl:if>
        </xsl:for-each>
        <xsl:text>&#xA;</xsl:text>
        <xsl:apply-templates select="*"/>
    </xsl:template>

    <xsl:template match="*">
        <xsl:apply-templates select="*"/>
    </xsl:template>

</xsl:stylesheet>

outputs:

/root/item1
/root/subitem/item2
Kirill Polishchuk
  • Kirill, it looks like both you and @Dimitre omit the `[1]` for the first in a series of same-named siblings, yet that means the generated XPath for the first sibling will select all the siblings. – LarsH Jan 30 '12 at 16:35
  • Very useful! Thanks. – GhislainCote Jul 28 '15 at 17:23

This transformation:

<xsl:stylesheet version="1.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output omit-xml-declaration="yes" indent="yes"/>
        <xsl:strip-space elements="*"/>

        <xsl:variable name="vApos">'</xsl:variable>

        <xsl:template match="*[@* or not(*)] ">
          <xsl:if test="not(*)">
             <xsl:apply-templates select="ancestor-or-self::*" mode="path"/>
             <xsl:text>&#xA;</xsl:text>
          </xsl:if>
            <xsl:apply-templates select="@*|*"/>
        </xsl:template>

        <xsl:template match="*" mode="path">
            <xsl:value-of select="concat('/',name())"/>
            <xsl:variable name="vnumSiblings" select=
             "count(../*[name()=name(current())])"/>
            <xsl:if test="$vnumSiblings > 1">
                <xsl:value-of select=
                 "concat('[',
                         count(preceding-sibling::*
                                [name()=name(current())]) +1,
                         ']')"/>
            </xsl:if>
        </xsl:template>

        <xsl:template match="@*">
            <xsl:apply-templates select="../ancestor-or-self::*" mode="path"/>
            <xsl:value-of select="concat('[@',name(), '=',$vApos,.,$vApos,']')"/>
            <xsl:text>&#xA;</xsl:text>
        </xsl:template>
</xsl:stylesheet>

when applied on the provided XML document:

<root>
    <item1>value1</item1>
    <subitem>
        <item2>value2</item2>
    </subitem>
</root>

produces the wanted, correct result:

/root/item1
/root/subitem/item2

With this XML document:

<root>
    <item1>value1</item1>
    <subitem>
        <item>value2</item>
        <item>value3</item>
    </subitem>
</root>

it correctly produces:

/root/item1
/root/subitem/item[1]
/root/subitem/item[2]

See also this related answer: https://stackoverflow.com/a/4747858/36305
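
To illustrate the attribute template, here is a hypothetical variation of the question's input (the `id` attribute is an assumption added for this example and is not part of the original document):

```xml
<root>
    <item1 id="a">value1</item1>
    <subitem>
        <item2>value2</item2>
    </subitem>
</root>
```

Tracing the stylesheet above by hand on this input, the result should be:

```
/root/item1
/root/item1[@id='a']
/root/subitem/item2
```

Note that the line generated for an attribute is an expression that selects the element carrying that attribute value, not the attribute node itself.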

Dimitre Novatchev
  • Wow, that's impressive :) Thanks! However, I've marked Kirill's answer, since he was first and, after the edit, his script produces the correct result. – the_joric Jan 30 '12 at 15:15
  • Dmitri can you describe what the template for attributes is doing, and why you included it? – LarsH Jan 30 '12 at 15:25
  • @LarsH: This is a generic solution, that produces an XPath expression for every "leaf" element node and for every attribute. Had there been any attributes in the provided XML document, the corresponding XPath expression for any attribute would also have been produced. – Dimitre Novatchev Jan 30 '12 at 15:29
  • @the_joric: You are welcome. As for the other answer corrected by edit, yes, it is very easy to correct a wrong answer after you have one correct answer published -- just copy/paste ... The fact which was the first *correct* answer published remains unchanged ... :) – Dimitre Novatchev Jan 30 '12 at 15:33
  • Oops, sorry I misspelled your name in previous comment. Also, I was wondering why you used `../ancestor-or-self::*` instead of `ancestor::*`. At first I thought it might be because the latter would yield reverse document order. But then I realized it wouldn't matter... apply-templates ignores the order of the select, and there was nothing else in that XPath expression to be affected by the reverse order. – LarsH Jan 30 '12 at 16:14
  • @Dimitre, re: attribute nodes: I don't think it's fruitful to be too pedantic about which was the first correct answer. After all, the question only asked for elements, but your answer gives paths for attributes as well. – LarsH Jan 30 '12 at 16:18
  • P.S. Martin's comment is insightful: when it comes to namespaces, things really *do* get complicated. Since namespace prefix bindings can be different in one part of a document from another, and since a given XPath (1.0 at least) expression cannot change bindings in midstream, it seems to me that your and Kirill's solutions can both give incorrect results in general. E.g. `` What XPath do you generate for `c:b` and `e:b`? – LarsH Jan 30 '12 at 16:30
  • @LarsH: Re: attribute nodes: I think that my answer has *added value* and is in fact even more valuable than the OP expected. To quote Henry Ford: "If I did what people wanted, then I would have given them a better horse" :) – Dimitre Novatchev Jan 30 '12 at 16:55
  • @LarsH: I updated my answer so that now `[1]` isn't omitted when it is necessary. Thanks for the observation. – Dimitre Novatchev Jan 30 '12 at 17:03
  • Re: Henry Ford: next time you ask for a list of prime numbers, I'll give you a list of primes with all the composites (indistinguishably mixed), for "added value." ;-) – LarsH Jan 30 '12 at 22:47

I think the following correction only matters in unusual cases where sibling elements in a document use different prefixes for the same namespace, or the same prefix for different namespaces. However, there is nothing theoretically wrong with such input, and it could be common in certain kinds of generated XML.

Anyway, the following answer fixes that case (copied-and-modified from @Kirill's answer):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="text" indent="no" />

   <xsl:template match="*[not(*)]">
      <xsl:for-each select="ancestor-or-self::*">
         <xsl:value-of select="concat('/', name())"/>

         <!-- Suggestions on how to refactor the repetition of long XPath
              expression parts are welcome. -->
         <xsl:if test="count(../*[local-name() = local-name(current())
               and namespace-uri(.) = namespace-uri(current())]) > 1">
            <xsl:value-of select="concat('[', count(
               preceding-sibling::*[local-name() = local-name(current())
               and namespace-uri(.) = namespace-uri(current())]) + 1, ']')"/>
         </xsl:if>
      </xsl:for-each>
      <xsl:text>&#xA;</xsl:text>
      <xsl:apply-templates select="*"/>
   </xsl:template>

   <xsl:template match="*">
      <xsl:apply-templates select="*"/>
   </xsl:template>

</xsl:stylesheet>

It also addresses the problem in other answers where elements that are first in a series of siblings lack a position predicate.

E.g. for the input

<root>
   <item1>value1</item1>
   <subitem>
      <a:item xmlns:a="uri">value2</a:item>
      <b:item xmlns:b="uri">value3</b:item>
   </subitem>
</root>

this answer produces

/root/item1
/root/subitem/a:item[1]
/root/subitem/b:item[2]

which is correct.

However, like all XPath expressions, these will only work if the environment using them specifies correct bindings for the namespace prefixes used. In theory there can be more pathological documents for which the above answer generates XPath expressions that can never work (in XPath 1.0 at least) regardless of the prefix bindings. E.g. this input:

<root>
   <item1>value1</item1>
   <a:subitem xmlns:a="differentURI">
      <a:item xmlns:a="uri">value2</a:item>
      <b:item xmlns:b="uri">value3</b:item>
   </a:subitem>
</root>

produces the output

/root/item1
/root/a:subitem/a:item[1]
/root/a:subitem/b:item[2]

But the second XPath expression here can never work, since the prefix a refers to two different namespaces in the same expression.
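
One common way around the prefix problem is to avoid prefixes in the generated expression altogether and test `local-name()` and `namespace-uri()` in a predicate instead. The following is a minimal sketch of that idea, not a tested replacement for the stylesheet above. It assumes that no namespace URI in the input contains an apostrophe (which would break the generated string literal), and it always emits a position predicate for simplicity.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="text"/>

   <!-- An apostrophe, used to quote string literals in the generated XPath. -->
   <xsl:variable name="vApos">'</xsl:variable>

   <xsl:template match="*[not(*)]">
      <xsl:for-each select="ancestor-or-self::*">
         <!-- Test local name and namespace URI, so no prefix is ever emitted. -->
         <xsl:value-of select="concat('/*[local-name()=', $vApos, local-name(), $vApos,
            ' and namespace-uri()=', $vApos, namespace-uri(), $vApos, ']')"/>
         <!-- Position among siblings that have the same expanded name. -->
         <xsl:value-of select="concat('[', count(preceding-sibling::*[
            local-name() = local-name(current())
            and namespace-uri() = namespace-uri(current())]) + 1, ']')"/>
      </xsl:for-each>
      <xsl:text>&#xA;</xsl:text>
   </xsl:template>

   <!-- Non-leaf elements: just descend into child elements. -->
   <xsl:template match="*">
      <xsl:apply-templates select="*"/>
   </xsl:template>

</xsl:stylesheet>
```

For the last input above, the `value3` leaf would come out as `/*[local-name()='root' and namespace-uri()=''][1]/*[local-name()='subitem' and namespace-uri()='differentURI'][1]/*[local-name()='item' and namespace-uri()='uri'][2]`. The expressions are much longer, but they do not depend on any prefix bindings in the environment that evaluates them.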

LarsH

Well, you can find leaf elements with //*[not(*)], and of course you can then for-each over the ancestor-or-self axis to output the path. But once namespaces are involved, generating XPath expressions becomes complicated.
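
The approach described above can be sketched as follows. This is a minimal illustration only: it emits no position predicates, so same-named siblings would produce identical paths, and it makes no attempt to handle namespaces.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="text"/>

   <xsl:template match="/">
      <!-- Every element that has no element children is a leaf. -->
      <xsl:for-each select="//*[not(*)]">
         <!-- Walk from the document element down to the leaf itself. -->
         <xsl:for-each select="ancestor-or-self::*">
            <xsl:value-of select="concat('/', name())"/>
         </xsl:for-each>
         <xsl:text>&#xA;</xsl:text>
      </xsl:for-each>
   </xsl:template>

</xsl:stylesheet>
```

Applied to the XML document from the question, this should print `/root/item1` and `/root/subitem/item2`, one per line.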

Martin Honnen
  • Very good and non-obvious point about namespaces. I think this complication is frequently overlooked when generating XPath expressions for nodes in a document. – LarsH Jan 30 '12 at 16:32