Return XPath for each XML text element

Question

I want to return an XPath or similar for each text element in the following XML. I tried XPathNodeIterator, but it seems to return only the nodes at the specified level. How can I get all nodes and sub-nodes and return a list of objects like the one below?

String exp = "/*/*/child::*";
XPathNodeIterator NodeIter = navigator.Select(exp);

XML:

<div>
    <p>Title</p>
    <ul>
       <li>Features</li>
    </ul>
    <ul>
       <li>Name</li>
       <li>Age</li>
       <li>Gender</li>
    </ul>
    <h2>Comments</h2>
    <p>Bill</p>
    <p>Link</p>
</div>

Desired Results: I want to get a list of something like (div/p[1], Title), (div/ul[1]/li[1], Features), (div/ul[2]/li[1], Name), (div/ul[2]/li[2], Age), (div/ul[2]/li[3], Gender), (div/h2[1], Comments), (div/p[2], Bill), (div/p[3], Link)

Since when is `[0]` XPath? The index starts at `1` in XPath. If you want to select elements that do not contain other elements, then `//*[not(*)]` should do. As for generating XPath expressions, what do you want to do with XML and namespaces? — Martin Honnen, Apr 30 '15 at 16:40
I suggest you give recursion a try, come back and we'll help you if you have problems with it. — Chuck Savage, Apr 30 '15 at 16:45
I'm not sure I understand your question. Are you trying to 1) find all `XmlElement` nodes that have a text value, or 2) traverse an `XmlElement` hierarchy and generate an XPath query for each one that uniquely specifies it? — dbc, Apr 30 '15 at 18:38
@AlexW. If I understood the question, I have posted an answer that I believe will get the result you need. — jwatts1980, Apr 30 '15 at 21:54
Alex W.: This answer may help you: http://stackoverflow.com/a/4747858/36305 — Dimitre Novatchev, May 01 '15 at 04:07
It's for Linq-to-XML not the older `XmlNode` API, but still related: [Get the XPath to an XElement?](http://stackoverflow.com/questions/451950/get-the-xpath-to-an-xelement). — dbc, May 02 '15 at 03:25
Even more closely related: [How to get xpath from an XmlNode instance. C#](http://stackoverflow.com/questions/241238/how-to-get-xpath-from-an-xmlnode-instance-c-sharp). — dbc, May 02 '15 at 03:33

jwatts1980 · Accepted Answer · 2015-04-30T22:20:55.417

I was not able to find a built-in method that would give you the kind of path that you wanted. But I was able to create a recursive function that would do the trick. Here is the code I came up with:

    private void button1_Click(object sender, EventArgs e)
    {
        string xmlText = textBox1.Text;

        String exp = "//text()";
        XmlDocument xml = new XmlDocument();
        xml.LoadXml(xmlText);

        //Writes the text out to a textbox
        foreach (XmlNode x in xml.SelectNodes(exp))
            textBox2.AppendText("(" + GetPath(x) + ", " + x.InnerText + ")\n");
    }

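    // Builds a path for a node by walking up through its ancestors.
    // Requires System.Linq (for Cast/ToList) and System.Xml.
    string GetPath(XmlNode nd)
    {
        if (nd.ParentNode != null && nd.NodeType == XmlNodeType.Text)
        {
            // A text node is described by the path of its containing element.
            return GetPath(nd.ParentNode);
        }
        else if (nd.ParentNode != null && nd.NodeType != XmlNodeType.Text)
        {
            // Zero-based position of this node among all of its parent's child nodes.
            var index = nd.ParentNode.ChildNodes.Cast<XmlNode>().ToList().IndexOf(nd);
            string path = GetPath(nd.ParentNode);
            path += (path != "") ? "/" : "";
            return string.Format("{0}{1}[{2}]", path, nd.Name, index);
        }
        else return ""; // The document node has no parent and ends the recursion.
    }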

I was testing it on a Form, hence the button click event. Using //text() to get all text nodes was the easy part. Coming up with a recursive function to build the path was a little harder than I expected. It took me a while to figure out that by casting ParentNode.ChildNodes to a collection of XmlNode and then converting it to a list, we can use the IndexOf() method of List to get the index. Note that this index is the node's zero-based position among all of its parent's child nodes, so it is not the one-based, per-element-name position that XPath uses.

Results:

(div[0]/p[0], Title)
(div[0]/ul[1]/li[0], Features)
(div[0]/ul[2]/li[0], Name)
(div[0]/ul[2]/li[1], Age)
(div[0]/ul[2]/li[2], Gender)
(div[0]/h2[3], Comments)
(div[0]/p[4], Bill)
(div[0]/p[5], Link)

One caveat: I don't know what application you will be using this for, but if you are going to use it to get elements from HTML, the LoadXml() function may break. "Valid" HTML is not necessarily well-formed XML, and the load may fail.
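If the paths need to follow XPath's own numbering (one-based, counted among same-name sibling elements, as in the question's desired results), a minimal sketch along the following lines might work. The names `XPathBuilder` and `GetXPath` are hypothetical, the root element is left without an index to match the format in the question, and namespaces are not handled: a prefixed name is emitted as-is and may need an `XmlNamespaceManager` before the path can be evaluated.

```csharp
using System;
using System.Xml;

static class XPathBuilder
{
    // Returns a path such as div/ul[2]/li[1] for an element, or for the
    // element that contains the given text node.
    public static string GetXPath(XmlNode node)
    {
        // The document node (or a missing parent) ends the recursion.
        if (node == null || node.NodeType == XmlNodeType.Document)
            return "";

        // Text and other non-element nodes are described by their parent's path.
        if (node.NodeType != XmlNodeType.Element)
            return GetXPath(node.ParentNode);

        string parentPath = GetXPath(node.ParentNode);
        if (parentPath == "")
            return node.Name; // root element, emitted without an index

        // XPath positions start at 1 and count only same-name element siblings.
        int position = 1;
        for (XmlNode sib = node.PreviousSibling; sib != null; sib = sib.PreviousSibling)
        {
            if (sib.NodeType == XmlNodeType.Element && sib.Name == node.Name)
                position++;
        }

        return parentPath + "/" + node.Name + "[" + position + "]";
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<div><p>Title</p><ul><li>Features</li></ul>" +
                    "<ul><li>Name</li><li>Age</li><li>Gender</li></ul>" +
                    "<h2>Comments</h2><p>Bill</p><p>Link</p></div>");

        // Prints one (path, text) pair per text node.
        foreach (XmlNode text in doc.SelectNodes("//text()"))
            Console.WriteLine("(" + GetXPath(text) + ", " + text.Value + ")");
    }
}
```

Walking back through `PreviousSibling` avoids building a list for every node, and comparing names means an `h2` after two `ul` elements is still reported as `h2[1]`.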

Thanks jwatts1980. Your code worked. I am modifying your solution to see if it can be done using XDocument. — Alex W., May 04 '15 at 15:28
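For the XDocument route mentioned in the comment above, one possible LINQ to XML sketch is shown here. It is not the asker's eventual code; `XElementPaths` and `GetXPath` are hypothetical names, namespaces are ignored (only local names are emitted), and indices are one-based among same-name siblings as XPath expects.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XElementPaths
{
    // Returns a path such as div/ul[1]/li[1] for the given element.
    public static string GetXPath(XElement element)
    {
        // The root element is emitted without an index.
        if (element.Parent == null)
            return element.Name.LocalName;

        // Count same-name siblings that come before this element, then add 1.
        int position = element.ElementsBeforeSelf(element.Name).Count() + 1;

        return GetXPath(element.Parent) + "/" + element.Name.LocalName
               + "[" + position + "]";
    }

    static void Main()
    {
        var doc = XDocument.Parse(
            "<div><p>Title</p><ul><li>Features</li></ul><p>Bill</p></div>");

        // Each text node is reported with the path of its parent element.
        foreach (XText text in doc.DescendantNodes().OfType<XText>())
            Console.WriteLine("(" + GetXPath(text.Parent) + ", " + text.Value + ")");
    }
}
```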

score 1 · Answer 2 · answered May 01 '15 at 04:23

Just run this transformation in .NET (using XslCompiledTransform):

<xsl:stylesheet version="1.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:variable name="vApos">'</xsl:variable>

  <xsl:template match="text()">
     <xsl:apply-templates select="ancestor-or-self::*" mode="path"/>
     <xsl:value-of select="concat('=',$vApos,.,$vApos)"/>
     <xsl:text>&#xA;</xsl:text>
  </xsl:template>

  <xsl:template match="*" mode="path">
    <xsl:value-of select="concat('/',name())"/>
    <xsl:variable name="vnumPrecSiblings" select=
      "count(preceding-sibling::*[name()=name(current())])"/>
    <xsl:if test="$vnumPrecSiblings or following-sibling::*[name()=name(current())]">
        <xsl:value-of select="concat('[', $vnumPrecSiblings +1, ']')"/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

When applied to the provided source XML document (shown here with an empty `<p/>` added after the first `<ul>`):

<div>
    <p>Title</p>
    <ul>
       <li>Features</li>
    </ul>
    <p/>
    <ul>
       <li>Name</li>
       <li>Age</li>
       <li>Gender</li>
    </ul>
    <h2>Comments</h2>
    <p>Bill</p>
    <p>Link</p>
</div>

the wanted, correct result is produced:

/div/p[1]='Title'
/div/ul[1]/li='Features'
/div/ul[2]/li[1]='Name'
/div/ul[2]/li[2]='Age'
/div/ul[2]/li[3]='Gender'
/div/h2='Comments'
/div/p[3]='Bill'
/div/p[4]='Link'
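As a rough illustration of how the stylesheet above could be run with XslCompiledTransform, the sketch below loads the stylesheet and the input from strings and captures the output as a string. The class name `XsltRunner`, the method `Transform`, and the command-line file arguments are assumptions made for this example only.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class XsltRunner
{
    // Applies the stylesheet text to the XML text and returns the result.
    public static string Transform(string xsltText, string xmlText)
    {
        var xslt = new XslCompiledTransform();

        // Compile the stylesheet from its text.
        using (var styleReader = XmlReader.Create(new StringReader(xsltText)))
        {
            xslt.Load(styleReader);
        }

        using (var inputReader = XmlReader.Create(new StringReader(xmlText)))
        using (var output = new StringWriter())
        {
            // No stylesheet parameters are passed, hence the null argument list.
            xslt.Transform(inputReader, null, output);
            return output.ToString();
        }
    }

    static void Main(string[] args)
    {
        // Hypothetical usage: XsltRunner.exe paths.xslt input.xml
        string stylesheet = File.ReadAllText(args[0]);
        string xml = File.ReadAllText(args[1]);
        Console.Write(Transform(stylesheet, xml));
    }
}
```

Writing to a TextWriter lets the transform apply the stylesheet's own `xsl:output` settings; the exact line endings in the result may depend on the writer defaults.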

Thanks Dimitre Novatchev. I tested it and it worked too. — Alex W., May 04 '15 at 15:37
