43

I'm interested in advice, pseudocode, or an explanation rather than an actual implementation.

  • I'd like to go through XML document, all of its nodes
  • Check the node for attribute existence

If a node has no attributes, get/generate a String containing its XPath.
If a node does have attributes, iterate through the attribute list and create an XPath for each attribute, including the node itself as well.

Edit

My reason for doing this: I'm writing automated tests in JMeter, so for every request I need to verify that the request actually did its job. I do that by asserting on node values retrieved with XPath.

When the request is small it's not a problem to create asserts by hand, but for larger ones it's really a pain.

I'm looking for Java approach.

Goal

My goal is to achieve the following from this example XML file:

<root>
    <elemA>one</elemA>
    <elemA attribute1='first' attribute2='second'>two</elemA>
    <elemB>three</elemB>
    <elemA>four</elemA>
    <elemC>
        <elemB>five</elemB>
    </elemC>
</root>

to produce the following :

//root[1]/elemA[1]='one'
//root[1]/elemA[2]='two'
//root[1]/elemA[2][@attribute1='first']
//root[1]/elemA[2][@attribute2='second']
//root[1]/elemB[1]='three'
//root[1]/elemA[3]='four'
//root[1]/elemC[1]/elemB[1]='five'

Explained :

  • If the node value/text is not null/empty, get its xpath and append ='nodevalue' for assertion purposes
  • If the node has attributes, create an assert for each of them too
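
For context, each generated line is itself an XPath expression that evaluates to a boolean, which is what makes it usable as an assertion. Below is a minimal sketch of checking such lines with the JDK's `javax.xml.xpath` API. The class name `XPathAssertions` and the method `holds` are illustrative names only, not part of JMeter or any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathAssertions {

    // Returns true when the XPath expression holds for the given XML text.
    public static boolean holds(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // A comparison such as path='one' is already a boolean. A bare path with
            // a predicate is a node-set, which converts to true when it is non-empty.
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><elemA>one</elemA>"
                + "<elemA attribute1='first' attribute2='second'>two</elemA></root>";
        System.out.println(holds(xml, "//root[1]/elemA[1]='one'"));                 // true
        System.out.println(holds(xml, "//root[1]/elemA[2][@attribute1='first']"));  // true
        System.out.println(holds(xml, "//root[1]/elemA[2]='three'"));               // false
    }
}
```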

Update

I found this example. It doesn't produce the correct results, but I'm looking for something like this:

http://www.coderanch.com/how-to/java/SAXCreateXPath

halfer
ant
  • Good question, +1. See my answer for a complete XSLT 1.0 solution that takes a parameter that contains a node-set and produces an XPath expression for every node in this node-set. The nodes can be of any type: document-node, element, text-node, attribute, comment, PI, namespace. – Dimitre Novatchev Jan 20 '11 at 13:49
  • What kind of XPath expression do you want though? You can simply take the index of each element in its parent's `getChildren()` nodelist and create an xpath like `/*[5]/*[2]/*[8]/@yourattr`. But if you want to assert results, shouldn't you be doing it the other way around? Write an xpath expression that returns true if your XML is correct and false if it isn't, then evaluate it? – biziclop Jan 23 '11 at 14:45
  • @biziclop I want to create xpaths from the request I send (so I can use them to verify the results), not the other way around. I updated my question – ant Jan 23 '11 at 15:01
  • @c0mrade: There are holes in your updated question. What if an element has more than one text node like in: `text 1text 2` How should the wanted solution process any such element? I will update my answer with both an XSLT solution and a C# solution (my Java is a bit rusty) -- will this be useful to you? – Dimitre Novatchev Jan 23 '11 at 15:58
  • @Dimitre Novatchev thank you for commenting, as far as I can see that case never occurs in my xml files, and I don't think it will. As BalusC suggested I could let java run XSLT, if it produces the correct output as example I posted above. tnx – ant Jan 23 '11 at 16:02
  • @c0mrade: That is good to know, thanks. So, it may be useful if you put this clarification in the question itself. From your last comment, may I conclude that going forward with XSLT and possibly C# solution will be valuable to you? – Dimitre Novatchev Jan 23 '11 at 16:12
  • @Dimitre Novatchev yes it would be most welcome. Thank you – ant Jan 23 '11 at 16:18
  • @c0mrade: I have produced a complete and very short (30 lines) XSLT solution that is also easy to understand and solves your problem exactly. – Dimitre Novatchev Jan 23 '11 at 17:03
  • @c0mrade: I have also added a step-by-step explanation of the solution. Thank you for your appreciation. – Dimitre Novatchev Jan 23 '11 at 17:07
  • @Dimitre Novatchev thanks really appreciate it – ant Jan 23 '11 at 17:23
  • @c0mrade: Thank you for the new refinement of the problem. Yes, it was easy to adjust my solution to deal with the updated format. I have updated in my answer both the code and the explanation. Thank you for providing this nice problem. – Dimitre Novatchev Jan 24 '11 at 14:26
  • Possible duplicate of [how to retrieve corresponding xpath](http://stackoverflow.com/questions/1956534/how-to-retrieve-corresponding-xpath) – james.garriss Nov 10 '15 at 13:58

8 Answers

49

Update:

@c0mrade has updated his question. Here is a solution to it:

This XSLT transformation:

<xsl:stylesheet version="1.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:strip-space elements="*"/>
    
    <xsl:variable name="vApos">'</xsl:variable>
    
    <xsl:template match="*[@* or not(*)] ">
      <xsl:if test="not(*)">
         <xsl:apply-templates select="ancestor-or-self::*" mode="path"/>
         <xsl:value-of select="concat('=',$vApos,.,$vApos)"/>
         <xsl:text>&#xA;</xsl:text>
        </xsl:if>
        <xsl:apply-templates select="@*|*"/>
    </xsl:template>
    
    <xsl:template match="*" mode="path">
        <xsl:value-of select="concat('/',name())"/>
        <xsl:variable name="vnumPrecSiblings" select=
         "count(preceding-sibling::*[name()=name(current())])"/>
        <xsl:if test="$vnumPrecSiblings">
            <xsl:value-of select="concat('[', $vnumPrecSiblings +1, ']')"/>
        </xsl:if>
    </xsl:template>
    
    <xsl:template match="@*">
        <xsl:apply-templates select="../ancestor-or-self::*" mode="path"/>
        <xsl:value-of select="concat('[@',name(), '=',$vApos,.,$vApos,']')"/>
        <xsl:text>&#xA;</xsl:text>
    </xsl:template>
</xsl:stylesheet>

when applied on the provided XML document:

<root>
    <elemA>one</elemA>
    <elemA attribute1='first' attribute2='second'>two</elemA>
    <elemB>three</elemB>
    <elemA>four</elemA>
    <elemC>
        <elemB>five</elemB>
    </elemC>
</root>

produces exactly the wanted, correct result:

/root/elemA='one'
/root/elemA[2]='two'
/root/elemA[2][@attribute1='first']
/root/elemA[2][@attribute2='second']
/root/elemB='three'
/root/elemA[3]='four'
/root/elemC/elemB='five'

When applied to the newly-provided document by @c0mrade:

<root>
    <elemX serial="kefw90234kf2esda9231">
        <id>89734</id>
    </elemX>
</root>

again the correct result is produced:

/root/elemX[@serial='kefw90234kf2esda9231']
/root/elemX/id='89734'
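
For readers who want to drive this from Java, here is a rough sketch that runs the transformation through the JDK's built-in `javax.xml.transform` API. The class name `XsltXPathLister` and the method `transform` are illustrative names, the stylesheet is embedded as a text block (which assumes Java 15 or later), and the behaviour depends on the XSLT 1.0 processor that ships with the JDK in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltXPathLister {

    // The stylesheet from this answer, embedded as a text block (needs Java 15+).
    static final String STYLESHEET = """
        <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="text"/>
          <xsl:strip-space elements="*"/>
          <xsl:variable name="vApos">'</xsl:variable>
          <xsl:template match="*[@* or not(*)]">
            <xsl:if test="not(*)">
              <xsl:apply-templates select="ancestor-or-self::*" mode="path"/>
              <xsl:value-of select="concat('=',$vApos,.,$vApos)"/>
              <xsl:text>&#xA;</xsl:text>
            </xsl:if>
            <xsl:apply-templates select="@*|*"/>
          </xsl:template>
          <xsl:template match="*" mode="path">
            <xsl:value-of select="concat('/',name())"/>
            <xsl:variable name="vnumPrecSiblings"
                select="count(preceding-sibling::*[name()=name(current())])"/>
            <xsl:if test="$vnumPrecSiblings">
              <xsl:value-of select="concat('[', $vnumPrecSiblings +1, ']')"/>
            </xsl:if>
          </xsl:template>
          <xsl:template match="@*">
            <xsl:apply-templates select="../ancestor-or-self::*" mode="path"/>
            <xsl:value-of select="concat('[@',name(), '=',$vApos,.,$vApos,']')"/>
            <xsl:text>&#xA;</xsl:text>
          </xsl:template>
        </xsl:stylesheet>
        """;

    // Applies the stylesheet to the XML text and returns one XPath assertion per line.
    public static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Calling `XsltXPathLister.transform(xmlText)` returns the lines as a single string that can be split on newlines and fed into the assertions.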

Explanation:

  • Only elements that have no child elements, or that have attributes, are matched and processed.

  • For any such element, if it has no child elements, all of its ancestor-or-self elements are processed in a specific mode, named 'path'. Then the "='theValue'" part is output, followed by a NL character.

  • All attributes of the matched element are then processed.

  • Finally, templates are applied to all child elements.

  • Processing an element in the 'path' mode is simple: a / character and the name of the element are output. Then, if there are preceding siblings with the same name, a "[numPrecSiblings+1]" part is output.

  • Processing an attribute is simple: first all ancestor-or-self:: elements of its parent are processed in 'path' mode, then the "[@attrName='attrValue']" part is output, followed by a NL character.

Do note:

  • Names that are in a namespace are displayed without any problem and in their initial readable form.

  • To aid readability, an index of [1] is never displayed.


Below is my initial answer (may be ignored)

Here is a pure XSLT 1.0 solution:

Below is a sample xml document and a stylesheet that takes a node-set parameter and produces one valid XPath expression for every member-node.

stylesheet (buildPath.xsl):


<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:msxsl="urn:schemas-microsoft-com:xslt" 
>

<xsl:output method="text"/>
<xsl:variable name="theParmNodes" select="//namespace::*[local-name() =
'myNamespace']"/>
<xsl:template match="/">
  <xsl:variable name="theResult">
    <xsl:for-each select="$theParmNodes">
    <xsl:variable name="theNode" select="."/>
    <xsl:for-each select="$theNode |
$theNode/ancestor-or-self::node()[..]">
      <xsl:element name="slash">/</xsl:element>
      <xsl:choose>
        <xsl:when test="self::*">           
          <xsl:element name="nodeName">
            <xsl:value-of select="name()"/>
            <xsl:variable name="thisPosition" 
                select="count(preceding-sibling::*[name(current()) = 
                        name()])"/>
            <xsl:variable name="numFollowing" 
                select="count(following-sibling::*[name(current()) = 
                        name()])"/>
            <xsl:if test="$thisPosition + $numFollowing > 0">
              <xsl:value-of select="concat('[', $thisPosition +
                                                           1, ']')"/>
            </xsl:if>
          </xsl:element>
        </xsl:when>
        <xsl:otherwise> <!-- This node is not an element -->
          <xsl:choose>
            <xsl:when test="count(. | ../@*) = count(../@*)">   
            <!-- Attribute -->
              <xsl:element name="nodeName">
                <xsl:value-of select="concat('@',name())"/>
              </xsl:element>
            </xsl:when>     
            <xsl:when test="self::text()">  <!-- Text -->
              <xsl:element name="nodeName">
                <xsl:value-of select="'text()'"/>
                <xsl:variable name="thisPosition" 
                          select="count(preceding-sibling::text())"/>
                <xsl:variable name="numFollowing" 
                          select="count(following-sibling::text())"/>
                <xsl:if test="$thisPosition + $numFollowing > 0">
                  <xsl:value-of select="concat('[', $thisPosition + 
                                                           1, ']')"/>
                </xsl:if>
              </xsl:element>
            </xsl:when>     
            <xsl:when test="self::processing-instruction()">
            <!-- Processing Instruction -->
              <xsl:element name="nodeName">
                <xsl:value-of select="'processing-instruction()'"/>
                <xsl:variable name="thisPosition" 
                   select="count(preceding-sibling::processing-instruction())"/>
                <xsl:variable name="numFollowing" 
                    select="count(following-sibling::processing-instruction())"/>
                <xsl:if test="$thisPosition + $numFollowing > 0">
                  <xsl:value-of select="concat('[', $thisPosition + 
                                                            1, ']')"/>
                </xsl:if>
              </xsl:element>
            </xsl:when>     
            <xsl:when test="self::comment()">   <!-- Comment -->
              <xsl:element name="nodeName">
                <xsl:value-of select="'comment()'"/>
                <xsl:variable name="thisPosition" 
                         select="count(preceding-sibling::comment())"/>
                <xsl:variable name="numFollowing" 
                         select="count(following-sibling::comment())"/>
                <xsl:if test="$thisPosition + $numFollowing > 0">
                  <xsl:value-of select="concat('[', $thisPosition + 
                                                            1, ']')"/>
                </xsl:if>
              </xsl:element>
            </xsl:when>     
            <!-- Namespace: -->
            <xsl:when test="count(. | ../namespace::*) = 
                                               count(../namespace::*)">

              <xsl:variable name="apos">'</xsl:variable>
              <xsl:element name="nodeName">
                <xsl:value-of select="concat('namespace::*', 
                '[local-name() = ', $apos, local-name(), $apos, ']')"/>

              </xsl:element>
            </xsl:when>     
          </xsl:choose>
        </xsl:otherwise>            
      </xsl:choose>
    </xsl:for-each>
    <xsl:text>&#xA;</xsl:text>
  </xsl:for-each>
 </xsl:variable>
 <xsl:value-of select="msxsl:node-set($theResult)"/>
</xsl:template>
</xsl:stylesheet>

xml source (buildPath.xml):


<!-- top level Comment -->
<root>
    <nodeA>textA</nodeA>
 <nodeA id="nodeA-2">
  <?myProc ?>
        xxxxxxxx
  <nodeB/>
        <nodeB xmlns:myNamespace="myTestNamespace">
  <!-- Comment within /root/nodeA[2]/nodeB[2] -->
   <nodeC/>
  <!-- 2nd Comment within /root/nodeA[2]/nodeB[2] -->
        </nodeB>
        yyyyyyy
  <nodeB/>
  <?myProc2 ?>
    </nodeA>
</root>
<!-- top level Comment -->

Result:

/root/nodeA[2]/nodeB[2]/namespace::*[local-name() = 'myNamespace']
/root/nodeA[2]/nodeB[2]/nodeC/namespace::*[local-name() = 'myNamespace']
Dimitre Novatchev
  • @Dimitre Novatchev thank you for your answer but I'm looking for a Java approach, +1 for your effort – ant Jan 23 '11 at 14:20
  • Just let Java run the XSLT and collect its results? – BalusC Jan 23 '11 at 15:03
  • @BalusC I could do that but this is not exactly what I've asked, and since I don't know this code I'm more comfortable with code I can update/edit, I updated my question. tnx – ant Jan 23 '11 at 15:08
  • @Dimitre Novatchev Great, it works exactly as I want. I'm really impressed by the small size of the code and what it does. Looks like you know your way around xsl/xml, I'll definitely have to explore xsl. Can you recommend some useful web/book resources for me to go through? I've already bookmarked your blog and seen tons of code there which I don't really get, so I need to start with the basics and work my way to the top. Great, tnx once again. I can accept the bounty in 21h, and I will when that time expires. Thanks for the help – ant Jan 23 '11 at 17:03
  • @c0mrade: You are welcome. Yes, XSLT is a very powerful language. For more resources, please, have a look at my answer to another SO question: http://stackoverflow.com/questions/339930/any-good-xslt-tutorial-book-blog-site-online/341589#341589 – Dimitre Novatchev Jan 23 '11 at 17:06
  • @Dimitre Novatchev, please see BOUNTY UPDATE II, I updated my question. After analyzing a bigger xml file I noticed this one; again I think I didn't give the correct example in my question. Is this a big change in your code? Can you please change it to work with the latest update? I will accept the bounty either way in 5 hours when I'm able to. – ant Jan 24 '11 at 09:25
  • @Dimitre Novatchev absolutely amazing, thanks a million. It works exactly as I planned. I will definitely have to go through the links you suggested. thanks – ant Jan 24 '11 at 15:21
20

Here is how this can be done with SAX:

import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentContentHandler extends DefaultHandler {

    private String xPath = "/";
    private XMLReader xmlReader;
    private FragmentContentHandler parent;
    private StringBuilder characters = new StringBuilder();
    private Map<String, Integer> elementNameCount = new HashMap<String, Integer>();

    public FragmentContentHandler(XMLReader xmlReader) {
        this.xmlReader = xmlReader;
    }

    private FragmentContentHandler(String xPath, XMLReader xmlReader, FragmentContentHandler parent) {
        this(xmlReader);
        this.xPath = xPath;
        this.parent = parent;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        Integer count = elementNameCount.get(qName);
        if(null == count) {
            count = 1;
        } else {
            count++;
        }
        elementNameCount.put(qName, count);
        String childXPath = xPath + "/" + qName + "[" + count + "]";

        int attsLength = atts.getLength();
        for(int x=0; x<attsLength; x++) {
            // Close the quoted attribute value before the closing bracket
            System.out.println(childXPath + "[@" + atts.getQName(x) + "='" + atts.getValue(x) + "']");
        }

        FragmentContentHandler child = new FragmentContentHandler(childXPath, xmlReader, this);
        xmlReader.setContentHandler(child);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String value = characters.toString().trim();
        if(value.length() > 0) {
            System.out.println(xPath + "='" + characters.toString() + "'");
        }
        xmlReader.setContentHandler(parent);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        characters.append(ch, start, length);
    }

}

It can be tested with:

import java.io.FileInputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class Demo {

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader xr = sp.getXMLReader();

        xr.setContentHandler(new FragmentContentHandler(xr));
        xr.parse(new InputSource(new FileInputStream("input.xml")));
    }
}

This will produce the desired output:

//root[1]/elemA[1]='one'
//root[1]/elemA[2][@attribute1='first']
//root[1]/elemA[2][@attribute2='second']
//root[1]/elemA[2]='two'
//root[1]/elemB[1]='three'
//root[1]/elemA[3]='four'
//root[1]/elemC[1]/elemB[1]='five'
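
The same idea, one counter map per open element, also carries over to a pull parser. Here is a rough StAX sketch using `javax.xml.stream` from the JDK. The class name `StaxXPathLister` is illustrative, paths start with a single slash, and element local names are used, so prefixed names are not handled.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxXPathLister {

    public static List<String> list(String xml) {
        List<String> out = new ArrayList<>();
        Deque<String> paths = new ArrayDeque<>();                  // XPath of each open element
        Deque<Map<String, Integer>> counters = new ArrayDeque<>(); // same-name child counts per open element
        Deque<StringBuilder> texts = new ArrayDeque<>();           // text collected per open element
        paths.push("");
        counters.push(new HashMap<>());
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String name = reader.getLocalName();
                        // Position among same-named siblings, tracked by the parent's map.
                        int index = counters.peek().merge(name, 1, Integer::sum);
                        String path = paths.peek() + "/" + name + "[" + index + "]";
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            out.add(path + "[@" + reader.getAttributeLocalName(i)
                                    + "='" + reader.getAttributeValue(i) + "']");
                        }
                        paths.push(path);
                        counters.push(new HashMap<>());
                        texts.push(new StringBuilder());
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (!texts.isEmpty()) {
                            texts.peek().append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT: {
                        String value = texts.pop().toString().trim();
                        String path = paths.pop();
                        counters.pop();
                        if (!value.isEmpty()) {
                            out.add(path + "='" + value + "'");
                        }
                        break;
                    }
                    default:
                        break;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return out;
    }
}
```

Because each open element gets a fresh map, an input such as `<a><a>foo</a></a>` yields `/a[1]/a[1]='foo'`, with the index counted among siblings rather than across the whole document.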
bdoughan
  • Nice one :) All we need now is a StAX implementation and we'll have the full set. – biziclop Jan 24 '11 at 14:39
  • +1 for your effort, I second biziclop's comment, someone could find it to be useful in the future – ant Jan 24 '11 at 15:26
  • Wait a minute... `elementNameCount` counts occurrences of a particular element type (name) globally across the document, regardless of whether they are siblings, cousins (same level but different parent), or on different levels. But you output XPath `"[" + count + "]"` as if we're counting position among siblings. This will clearly fail for nontrivial documents. Right? E.g. `<a><a>foo</a></a>` would output `//a[1]/a[2]='foo'`, and the `[2]` is incorrect. – LarsH Feb 15 '12 at 15:47
  • @BlaiseDoughan Can you please have a look at this question - http://stackoverflow.com/questions/10698287/xpath-transformation-not-working-in-java . I am using xml signatures in java and for that I have to extract the part to be signed by using xpath. But it just doesn't work. – Ashwin May 23 '12 at 07:57
  • @LarsH no it's not, because there's a new FragmentContentHandler created at each startElement transition with its own elementNameCount registry. This should work correctly, but I have to try it myself. – NagyI Nov 03 '16 at 09:38
  • @Nagyl: You may be right. I haven't looked at this in 3.5 years. :-) Let us know if you test it. – LarsH Nov 03 '16 at 17:10
13

With jOOX (a port of the jQuery API to Java; disclaimer: I work for the company behind the library), you can almost achieve what you want in a single statement:

// I'm assuming this:
import static org.joox.JOOX.$;

// And then...
List<String> coolList = $(document).xpath("//*[not(*)]").map(
    context -> $(context).xpath() + "='" + $(context).text() + "'"
);

If document is your sample document:

<root>
    <elemA>one</elemA>
    <elemA attribute1='first' attribute2='second'>two</elemA>
    <elemB>three</elemB>
    <elemA>four</elemA>
    <elemC>
        <elemB>five</elemB>
    </elemC>
</root>

This will produce

/root[1]/elemA[1]='one'
/root[1]/elemA[2]='two'
/root[1]/elemB[1]='three'
/root[1]/elemA[3]='four'
/root[1]/elemC[1]/elemB[1]='five'

By "almost", I mean that jOOX does not (yet) support matching/mapping attributes. Hence, your attributes will not produce any output. This will be implemented in the near future, though.
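
Until then, the attribute lines can be produced with the plain JDK DOM and XPath APIs. The following is only a sketch and does not use jOOX at all. The class `AttributeXPaths` and its methods are illustrative names. It selects every attribute with `//@*` and builds the owner element's path by walking up to the root.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttributeXPaths {

    // Builds a /a[1]/b[2]-style path by walking from the element up to the root.
    static String pathOf(Element element) {
        StringBuilder path = new StringBuilder();
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            int index = 1;
            // Count preceding siblings that are elements with the same name.
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    // Returns one "path[@name='value']" line per attribute in the document.
    public static List<String> list(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//@*", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr attr = (Attr) attrs.item(i);
                result.add(pathOf(attr.getOwnerElement())
                        + "[@" + attr.getName() + "='" + attr.getValue() + "']");
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not list attributes", e);
        }
    }
}
```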

Lukas Eder
  • Can you please have a look at this question - http://stackoverflow.com/questions/10698287/xpath-transformation-not-working-in-java . I am using xml signatures in java and for that I have to extract the part to be signed by using xpath. But it just doesn't work – Ashwin May 23 '12 at 07:58
  • @Ashwin: I'm sorry, I don't have any experience with "XPath transformation". I don't recognise that library you're using there – Lukas Eder May 23 '12 at 11:19
  • what's with the dollar sign `$`? That's legal Java?! – Jason S Dec 01 '15 at 18:29
  • @JasonS It's a legal identifier, yes. It's static-imported from `JOOX.$`. I'll update the answer – Lukas Eder Dec 01 '15 at 18:43
  • This works great but not on large XML files. Any recommendations? – Brian T Hannan Feb 03 '17 at 17:32
  • @BrianTHannan: You could implement a SAX handler – Lukas Eder Feb 03 '17 at 17:49
4
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class XPathEntries {

    // Adds one entry per attribute and per non-blank text node, then recurses into child elements.
    private static void buildEntryList( List<String> entries, String parentXPath, Element parent ) {
        NamedNodeMap attrs = parent.getAttributes();
        for( int i = 0; i < attrs.getLength(); i++ ) {
            Attr attr = (Attr)attrs.item( i );
            //TODO: escape attr value
            entries.add( parentXPath+"[@"+attr.getName()+"='"+attr.getValue()+"']");
        }
        // Counts same-named children so each one gets its 1-based index among its siblings.
        HashMap<String, Integer> nameMap = new HashMap<String, Integer>();
        NodeList children = parent.getChildNodes();
        for( int i = 0; i < children.getLength(); i++ ) {
            Node child = children.item( i );
            if( child instanceof Text ) {
                // Skip whitespace-only text nodes (indentation between elements).
                String data = ((Text)child).getData();
                if( !data.trim().isEmpty() ) {
                    //TODO: escape child value
                    entries.add( parentXPath+"='"+data+"'" );
                }
            } else if( child instanceof Element ) {
                String childName = child.getNodeName();
                Integer nameCount = nameMap.get( childName );
                nameCount = nameCount == null ? 1 : nameCount + 1;
                nameMap.put( childName, nameCount );
                buildEntryList( entries, parentXPath+"/"+childName+"["+nameCount+"]", (Element)child);
            }
        }
    }

    public static List<String> getEntryList( Document doc ) {
        ArrayList<String> entries = new ArrayList<String>();
        Element root = doc.getDocumentElement();
        buildEntryList(entries, "/"+root.getNodeName()+"[1]", root );
        return entries;
    }
}

This code works with two assumptions: you aren't using namespaces and there are no mixed content elements. The namespace limitation isn't a serious one, but it'd make your XPath expression much harder to read, as every element would be something like *:<name>[namespace-uri()='<nsuri>'][<index>], but otherwise it's easy to implement. Mixed content on the other hand would make the use of xpath very tedious, as you'd have to be able to individually address the second, third and so on text node within an element.
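
The two "TODO: escape" comments in the code above could be handled along these lines. XPath 1.0 has no escape character inside string literals, so a value containing both quote kinds has to be assembled with `concat()`. The helper below is a sketch, and `XPathLiterals.quote` is a name I made up for illustration.

```java
class XPathLiterals {

    // Turns an arbitrary string into a valid XPath 1.0 string expression.
    public static String quote(String value) {
        if (!value.contains("'")) {
            return "'" + value + "'";          // no apostrophe: single quotes are safe
        }
        if (!value.contains("\"")) {
            return "\"" + value + "\"";        // no double quote: double quotes are safe
        }
        // Both quote kinds present: split on the apostrophe and glue the
        // pieces back together with concat(), inserting "'" between them.
        StringBuilder sb = new StringBuilder("concat(");
        String[] parts = value.split("'", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(", \"'\", ");
            }
            sb.append("'").append(parts[i]).append("'");
        }
        return sb.append(")").toString();
    }
}
```

With such a helper, an entry would be built as `parentXPath + "=" + XPathLiterals.quote(data)` instead of wrapping the value in apostrophes by hand.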

biziclop
2
  1. Use w3c.dom.
  2. Go recursively down the tree.
  3. For each node there is an easy way to get its xpath: either by storing it as an array/list during step 2, or via a function which goes recursively up until the parent is null, then reverses the array/list of encountered nodes.

Something like that.

UPD: and concatenate the final list to get the final xpath. I don't think attributes will be a problem.

andbi
1

I've done a similar task once. The main idea used was that you can use indexes of the element in the xpath. For example in the following xml

<root>
    <el />
    <something />
    <el />
</root>

the xpath to the second <el/> will be /root[1]/el[2] (xpath indexes are 1-based). This reads as "take the first root, then take the second one from all elements with the name el". So the element something does not affect the indexing of the el elements. In theory you can therefore create an xpath for each specific element in your xml. In practice I accomplished this by walking the tree recursively and remembering information about elements and their indexes along the way.
Creating an xpath that references a specific attribute of the element was then just a matter of appending '/@attrName' to the element's xpath.
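
To see the indexing rule in action, here is a small sketch with the JDK's `javax.xml.xpath` API. The `id` attributes are my addition to the example above, there only to tell the two el elements apart, and `IndexedXPathDemo` is an illustrative name.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class IndexedXPathDemo {

    // Evaluates an XPath expression against XML text and returns its string value.
    public static String valueOf(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Same shape as the example above, with hypothetical id attributes added.
        String xml = "<root><el id='first'/><something/><el id='second'/></root>";
        // el[2] is the second el element, even though it is the third child of root.
        System.out.println(valueOf(xml, "/root[1]/el[2]/@id")); // second
        System.out.println(valueOf(xml, "/root[1]/el[1]/@id")); // first
    }
}
```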

alpha-mouse
1

I have written a method to return the absolute path of an element in the Practical XML library. To give you an idea of how it works, here's an extract from one of the unit tests:

assertEquals("/root/wargle[2]/zargle",
             DomUtil.getAbsolutePath(child3a)); 

So, you could recurse through the document, apply your tests, and use this to return the XPath. Or, what is probably better, is that you could use the XPath-based assertions from that same library.

kdgregory
1

I did the exact same thing last week for processing my xml to solr compliant format.

Since you wanted pseudocode, this is how I accomplished that.

// You can skip the reference to parent and child.

1_ Initialize a custom node object: NodeObjectVO {String nodeName, String path, List attr, NodeObjectVO parent, List child}

2_ Create an empty list

3_ Create a DOM representation of the xml and iterate through its nodes. For each node, get the corresponding information. All the information, like the node name and the attribute names and values, should be readily available from the DOM object. (You need to check the DOM NodeType; the code should ignore processing instructions and plain text nodes.)

// Code bloat warning.

4_ The only tricky part is getting the path. I created an iterative utility method to get the xpath string from a node: while (node.parent != null) { path = "/" + node.nodeName + path; node = node.parent; }

(You can also achieve this by maintaining a global path variable that keeps track of the parent path for each iteration.)

5_ In the setter method setAttributes(List), I append all the available attributes to the object's path. (That is one path with all available attributes, not a list of paths with each possible combination of attributes. You might want to do it some other way.)

6_ Add the NodeObjectVO to the list.

7_ Now we have a flat (not hierarchical) list of custom node objects that have all the information I need.

(Note: As I mentioned, I maintain the parent/child relationship; you should probably skip that part. There is a possibility of code bloat, especially in getting the parent path. For small xml this was not a problem, but it is a concern for large xml.)
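
A compact Java sketch of the steps above, under a few assumptions: `NodeInfo` is a simplified stand-in for NodeObjectVO (no parent/child links, as suggested), the path is passed down the recursion instead of being rebuilt by walking up, and the paths carry no sibling indexes, matching the description. All names here are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FlatNodeList {

    // Simplified stand-in for the NodeObjectVO described above.
    static final class NodeInfo {
        final String nodeName;
        final String path;

        NodeInfo(String nodeName, String path) {
            this.nodeName = nodeName;
            this.path = path;
        }
    }

    // Steps 2, 3, 6 and 7: parse the XML and return a flat list of node objects.
    public static List<NodeInfo> flatten(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<NodeInfo> result = new ArrayList<>();
            collect(doc.getDocumentElement(), "", result);
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    private static void collect(Element element, String parentPath, List<NodeInfo> out) {
        // Step 4: the path made of element names only, handed down to the children.
        String namePath = parentPath + "/" + element.getNodeName();
        // Step 5: one path carrying all attributes of this element.
        StringBuilder path = new StringBuilder(namePath);
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            path.append("[@").append(attr.getName())
                .append("='").append(attr.getValue()).append("']");
        }
        out.add(new NodeInfo(element.getNodeName(), path.toString()));
        // Only element children are visited; text and processing instructions are ignored.
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                collect((Element) child, namePath, out);
            }
        }
    }
}
```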

uncaught_exceptions