5

Hi, this is my web page:

<html>
    <head>
    </head>
    <body>
        <div> text div 1</div>
        <div>
            <span>text of first span </span>
            <span>text of second span </span>
        </div>
        <div> text div 3 </div>
    </body>
</html>

I'm using jsoup to parse it, then walk through all the elements in the page and get their paths:

Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList<String> all = new ArrayList<>();
for (Element element : elements) {
    // only report elements that carry text of their own
    if (!element.ownText().isEmpty()) {

        StringBuilder path = new StringBuilder(element.nodeName());
        String value = element.ownText();
        Elements p_el = element.parents();

        // parents() starts at the closest ancestor, so prepend each name
        for (Element el : p_el) {
            path.insert(0, el.nodeName() + '/');
        }
        all.add(path + " = " + value + "\n");
        System.out.println(path + " = " + value);
    }
}

return all;

My code gives me this result:

html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3

What I actually want is a result like this:

html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3

Could anyone give me an idea of how to get this result? :) Thanks in advance.

Stephan
kivok94

4 Answers

2

As requested, here is an idea, although I'm fairly sure there are better ways to get the XPath for a given node, for example using XSLT as in the answer to "Generate/get xpath from XML node java".

Here is a possible solution based on your current attempt.

For each element (and each of its parents), check whether there is more than one sibling element with the same name. In XPath-style pseudocode (not literal jsoup selector syntax, since select() takes CSS queries): if (count(el.select('../' + el.nodeName())) > 1)
If so, count the preceding siblings with the same name and add 1:
count(el.select('preceding-sibling::' + el.nodeName())) + 1
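
The counting idea in this answer can be sketched in plain Java. To keep the example runnable without any dependency, it uses a tiny hypothetical `Node` class instead of jsoup's `Element`; the names `IndexedPathSketch`, `Node`, `step`, `pathOf`, `describe` and `sampleTree` are all made up for illustration. With jsoup, the same logic would presumably read `parent()`, `children()`, `tagName()` and `ownText()` from `Element` instead.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexedPathSketch {

    /** Hypothetical stand-in for a parsed element; this is not a jsoup class. */
    public static final class Node {
        final String tag;
        final String ownText;
        Node parent;
        final List<Node> children = new ArrayList<>();

        public Node(String tag, String ownText) {
            this.tag = tag;
            this.ownText = ownText;
        }

        /** Attaches the given children and returns this node for chaining. */
        public Node add(Node... kids) {
            for (Node kid : kids) {
                kid.parent = this;
                children.add(kid);
            }
            return this;
        }
    }

    /** Returns "tag", or "tag[n]" when the parent has several children with that tag. */
    static String step(Node node) {
        if (node.parent == null) {
            return node.tag;
        }
        int total = 0;     // same-tag siblings seen so far, including the node itself
        int position = 0;  // 1-based position of the node among them
        for (Node sibling : node.parent.children) {
            if (sibling.tag.equals(node.tag)) {
                total++;
                if (sibling == node) {
                    position = total;
                }
            }
        }
        return total > 1 ? node.tag + "[" + position + "]" : node.tag;
    }

    /** Builds the path from the root down to the node by prepending each ancestor's step. */
    public static String pathOf(Node node) {
        StringBuilder path = new StringBuilder(step(node));
        for (Node p = node.parent; p != null; p = p.parent) {
            path.insert(0, step(p) + "/");
        }
        return path.toString();
    }

    /** Collects "path = text" lines, in document order, for nodes that have own text. */
    public static List<String> describe(Node root) {
        List<String> lines = new ArrayList<>();
        collect(root, lines);
        return lines;
    }

    private static void collect(Node node, List<String> lines) {
        if (!node.ownText.isEmpty()) {
            lines.add(pathOf(node) + " = " + node.ownText);
        }
        for (Node child : node.children) {
            collect(child, lines);
        }
    }

    /** Mirrors the page from the question. */
    public static Node sampleTree() {
        return new Node("html", "").add(
            new Node("head", ""),
            new Node("body", "").add(
                new Node("div", "text div 1"),
                new Node("div", "").add(
                    new Node("span", "text of first span"),
                    new Node("span", "text of second span")),
                new Node("div", "text div 3")));
    }

    public static void main(String[] args) {
        describe(sampleTree()).forEach(System.out::println);
    }
}
```

Because the index is only added when a tag occurs more than once under the same parent, `html` and `body` stay unindexed, which matches the output format the question asks for.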

Community
hr_117
1

This is my solution to this problem:

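StringBuilder absPath = new StringBuilder();
Elements parents = htmlElement.parents();

// parents() lists the closest ancestor first, so walk it backwards to start at the root.
// Only the ancestors are appended here; htmlElement itself is not part of the result.
for (int j = parents.size() - 1; j >= 0; j--) {
    Element element = parents.get(j);
    absPath.append("/");
    absPath.append(element.tagName());
    absPath.append("[");
    // siblingIndex() is the 0-based position among all sibling nodes (text nodes included),
    // not the position among siblings with the same tag
    absPath.append(element.siblingIndex());
    absPath.append("]");
}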
techspider
0

This would be easier if you traversed the document from the root to the leaves instead of the other way round. That way you can easily group the elements by tag name and handle multiple occurrences accordingly. Here is a recursive approach:

private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();

public List<String> getAll() {
    return Collections.unmodifiableList(all);
}

public void parse(Document doc) {
    path.clear();
    all.clear();
    parse(doc.children());
}

private void parse(List<Element> elements) {
    if (elements.isEmpty()) {
        return;
    }
    Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));

    for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
        List<Element> list = entry.getValue();
        String key = entry.getKey();
        if (list.size() > 1) {
            int index = 1;
            // use paths with index
            key += "[";
            for (Element e : list) {
                path.add(key + (index++) + "]");
                handleElement(e);
                path.remove(path.size() - 1);
            }
        } else {
            // use paths without index
            path.add(key);
            handleElement(list.get(0));
            path.remove(path.size() - 1);
        }
    }

}

private void handleElement(Element e) {
    String value = e.ownText();
    if (!value.isEmpty()) {
        // add entry
        all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
    }
    // process children of element
    parse(e.children());
}
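
One detail of the grouping step may be worth a small sketch: the single-argument `Collectors.groupingBy` makes no guarantee about the type or iteration order of the map it returns, so the order in which tag groups are visited can vary. Assuming you want groups visited in the order each tag first appears, passing `LinkedHashMap::new` as the map factory is one option. The class and method names below (`GroupingOrderSketch`, `groupSteps`) are hypothetical, and plain tag strings stand in for elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingOrderSketch {

    /**
     * Groups child tag names and turns each group into path steps:
     * "tag[1]", "tag[2]", ... when the tag occurs several times, plain "tag" otherwise.
     */
    public static Map<String, List<String>> groupSteps(List<String> childTags) {
        // LinkedHashMap keeps the groups in first-occurrence order
        Map<String, List<String>> grouped = childTags.stream()
            .collect(Collectors.groupingBy(tag -> tag, LinkedHashMap::new, Collectors.toList()));

        Map<String, List<String>> steps = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            int size = entry.getValue().size();
            List<String> labels = new ArrayList<>();
            for (int i = 1; i <= size; i++) {
                labels.add(size > 1 ? entry.getKey() + "[" + i + "]" : entry.getKey());
            }
            steps.put(entry.getKey(), labels);
        }
        return steps;
    }

    public static void main(String[] args) {
        // prints {div=[div[1], div[2]], span=[span]}
        System.out.println(groupSteps(Arrays.asList("div", "span", "div")));
    }
}
```

Note that grouping still visits all elements of one tag before moving to the next tag, so with mixed siblings the results come out grouped by tag rather than in strict document order.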
fabian
  • Your answer is close to what I want. I'll only need to make some changes and it will work perfectly, because right now it gives a result like this – kivok94 Mar 26 '16 at 11:28
  • div[1] = text div 1 div[2]/span[1] = text of first span div[2]/span[2] = text of second span div[3] = text div 2 body/div[1] = text div 1 body/div[2]/span[1] = text of first span body/div[2]/span[2] = text of second span body/div[3] = text div 2 span[1] = text of first span span[2] = text of second span – kivok94 Mar 26 '16 at 11:28
0

Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.

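fun Element.xpath(): String = buildString {
    val parents = parents()

    // parents() lists the closest ancestor first, so iterate backwards from the root
    for (j in (parents.size - 1) downTo 0) {
        val parent = parents[j]
        append("/*[")
        // elementSiblingIndex() counts element siblings only, which is what *[n] counts in XPath;
        // siblingIndex() would also count text nodes such as whitespace between tags
        append(parent.elementSiblingIndex() + 1)
        append(']')
    }

    append("/*[")
    append(elementSiblingIndex() + 1)
    append(']')
}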
spierce7