
Background

Many questions ask how to obtain a particular DOM element given a CSS selector. This question is about the opposite direction. A document is parsed with jsoup, but could easily be converted to any of:

Use Case

For a particular problem domain (e.g., chemical compounds), thousands of web pages list chemicals in similar ways, but the mark-up differs across web sites. For example:

<div id="chemical-list">
  <div class="compound">
    <span class="compound-name">water</span>
    <span class="compound-periodic">H2O</span>
  </div>
  <div class="compound">
    <span class="compound-name">sodium hypochlorite</span>
    <span class="compound-periodic">NaClO</span>
  </div>
</div>

Another site might list them differently:

<ul class="chemical-compound">
  <li class="chem-name">water, H2O</li>
  <li class="chem-name">sodium hypochlorite, NaClO</li>
</ul>

Yet another site might, again, use different markup:

<table border="0" cellpadding="0" cellspacing="0">
  <tbody>
    <tr><td>water</td><td>H2O</td></tr>
    <tr><td>sodium hypochlorite</td><td>NaClO</td></tr>
  </tbody>
</table>

A few sample pages from each of the thousands of sites are downloaded. Then, using an existing list of chemicals, it is straightforward to retrieve a list of candidate web page elements. With jsoup, this takes one call:

  Elements elements = chemicals.getElementsMatchingOwnText( chemicalNames );

Given those candidate elements, a CSS selector for the list would allow high-precision analysis across thousands of pages. (The page can discuss the applications for water and sodium hypochlorite, but only the list is being analyzed.) Knowing the CSS selector will greatly simplify the analysis and increase its accuracy.

The alternative is to process the entire page looking for "groups" of chemicals, then try to extract the list. Both problems are difficult, but using a CSS selector to jump to the exact spot in the page is far more efficient, and likely far more accurate. Both problems will require some hand-crafting, but I'd like to automate away as much as possible.

Problem

The aforementioned APIs do not appear to have methods that generate a CSS selector given an Element instance (the more unique the better). It is possible to iterate through the parent elements and generate the selector manually. This has been demonstrated using JavaScript in a few questions. There are also answers for generating an XPath, and it might be possible using Selenium.

Specifically, how would you do something like:

String selector = element.getCSSPath();
Elements elements = document.select( selector );

This would:

  1. Return the CSS selector for the given element.
  2. Search a document for the given CSS selector.
  3. Return a list of elements that match the selector.

The second line is not an issue; the first line is problematic.

Question

What API can generate a CSS selector (as unique as possible) from a DOM element?

If there is no existing API, then that would be nice to know.

Dave Jarvis
  • Jsoup doesn't provide this, but if it did, the most unique selector would be a selector that uses `>` and `:eq()` to mimic an XPath expression. It's not clear what the use of that would be -- it will select precisely that element and nothing more, so your sample code would be useless. What's your actual use case for such an API? – Jeffrey Bosboom Sep 21 '14 at 20:08

4 Answers


As of version 1.8.1 (released 2014-09-28), jsoup provides this functionality (thanks to a pull request) through the method Element.cssSelector().

cssSelector

public String cssSelector() - Get a CSS selector that will uniquely select this element. If the element has an ID, returns #id; otherwise returns the parent (if any) CSS selector, followed by '>', followed by a unique selector for the element (tag.class.class:nth-child(n)).

Returns: the CSS Path that can be used to retrieve the element in a selector.

This returns selectors that return a unique element by using the element ID if present, otherwise creating a selector of the form tag.class.class:nth-child(n).

For example: "html > body > h2.section:nth-child(3)"
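
For anyone stuck on an older jsoup or a different DOM, the rule described in that Javadoc can be sketched against the JDK's built-in org.w3c.dom API. This is only an illustration, not jsoup's implementation: the class name UniqueSelectorSketch, the parse helper, and the choice to append :nth-child(n) only when a sibling shares the same tag name are my own assumptions, and the input is assumed to be well-formed XHTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class UniqueSelectorSketch {

  /** Hypothetical helper: parses well-formed XHTML; failures become unchecked. */
  public static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Input is not well-formed XHTML", e);
    }
  }

  /** Returns "#id", or the parent's selector + " > " + tag.class[:nth-child(n)]. */
  public static String uniqueSelector(Element element) {
    // An ID is assumed to be unique, so it ends the walk.
    String id = element.getAttribute("id");
    if (!id.isEmpty()) {
      return "#" + id;
    }

    // Build "tag.class1.class2" for this element.
    StringBuilder step = new StringBuilder(element.getTagName());
    String classes = element.getAttribute("class").trim();
    if (!classes.isEmpty()) {
      for (String c : classes.split("\\s+")) {
        step.append('.').append(c);
      }
    }

    // The root element's parent is the Document, which is not an Element.
    Node parent = element.getParentNode();
    if (!(parent instanceof Element)) {
      return step.toString();
    }

    // Find the 1-based position among element children (what :nth-child
    // counts) and how many siblings share this element's tag name.
    int position = 0;
    int seen = 0;
    int sameTag = 0;
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (n.getNodeType() != Node.ELEMENT_NODE) {
        continue;
      }
      seen++;
      if (n.isSameNode(element)) {
        position = seen;
      }
      if (((Element) n).getTagName().equals(element.getTagName())) {
        sameTag++;
      }
    }

    // Assumption of this sketch: disambiguate only when the tag repeats.
    if (sameTag > 1) {
      step.append(":nth-child(").append(position).append(')');
    }
    return uniqueSelector((Element) parent) + " > " + step;
  }

  public static void main(String[] args) {
    Document doc = parse("<html><body><h1>t</h1>"
        + "<h2 class=\"section\">a</h2><h2 class=\"section\">b</h2></body></html>");
    Element second = (Element) doc.getElementsByTagName("h2").item(1);
    System.out.println(uniqueSelector(second));
  }
}
```

For the second h2 in that sample the sketch prints html > body > h2.section:nth-child(3), which has the same shape as the jsoup example above.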

pringi
James Fry

Just use Java's actual JavaScript engine and run some plain JavaScript?

function getSelector(element) {
  var selector = element.id;

  // if we have an ID, that's all we need. IDs are unique. The end.
  // (element.id is a string, so test it directly rather than selector.id)
  if(selector) { return "#" + selector; }

  selector = [];
  var cl;
  while(element.parentNode) {
    cl = element.getAttribute("class");
    cl = cl ? "." + cl.trim().replace(/ +/g,'.') : '';
    selector.push(element.localName + cl);
    element = element.parentNode;
  }
  return selector.reverse().join(' ');
}

And let's verify that against

<div class="main">
  <ul class=" list of things">
    <li><a href="moo" class="link">lol</a></li>
  </ul>
</div>

with

var a = document.querySelector("a");
console.log(getSelector(a));

http://jsfiddle.net/c8k6Lxtj/ -- result: html body div.main ul.list.of.things li a.link... gold.
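
If running a script engine isn't convenient, the same ancestor walk can be sketched directly in Java. This is a rough port under my own assumptions, not part of the fiddle: it uses the JDK's org.w3c.dom API instead of a browser DOM, the class name DescendantPathSketch and the parse helper are hypothetical, and the markup must be well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class DescendantPathSketch {

  /** Hypothetical helper: parses well-formed XML; failures become unchecked. */
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Input is not well-formed XML", e);
    }
  }

  /** Returns "#id" if present, else "tag.class tag.class ..." from the root down. */
  public static String getSelector(Element element) {
    String id = element.getAttribute("id");
    if (!id.isEmpty()) {
      return "#" + id;
    }

    // Walk up through the ancestors, prepending each step so the
    // outermost element ends up first.
    Deque<String> steps = new ArrayDeque<>();
    for (Node n = element; n instanceof Element; n = n.getParentNode()) {
      Element e = (Element) n;
      String cl = e.getAttribute("class").trim();
      String suffix = cl.isEmpty() ? "" : "." + cl.replaceAll("\\s+", ".");
      steps.addFirst(e.getTagName() + suffix);
    }
    return String.join(" ", steps);
  }

  public static void main(String[] args) {
    Document doc = parse("<div class=\"main\"><ul class=\" list of things\">"
        + "<li><a href=\"moo\" class=\"link\">lol</a></li></ul></div>");
    Element a = (Element) doc.getElementsByTagName("a").item(0);
    System.out.println(getSelector(a));
  }
}
```

Because the fragment is parsed on its own, there is no html or body wrapper, so this prints div.main ul.list.of.things li a.link.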

Mike 'Pomax' Kamermans

I used Mike's answer with the following change to make the returned CSS selector shorter.

Update: The name attribute is also used to shorten the CSS selector, and each iteration checks whether the selector built so far matches a single element on the page.

Update: As @10basetom pointed out in the comments, when the element has no unique id, unique class name, or unique class name plus name attribute, the method might produce a non-unique CSS path; in other cases it produces the shortest CSS selector. So I suggest checking the result with document.querySelectorAll(result).length === 1 and falling back on the other methods described here.

function getShortestSelector(element) {
    var selector = element.id;

    // if we have an ID, that's all we need. IDs are unique. The end.
    // (element.id is a string, so test it directly rather than selector.id)
    if(selector) {
        return "#" + selector;
    }

    selector = [];
    var cl, name;
    while(element.parentNode && (selector.length === 0 || document.querySelectorAll(selector.join(' ')).length !== 1)) {

        // if exist, add the first found id and finish building the selector
        var id = element.getAttribute("id");
        if (id) {
            selector.unshift("#" + id);
            break;
        }

        cl = element.getAttribute("class");
        cl = cl ? "." + cl.trim().replace(/ +/g,'.') : '';
        name = element.getAttribute("name");
        // quote the value so names that are not plain identifiers stay valid
        name = name ? ('[name="' + name.trim() + '"]') : '';
        selector.unshift(element.localName + cl + name);
        element = element.parentNode;
    }

    var result = selector[0];
    if (selector.length > 1) {
        result += " " + selector.slice(1).join(" ").replace(/\[name=[^\]]*]/g, '');
    }

    return result;
}
Arik
  • This won't work in many cases. For example, see `getSelector1()` console output here: https://codepen.io/thdoan/pen/WjVRyG?editors=1111 – thdoan May 31 '17 at 10:19

As far as I can tell, no APIs offer this functionality. The following appears to work:

  // Requires org.jsoup.nodes.Document and org.jsoup.nodes.Element.

  /**
   * Returns a short CSS path that identifies a given element. Note that the
   * selector is not guaranteed to match a unique element, but can be used to
   * obtain all elements that match the selector returned.
   * 
   * @param cssElement The element that must be identified by its CSS selector.
   * @return The CSS selector for the given element, or the empty string if
   * no selector is found.
   */
  private String cssPath( Element cssElement ) {
    // The Document node is a container rather than a selectable element, so
    // the chain of recursion ends here (as it does for a missing parent).
    if( cssElement == null || cssElement instanceof Document ) {
      return "";
    }

    StringBuilder result = new StringBuilder( 256 );

    String id = cssElement.id();

    // If the element has an ID, then return it as the shortest path (IDs are
    // supposed to be unique).
    if( !id.isEmpty() ) {
      // This will break the chain of recursion.
      result.append( '#' ).append( id );
    }
    else {
      // Recurse to determine the parent's CSS path, if there is one.
      String parentPath = cssPath( cssElement.parent() );

      if( !parentPath.isEmpty() ) {
        result.append( parentPath ).append( " > " );
      }

      // Always emit the tag name, followed by any classes. Skipping elements
      // that have no class would break the direct-child ('>') chain.
      result.append( cssElement.tagName() );
      cssElement.classNames().forEach( c -> result.append( '.' ).append( c ) );
    }

    // Return the CSS selector built through recursion.
    return result.toString();
  }
Dave Jarvis