Background
Many questions ask how to obtain a particular DOM element given a CSS selector. This question is about the opposite direction. A document is parsed with jsoup, but could easily be converted to any of:
Use Case
For a particular problem domain (e.g., chemical compounds), thousands of web pages list chemicals in similar ways, but the mark-up differs across web sites. For example:
<div id="chemical-list">
  <div class="compound">
    <span class="compound-name">water</span>
    <span class="compound-periodic">H2O</span>
  </div>
  <div class="compound">
    <span class="compound-name">sodium hypochlorite</span>
    <span class="compound-periodic">NaClO</span>
  </div>
</div>
Another site might list them differently:
<ul class="chemical-compound">
  <li class="chem-name">water, H2O</li>
  <li class="chem-name">sodium hypochlorite, NaClO</li>
</ul>
Yet another site might, again, use different markup:
<table border="0" cellpadding="0" cellspacing="0">
  <tbody>
    <tr><td>water</td><td>H2O</td></tr>
    <tr><td>sodium hypochlorite</td><td>NaClO</td></tr>
  </tbody>
</table>
A few sample pages from each of the thousands of sites are downloaded. Then, given an existing list of chemicals, retrieving a list of candidate web page elements is relatively simple. Using jsoup:
Elements elements = chemicals.getElementsMatchingOwnText( chemicalNames );
From those candidate elements, I want to derive the CSS selector that locates the list on each site. That will allow for high-precision analysis across thousands of pages. (A page may also discuss the applications for water and sodium hypochlorite, but only the list is being analyzed.) Knowing the CSS selector will greatly simplify the analysis and increase its accuracy.
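Since jsoup treats the argument to getElementsMatchingOwnText as a regular expression, the list of chemical names has to be turned into a pattern first. A minimal sketch of how that could be done with the standard library is below; the class and method names (`ChemicalNamePattern`, `fromNames`) are hypothetical, and the case-insensitive, whole-word matching is my own assumption about what is wanted.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper (not part of jsoup): builds one regex from a list of names.
class ChemicalNamePattern {

    /** Joins the names into a single case-insensitive alternation of literals. */
    static Pattern fromNames(List<String> names) {
        String alternation = names.stream()
                .map(Pattern::quote)              // treat each name as literal text
                .collect(Collectors.joining("|"));

        // The lookarounds reject matches embedded in a longer word ("waterfall").
        // Whole-word matching is an assumption; drop them to match substrings.
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)",
                Pattern.CASE_INSENSITIVE);
    }
}
```

As far as I can tell, jsoup also accepts a compiled Pattern, so the result could be passed in as chemicals.getElementsMatchingOwnText( ChemicalNamePattern.fromNames( names ) ).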
The alternative is to process the entire page looking for "groups" of chemicals, then try to extract the list. Both of those steps are difficult; using a CSS selector to jump to the exact spot in the page is far more efficient, and likely far more accurate. Either approach will require some hand-crafting, but I'd like to automate as much as possible.
Problem
The aforementioned APIs do not appear to have methods that generate a CSS selector given an Element instance (the more unique the better). It is possible to iterate through the parent elements and generate the selector manually. This has been demonstrated using JavaScript in a few questions. There are also answers for generating an XPath, and it might be possible using Selenium.
Specifically, how would you do something like:
String selector = element.getCSSPath();
Elements elements = document.select( selector );
This would:
- Return the CSS selector for the given element.
- Search a document for the given CSS selector.
- Return a list of elements that match the selector.
The second line is not an issue; the first line is problematic.
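For reference, the manual parent walk mentioned above could be sketched roughly as follows. This only illustrates the kind of code I would rather not write and maintain myself; it is not an existing API. It uses the JDK's built-in org.w3c.dom types rather than jsoup so that it runs without extra dependencies. The names (`CssPathSketch`, `cssPath`, `childIndex`, `parse`) are made up, and it assumes well-formed markup whose id values are unique and need no CSS escaping.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: derives a CSS selector by walking up from an element.
class CssPathSketch {

    /** Walks from the element towards the root, emitting one step per ancestor. */
    static String cssPath(Element element) {
        Deque<String> steps = new ArrayDeque<>();
        for (Node node = element; node instanceof Element; node = node.getParentNode()) {
            Element current = (Element) node;
            String id = current.getAttribute("id"); // empty string when absent
            if (!id.isEmpty()) {
                // Assumption: ids are unique, so the walk can stop at the first one.
                steps.addFirst(current.getTagName() + "#" + id);
                break;
            }
            boolean isRoot = !(current.getParentNode() instanceof Element);
            steps.addFirst(isRoot
                    ? current.getTagName()
                    : current.getTagName() + ":nth-child(" + childIndex(current) + ")");
        }
        return String.join(" > ", steps);
    }

    /** 1-based position among element siblings, which is how :nth-child counts. */
    static int childIndex(Element element) {
        int index = 1;
        for (Node sibling = element.getPreviousSibling(); sibling != null;
                sibling = sibling.getPreviousSibling()) {
            if (sibling.getNodeType() == Node.ELEMENT_NODE) {
                index++; // text and comment nodes are skipped
            }
        }
        return index;
    }

    /** Parses well-formed (XHTML-like) markup with the JDK's XML parser. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }
}
```

Applied to the "sodium hypochlorite" span in the first sample above, this yields div#chemical-list > div:nth-child(2) > span:nth-child(1). That selector is specific to one element, though; generalizing it to cover every entry in the list (for example by dropping the :nth-child parts) is still left to do by hand.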
Question
What API can generate a CSS selector (as unique as possible) from a DOM element?
If there is no existing API, then that would be nice to know.