<div class="exclude myclass moreclass">
    <div>
        <a href=""></a>
    </div>
</div>
<div class="otherclass">
  <a href=""></a>
</div>
<div class="myclass">
  <a href=""></a>
</div>

I need to select all the <a> elements that do not have ancestors (several levels up) with class "exclude". How can I do it with jsoup?

BoltClock
derrdji

1 Answer


With selectors, this is a non-trivial problem, especially if you start from the <a> elements and work your way up. A descendant selector matches as soon as any one ancestor satisfies the left-hand side, so combining :not() with a descendant combinator does not always work as expected. In other words, something like

doc.select("div:not(.exclude) a");

won't work: in your first example, the intermediate classless <div> itself matches div:not(.exclude), so the <a> inside it is still selected. For the rest of your elements, any <div> higher up that lacks the class may also match :not(.exclude).

One very simple workaround is to do it in two separate steps, which means using two separate selectors:

  1. Select all <a> elements.
  2. Then, manually exclude the ones that do have an ancestor with the specific class.

In CSS, this is achieved with an overriding rule. In jsoup, you use the not() method of Elements, which gives you the selected elements minus those that match the given selector. (Note that this is different from the :not() pseudo-class, which doesn't currently accept complex selectors in CSS, although jsoup may implement it differently.)

doc.select("a").not(".exclude a");
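To make the difference between the two approaches concrete, here is a minimal sketch in plain Java that runs without jsoup on the classpath. The Node class, the helper methods, and the hand-built tree are all hypothetical stand-ins for jsoup's document model; only the matching logic is the point. It models what "div:not(.exclude) a" tests (some <div> ancestor lacks the class) against what the two-step approach tests (no ancestor has the class), using the markup from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludeDemo {
    // Hypothetical stand-in for a DOM element: tag name, class names, parent link.
    static final class Node {
        final String tag;
        final Set<String> classes;
        final Node parent;

        Node(String tag, Node parent, String... classes) {
            this.tag = tag;
            this.parent = parent;
            this.classes = new HashSet<>(Arrays.asList(classes));
        }
    }

    // True if any ancestor of n carries the class; this is what ".exclude a" checks for an <a>.
    static boolean hasAncestorWithClass(Node n, String cls) {
        for (Node p = n.parent; p != null; p = p.parent) {
            if (p.classes.contains(cls)) return true;
        }
        return false;
    }

    // True if any <div> ancestor lacks the class; this is what "div:not(.exclude) a" checks.
    static boolean hasDivAncestorWithoutClass(Node n, String cls) {
        for (Node p = n.parent; p != null; p = p.parent) {
            if (p.tag.equals("div") && !p.classes.contains(cls)) return true;
        }
        return false;
    }

    // Builds the three <a> elements from the question's markup, in document order.
    static List<Node> sampleLinks() {
        Node excluded = new Node("div", null, "exclude", "myclass", "moreclass");
        Node inner = new Node("div", excluded); // the classless intermediate <div>
        Node other = new Node("div", null, "otherclass");
        Node mine = new Node("div", null, "myclass");
        List<Node> links = new ArrayList<>();
        links.add(new Node("a", inner));
        links.add(new Node("a", other));
        links.add(new Node("a", mine));
        return links;
    }

    // Models the single selector: keeps every <a> with a non-excluded <div> ancestor.
    static List<Node> naive(List<Node> links) {
        List<Node> kept = new ArrayList<>();
        for (Node a : links) {
            if (hasDivAncestorWithoutClass(a, "exclude")) kept.add(a);
        }
        return kept;
    }

    // Models the two steps: start from all <a>, then drop those under an .exclude ancestor.
    static List<Node> twoStep(List<Node> links) {
        List<Node> kept = new ArrayList<>();
        for (Node a : links) {
            if (!hasAncestorWithClass(a, "exclude")) kept.add(a);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Node> links = sampleLinks();
        System.out.println("naive: " + naive(links).size());     // prints "naive: 3"
        System.out.println("two-step: " + twoStep(links).size()); // prints "two-step: 2"
    }
}
```

The first <a> survives the naive check only because of the classless <div> between it and the excluded one, which is exactly the case the two-step filter handles.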

If you're restricted to a single selector for whatever reason, then you may be stuck. You will have to look at the HTML you're working with and see whether you can construct a selector from what you know about its structure. For example, you can look at the parent(s) of those elements that may have the class exclude. Will the class only appear on these top-level <div> elements that share the same parent? If so, you can use a child selector to anchor the :not(.exclude) to the parent:

doc.select("#parent > div:not(.exclude) a");

Although a descendant selector is still used to locate the <a> element from div:not(.exclude), it will never match the classless element because the classless element is not a child of this hypothetical parent element.
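The effect of the child combinator can be sketched the same way, again in plain Java with a hypothetical Node class rather than jsoup itself. The sketch assumes the three top-level <div> elements from the question are wrapped in an element with id "parent", which is not in the original markup. An <a> matches only if one of its ancestors is a <div> without the exclude class and that <div> is a direct child of the anchor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnchoredDemo {
    // Hypothetical stand-in for a DOM element: tag name, id, class names, parent link.
    static final class Node {
        final String tag;
        final String id; // null when the element has no id
        final Set<String> classes;
        final Node parent;

        Node(String tag, String id, Node parent, String... classes) {
            this.tag = tag;
            this.id = id;
            this.parent = parent;
            this.classes = new HashSet<>(Arrays.asList(classes));
        }
    }

    // Models "#parent > div:not(.exclude) a" for a single <a> element.
    static boolean matchesAnchored(Node a) {
        for (Node p = a.parent; p != null; p = p.parent) {
            boolean isDivWithoutExclude = p.tag.equals("div") && !p.classes.contains("exclude");
            boolean isChildOfAnchor = p.parent != null && "parent".equals(p.parent.id);
            if (isDivWithoutExclude && isChildOfAnchor) return true;
        }
        return false;
    }

    // The question's markup, wrapped in an assumed <div id="parent">.
    static List<Node> sampleLinks() {
        Node root = new Node("div", "parent", null);
        Node excluded = new Node("div", null, root, "exclude", "myclass", "moreclass");
        Node inner = new Node("div", null, excluded); // classless, but not a child of #parent
        Node other = new Node("div", null, root, "otherclass");
        Node mine = new Node("div", null, root, "myclass");
        List<Node> links = new ArrayList<>();
        links.add(new Node("a", null, inner));
        links.add(new Node("a", null, other));
        links.add(new Node("a", null, mine));
        return links;
    }

    // Keeps only the <a> elements the anchored selector would match.
    static List<Node> anchored(List<Node> links) {
        List<Node> kept = new ArrayList<>();
        for (Node a : links) {
            if (matchesAnchored(a)) kept.add(a);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("anchored: " + anchored(sampleLinks()).size()); // prints "anchored: 2"
    }
}
```

This only holds while the exclude class appears on direct children of the anchor; if it can show up deeper in the tree, the anchored check would let those links through.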

If you cannot make any assumptions based on the structure about where the exclude class may appear, and you can't exclude the undesired elements separately, then there isn't much of a solution to this problem.

Community
BoltClock