1

I have a basic table structured like this:

<tbody>
  <tr>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td></td>
  </tr>
</tbody> etc....

I'm trying to target a link <a> element in one cell only if the cell above it does NOT contain a word. For example:

<tbody>
  <tr>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td><b>Fire Sale!</b></td>
  </tr>
  <tr>
    <td></td>
    <td><a href="something">linktext</a></td>
  </tr>
</tbody>

So I'd want to target the <a> only if the cell above it does NOT contain "Fire Sale!".

The problem is that no matter what I do, I can't get the conditional axes to find the cell directly above:

<tbody>
  <tr>
    <td></td>
    <td><b>Fire Sale!</b></td>
  </tr>
  <tr>
    <td><a href="somethingelse">link I don't want</a></td>
    <td><a href="something">linktext</a></td>
  </tr>
</tbody>

I've tried stuff like:

//tr/td/b/a[@href]/ancestor::tbody/tr/td/b[contains(text(),'Fire Sale!')]

But no matter what, because of the odd relationship between tr and td, the condition always ends up true. That is, the cells share most of the same ancestor tree, and navigating back down to the <td> above my main target seems impossible. Is there some way to use variables? I feel count() might help, but I'm not sure of the syntax for the whole thing.

Any ideas?

EDIT: Here is the real HTML

<table border="0" width="100%" style="border-collapse: collapse">
    <tr>
        <td width="33%" valign="top" height="225" align="center"><img border="0" src="" width="296" height="225"></td>
        <td width="33%" valign="top" height="225" align="center"><br><br><br><br><b>Unassigned</b></td>
        <td width="33%" valign="top" height="225" align="center"></td>
    </tr>
    <tr>
        <td width="33%" valign="top" height="30" align="center"><b><a href="">AAAAA</a></b><br>
                <b>XXXXXXX</b><br><b><font color="#FF0000">YYYYYYYYY</font><br></b><br></td>
        <td width="33%" valign="top" height="30" align="center"><b><a href="">BBBBB</a><br></b><br></td>
        <td width="33%" valign="top" height="30" align="center"></td>
    </tr>
    <tr>
        <td width="100%" colspan="4" height="80" align="center">
        | <a href=""> Home</a> |<br>
        | <a href="">Design</a> 
        | <a href="">Styles</a> 
        | <a href="">X Listings</a> 
        | <a href="">Y Listings</a> |<br>
        | <a href="">About the Author</a> |</td>
    </tr>
    <tr>
        <td width="100%" colspan="4" height="60" align="center">
        Copyright Some Dude, 2020<br>
        Email: <a href="">someperson@somewhere.com</a></td>
    </tr>
</table>

So basically I want the link containing BBBBB only if the word 'Unassigned' does not appear above it.

EDIT 2 to clarify that the links should only be targeted when text in the above cell does NOT exist.

Lennon
  • The term `above` is somewhat unclear. Does above mean visually? Or does it mean a previous td in the same row? I.e., in your real HTML example, should it return a result or not? – Siebe Jongebloed Jun 01 '22 at 07:38
  • Above means visually, yes. In technical terms, it is the <td> in the <tr> "above", so visually it is the cell above the cell containing the target. – Lennon Jun 01 '22 at 13:53

4 Answers

1

Try the following somewhat complex XPath-1.0 expression. It returns the href attribute of the <a> links in the cell whose column index matches that of the cell in the preceding row containing a given string:

//tr/td[count(../preceding-sibling::tr[1]/td[contains(.,'Fire Sale!')]/preceding-sibling::td)+1]/a/@href

EDIT1:
A stricter version, which selects the link only if the new given value "Unassigned" is present in the preceding row, is the following:

//tr[preceding-sibling::tr[1]/td[contains(.,'Unassigned')]]/td[count(../preceding-sibling::tr[1]/td[contains(.,'Unassigned')]/preceding-sibling::td)+1]//a
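Both expressions test `contains(., ...)` rather than `contains(text(), ...)`. In XPath 1.0, `.` is the string value of the cell (all descendant text joined), while `text()` passed to `contains()` is reduced to the first text node that is a direct child of the cell. The following is only a rough Python analogue using the standard library's `xml.etree.ElementTree`, not an XPath evaluation; the cell markup is a cut-down copy of the "Unassigned" cell from the question.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the "Unassigned" cell from the question (well-formed XML).
cell = ET.fromstring("<td><br/><br/><b>Unassigned</b></td>")

# Rough analogue of the string value of "." : every descendant text node, joined.
string_value = "".join(cell.itertext())

# Rough analogue of the cell's own text() children: text directly inside <td>,
# excluding anything nested in <b> or other child elements.
direct_text = (cell.text or "") + "".join(child.tail or "" for child in cell)

print("Unassigned" in string_value)  # True
print("Unassigned" in direct_text)   # False
```

Because the word sits inside a nested <b>, only the string-value check finds it, which is why the expressions above use `.`.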
zx485
  • This definitely looks on the right track though I'm still not able to match but I think I just need to massage it a bit more which I'm working on. However, I was wondering what the '.' as the first parameter of contains() does? Is that like the regex 'any'? – Lennon May 31 '22 at 23:07
  • The difference is explained [here at SO](https://stackoverflow.com/q/38240763/1305969). In case of simple text nodes, `text()` will work, too. – zx485 May 31 '22 at 23:28
  • I think I found the problem. In my example there are only two rows, so the tr[1] within the count function works as a reference. But it breaks down when there is a variable number of other rows above (as preceding-siblings). It needs to reference the row immediately above, which may or may not be the actual first row (tr) of any given table, since the table sizes (number of trs) vary. – Lennon Jun 01 '22 at 00:07
  • In my expression, the `[1]` refers to the list of preceding siblings: it selects the _first_ `preceding-sibling::tr`, vulgo the direct preceding tr node. – zx485 Jun 01 '22 at 00:12
  • Yes, you are correct sir, my mistake. So the problem is me #$%@ the original question. I needed to grab the cells that do NOT have certain text, not those that contain text. It appears when flipping around the condition the logic of the count() doesn't work anymore. Going to fix my derped post. – Lennon Jun 01 '22 at 00:20
  • So in the latest version it looks like rows (tr) with 'Unassigned' are selected, but this then disregards every other row. The reverse, which is the actual desire, (not(contains(.,'Unassigned'))) has the opposite effect of flagging entire rows that have it, so they're not selected, leaving behind links in cells (tds) that are perfectly valid. I think the selection needs to be more granular, at the <td> level: any cell of the 3 columns may have the term, and only that one cell should be disregarded, not the whole row. In my own testing I either disregard the wrong rows or columns, but can't align on the target. – Lennon Jun 01 '22 at 01:37
0

You might first get the tbody/tr/td/b that contains Fire Sale! and then navigate to the next tr through the ancestor tr.

Note that in your expression this part //tr/td/b/a[@href] would not match as there is no anchor wrapped in a b tag in the example data.

//tbody/tr/td/b[contains(text(),'Fire Sale!')]/ancestor::tr/following-sibling::tr[1]/td/a[@href]
The fourth bird
  • This almost works but I actually need it to NOT contain that word, specifically 'Unassigned' like in my updated example. However, when I try and do that using your example as a base it ends up with 0 targets for some reason. Seems like it should work though...? – Lennon Jun 01 '22 at 14:04
  • @Lennon Do you mean like this? `//tbody/tr/td[2]/b[not(contains(text(),'Unassigned'))]/ancestor::tr/following-sibling::tr/td[2]/b/a[@href]` – The fourth bird Jun 01 '22 at 14:34
  • The td[2] is too specific. The XPath is used in a loop to scrape data from a page that may have anywhere from 1 to ~27 "cells" (td). Any of them can have that word, so it needs to be dynamic in the sense: detect where that word exists, ignore the links (<a>) underneath, and grab all others. – Lennon Jun 01 '22 at 14:45
0

I'll use a simpler search text ("xx") and a slightly more complex element structure to demonstrate my approach.

Using this input:

<tbody>
  <tr>
    <td></td>
    <td><b>xx</b></td>
  </tr>
  <tr>
    <td><a href="somethingelse">link I don't want</a></td>
    <td><a href="something">linktext</a></td>
  </tr>
  
  <tr>
    <td></td>
    <td>xx</td>
    <td>  </td>
    <td>xx</td>
  </tr>
  <tr>
    <td><a href="nok">don't want it</a></td>
    <td><a href="ok">want it</a></td>
    <td><a href="nok">don't want it</a></td>
    <td><a href="ok">want it</a></td>
  </tr>
  
</tbody>

and applying this XPath expression:

    //td[a and count(preceding-sibling::td) = 
parent::tr/preceding-sibling::tr[1]/td[.//text() = 'xx']/count(preceding-sibling::td)]/a

I get the three wanted <a> elements. The idea is to count the number of <td>s before the current cell and check whether the row above (parent::tr/preceding-sibling::tr[1]) has a <td> that contains the search string and has the same number of <td>s before it.
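The same column-matching idea can also be sketched outside XPath, which may help if a tester rejects the expression (the trailing `/count(...)` step needs XPath 2.0 or later). This is a minimal Python sketch using only the standard library, not a translation of the expression: `links_by_cell_above` and `SAMPLE` are hypothetical names, it assumes well-formed markup with no colspan, and it uses substring containment rather than exact text equality. The `want_word_above` flag covers both the original and the negated requirement.

```python
import xml.etree.ElementTree as ET

# Small well-formed sample in the spirit of the input above (assumption).
SAMPLE = """<tbody>
  <tr><td></td><td><b>xx</b></td></tr>
  <tr>
    <td><a href="first">a</a></td>
    <td><a href="second">b</a></td>
  </tr>
</tbody>"""


def links_by_cell_above(tbody, word, want_word_above):
    """Collect hrefs of links whose cell-above does (or does not) contain word."""
    hrefs = []
    rows = tbody.findall("tr")
    # Pair every row with the row directly above it; the first row is skipped.
    for above, row in zip(rows, rows[1:]):
        above_cells = above.findall("td")
        for index, cell in enumerate(row.findall("td")):
            # Same column index in the row above; a missing cell counts as empty.
            if index < len(above_cells):
                text_above = "".join(above_cells[index].itertext())
            else:
                text_above = ""
            if (word in text_above) == want_word_above:
                hrefs.extend(a.get("href") for a in cell.iter("a"))
    return hrefs


tbody = ET.fromstring(SAMPLE)
print(links_by_cell_above(tbody, "xx", True))   # ['second']
print(links_by_cell_above(tbody, "xx", False))  # ['first']
```

Passing `False` gives the links whose cell above does not contain the word, which is the behaviour asked for in the edited question.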

leu
  • Are the spaces around the '=' intentional? I'm getting errors in SelectorsHub when using that code. Also, I updated/fixed the post. I actually need to grab the links when the cell above it does NOT contain a certain text. – Lennon Jun 01 '22 at 01:45
  • This actually looks like the answer if I can just get the test to work in my Xpath tester! Getting a syntax error about the '::' part needing to have tagName after it...but every occurrence does so wth SelectorsHub? – Lennon Jun 01 '22 at 02:09
  • I tested in Oxygen and the code worked. Cannot say much about your XPath tester, sorry. – leu Jun 01 '22 at 07:46
0

Since we cannot use the current() function in plain XPath (it is an XSLT function), the only solution I see is to hard-code the position of the tds.

For example, to test the second column in this HTML (like your real HTML):

<table border="0" width="100%" style="border-collapse: collapse">
  <tr>
    <td width="33%" valign="top" height="225" align="center">
      <img border="0" src="" width="296" height="225"/>
    </td>
    <td width="33%" valign="top" height="225" align="center">
      <br/>
      <br/>
      <br/>
      <br/>
      <b>Unassigned</b>
    </td>
    <td width="33%" valign="top" height="225" align="center"/>
  </tr>
  <tr>
    <td width="33%" valign="top" height="30" align="center">
      <b>
        <a href="">AAAAA</a>
      </b>
      <br/>
      <b>XXXXXXX</b>
      <br/>
      <b>
        <font color="#FF0000">YYYYYYYYY</font>
        <br/>
      </b>
      <br/>
    </td>
    <td width="33%" valign="top" height="30" align="center">
      <b>
        <a href="">BBBBB</a>
        <br/>
      </b>
      <br/>
    </td>
    <td width="33%" valign="top" height="30" align="center"/>
  </tr>
  <tr>
    <td width="100%" colspan="4" height="80" align="center"> | <a href=""> Home</a> |<br/> | <a href="">Design</a> | <a href="">Styles</a> | <a href="">X Listings</a> | <a href="">Y Listings</a> |<br/> | <a href="">About the Author</a> |</td>
  </tr>
  <tr>
    <td width="100%" colspan="4" height="60" align="center"> Copyright Some Dude, 2020<br/> Email: <a href="">someperson@somewhere.com</a></td>
  </tr>
</table>

the XPath would be this:

/table/tr/td[2][not(parent::tr/preceding-sibling::tr[1]/td[2][contains(.,'Unassigned')] )]//a[text()='BBBBB']

This correctly gives no result, because the cell above the BBBBB link contains 'Unassigned'.

To test the third column just change it to this:

/table/tr/td[3][not(parent::tr/preceding-sibling::tr[1]/td[3][contains(.,'Unassigned')] )]//a[text()='BBBBB']

etc.
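Since the column count varies, the per-column expressions could be generated rather than written by hand. The following is a small Python sketch under stated assumptions: `xpath_for_column` and `MAX_COLUMNS` are hypothetical names, the upper bound of 27 comes from the asker's comment, the `[text()='BBBBB']` filter is dropped so that every link in the column is returned, and the search word must not itself contain a single quote.

```python
def xpath_for_column(column, word):
    """Build the hard-coded expression for one 1-based column index."""
    return (
        f"/table/tr/td[{column}]"
        f"[not(parent::tr/preceding-sibling::tr[1]/td[{column}][contains(.,'{word}')])]"
        "//a"
    )


MAX_COLUMNS = 27  # assumption: upper bound mentioned in the question's comments

# One expression per column, joined with the XPath union operator "|".
combined = " | ".join(
    xpath_for_column(i, "Unassigned") for i in range(1, MAX_COLUMNS + 1)
)

print(xpath_for_column(2, "Unassigned"))
```

The joined string can be evaluated as a single XPath-1.0 expression, or the individual expressions can be run one per loop iteration.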

Siebe Jongebloed