This is not the same question as "xpath find specific link in page". I've got <a href="http://example.com">foo <em class="bar">baz</em>.</a>. and need to find the link by its full content, foo <em class="bar">baz</em>., including the closing dot.

chx

3 Answers


As far as I understand, XPath can't see the raw HTML markup; it works on an abstract model of the HTML document (a tree of nodes). Incorporating as much of the information in the markup as possible into an XPath expression would yield something like this:

//a[
    node()[1][self::text() and .='foo ']
    /following-sibling::node()[1][self::em[@class='bar' and .='baz']]
    /following-sibling::node()[1][self::text() and .='.']
]

A brief explanation of the predicates used:

  • node()[1][self::text() and .='foo '] : the first child node is a text node whose value equals "foo " (with the trailing space)
  • /following-sibling::node()[1][self::em[@class='bar' and .='baz']] : directly followed by an <em> element whose class equals "bar" and whose string-value equals "baz"
  • /following-sibling::node()[1][self::text() and .='.'] : directly followed by a text node whose value equals "."
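
The same node-by-node check can be sketched outside XPath. The following is a minimal Python illustration, not part of the expression above: it uses the standard library's xml.etree.ElementTree, whose XPath subset cannot evaluate node() or following-sibling, so the predicate is re-implemented by hand. The function name first_nodes_match and the wrapping <p> element are hypothetical, and the snippet is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

def first_nodes_match(a):
    """Mimic the predicate: text 'foo ', then <em class="bar">baz</em>, then text '.'."""
    children = list(a)
    if not children:
        return False
    em = children[0]
    return (
        a.text == "foo "                        # first child node is the text 'foo '
        and em.tag == "em"                      # directly followed by an <em> ...
        and em.get("class") == "bar"            # ... with class="bar"
        and "".join(em.itertext()) == "baz"     # ... whose string-value is 'baz'
        and em.tail == "."                      # directly followed by the text '.'
    )

# Hypothetical sample: one link with the markup, one with plain text only.
doc = ET.fromstring(
    '<p><a href="http://example.com">foo <em class="bar">baz</em>.</a>.'
    '<a href="http://example.org">foo baz.</a></p>'
)
matches = [a.get("href") for a in doc.iter("a") if first_nodes_match(a)]
print(matches)  # ['http://example.com']
```

Only the first link is reported, because the second one has the same text but no <em> child.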
har07

Note: I'm following up on OP's comment

A (visually) simpler variation of OP's own answer could be:

//a[. = "foo baz."][em[@class = "bar"] = "baz"]

or even:

//a[.="foo baz." and em[@class="bar"]="baz"]

(assuming you want to select the <a> node, and not the child <em>)

Regarding OP's question:

why the [em[]= doesn't need the dot?

Inside a predicate, comparing a node-set on the left with a string on the right using = compares the string-value of each node with that string; here that is the string-value of <em>, i.e. what string() would return for it.

XPath 1.0 specification document has an example of this:

chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction"

Later, the same spec says on boolean tests:

If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
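
As a rough illustration of that rule, the comparison can be imitated in Python. This is only a sketch using the standard library's xml.etree.ElementTree rather than a real XPath 1.0 engine; string_value and nodeset_equals are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def string_value(node):
    """Concatenate all descendant text, like XPath's string-value of an element."""
    return "".join(node.itertext())

def nodeset_equals(nodes, text):
    """True if at least one node in the set has a string-value equal to text."""
    return any(string_value(n) == text for n in nodes)

a = ET.fromstring('<a href="http://example.com">foo <em class="bar">baz</em>.</a>')
ems = a.findall("em[@class='bar']")   # plays the role of em[@class="bar"]

print(string_value(a))                # foo baz.
print(nodeset_equals(ems, "baz"))     # True
print(nodeset_equals([], "baz"))      # False: an empty node-set never matches
```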

In OP's answer, //a[string() = 'foo baz.']/em[@class='bar' and .='baz'], the . is needed because the comparison with 'baz' applies to the context node itself (the <em> being filtered), not to one of its children.

Note that my answer is somewhat naive and assumes there's only 1 <em> child of <a>, because [em[@class="bar"]="baz"] is looking for one em[@class="bar"] matching the string-value condition, not that it's the only or first one.

Consider this input (a second <em class="bar"> child, but empty):

<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.

and this test using Scrapy selectors:

>>> import scrapy
>>> s = scrapy.Selector(text="""<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.""")
>>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
u'<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'

The XPath matches but you may not want this.
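
If that is a concern, one possible tightening is to also require that the <em> is the only element child. The sketch below imitates both checks in Python with the standard library's xml.etree.ElementTree (not Scrapy, and not a real XPath engine); loose_match and strict_match are hypothetical names. In XPath terms, the stricter variant would roughly correspond to adding count(*) = 1 to the predicate.

```python
import xml.etree.ElementTree as ET

def string_value(node):
    """Concatenate all descendant text, like XPath's string-value of an element."""
    return "".join(node.itertext())

def loose_match(a):
    """Imitates //a[.="foo baz." and em[@class="bar"]="baz"] for a single <a>."""
    ems = a.findall("em[@class='bar']")
    return string_value(a) == "foo baz." and any(string_value(em) == "baz" for em in ems)

def strict_match(a):
    """Same as loose_match, but the <a> must have exactly one element child."""
    return loose_match(a) and len(list(a)) == 1

one_em = ET.fromstring('<a href="http://example.com">foo <em class="bar">baz</em>.</a>')
two_ems = ET.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
)
print(loose_match(one_em), strict_match(one_em))    # True True
print(loose_match(two_ems), strict_match(two_ems))  # True False
```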

paul trmbrth

This is not 100% accurate, because string() strips the markup, so other HTML tags inside the link would go unnoticed, but for my purposes it looks good enough:

//a[string() = 'foo baz.']/em[@class='bar' and .='baz']
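
To make the caveat concrete, here is a small Python sketch (standard library xml.etree.ElementTree only, with made-up sample links) showing that differently marked-up links share the same string-value. That is why the string() test alone cannot tell them apart and the trailing em[...] step is still needed. string_value is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def string_value(node):
    """Concatenate all descendant text, like XPath's string() on an element."""
    return "".join(node.itertext())

# Hypothetical links: same visible text, different markup.
samples = [
    '<a>foo <em class="bar">baz</em>.</a>',
    '<a>foo <strong>baz</strong>.</a>',
    '<a><span>foo </span>baz.</a>',
]
values = [string_value(ET.fromstring(s)) for s in samples]
print(values)  # ['foo baz.', 'foo baz.', 'foo baz.']
```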
chx