xpath find link containing HTML in page

Question

This is not the same question as "xpath find specific link in page". I've got <a href="http://example.com">foo <em class="bar">baz</em>.</a>. and need to find the link by its full content, foo <em class="bar">baz</em>. including the closing dot.

score 1 · Answer 1 · answered Jul 15 '15 at 02:29

As far as I understand, XPath can't see the raw HTML markup; it works on an abstraction of the HTML document (the node tree). Encoding as much of the markup's information as possible in an XPath expression would yield something like this:

//a[
    node()[1][self::text() and .='foo ']
    /following-sibling::node()[1][self::em[@class='bar' and .='baz']]
    /following-sibling::node()[1][self::text() and .='.']
]

A brief explanation of the predicates used:

node()[1][self::text() and .='foo '] : the first child node is a text node whose value equals "foo " (with the trailing space)
/following-sibling::node()[1][self::em[@class='bar' and .='baz']] : immediately followed by an <em> element whose class equals "bar" and whose value equals "baz"
/following-sibling::node()[1][self::text() and .='.'] : immediately followed by a text node whose value equals "."
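
For readers without a full XPath 1.0 engine at hand, the same positional check can be approximated with Python's standard-library ElementTree. This is only a sketch: the helper name matches_exact_structure is made up for illustration, the fragment is assumed to be well-formed XML, and ElementTree drops comments by default, so it is not a node-for-node equivalent of node()[1].

```python
import xml.etree.ElementTree as ET


def matches_exact_structure(a):
    """Hypothetical helper: text 'foo ', then <em class="bar">baz</em>, then text '.'."""
    # a.text is the text before the first child element
    if a.text != "foo " or len(a) == 0:
        return False
    first = a[0]
    return (
        first.tag == "em"
        and first.get("class") == "bar"
        and "".join(first.itertext()) == "baz"  # string-value of <em>
        and first.tail == "."                   # text right after </em>
    )


link = ET.fromstring('<a href="http://example.com">foo <em class="bar">baz</em>.</a>')
plain = ET.fromstring('<a href="http://example.com">foo baz.</a>')

print(matches_exact_structure(link))   # True
print(matches_exact_structure(plain))  # False
```

The second link has the same visible text but no <em>, which is exactly the case the positional predicate is meant to reject.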

score 1 · Accepted Answer · answered Jul 16 '15 at 10:40

Note: I'm following up on OP's comment

A (visually) simpler variation of OP's own answer could be:

//a[. = "foo baz."][em[@class = "bar"] = "baz"]

or even:

//a[.="foo baz." and em[@class="bar"]="baz"]

(assuming you want to select the <a> node, and not the child <em>)

Regarding OP's question:

why the [em[]= doesn't need the dot?

Inside a predicate, testing a node-set with = against a string on the right compares the string with the string-value of the nodes on the left: here <em> is represented by its string-value, i.e. what string() would return.

XPath 1.0 specification document has an example of this:

chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction"

Later, the same spec says this about comparisons:

If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
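
To make the quoted rule concrete, here is a rough emulation in plain Python using the standard-library ElementTree. The helpers string_value and nodeset_equals are hypothetical names introduced only for this sketch; they imitate the spec's wording and are not part of any XPath library.

```python
import xml.etree.ElementTree as ET


def string_value(node):
    """String-value of an element: all descendant text, concatenated."""
    return "".join(node.itertext())


def nodeset_equals(nodes, text):
    """node-set = string: true if ANY node's string-value equals the string."""
    return any(string_value(n) == text for n in nodes)


a = ET.fromstring('<a href="http://example.com">foo <em class="bar">baz</em>.</a>')

# Rough equivalent of the predicate em[@class="bar"]="baz"
ems = a.findall("em[@class='bar']")
print(nodeset_equals(ems, "baz"))     # True

# Rough equivalent of .="foo baz."
print(string_value(a) == "foo baz.")  # True
```

The any() is the important part: the comparison is existential over the node-set, which is why no explicit . is needed on the left-hand side.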

In OP's answer, the final step /em[@class='bar' and .='baz'] needs the . because there the <em> is itself the context node of the predicate, so the test against 'baz' has to refer to it explicitly.

Note that my answer is somewhat naive and assumes there's only one <em> child of <a>: [em[@class="bar"]="baz"] only requires that some em[@class="bar"] child match the string-value condition, not that it be the only or the first one.

Consider this input (a second <em> child, but empty):

<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.

and this test using Scrapy selectors:

>>> import scrapy
>>> s = scrapy.Selector(text="""<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.""")
>>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
u'<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
>>>

The XPath matches but you may not want this.
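
If the extra <em> should disqualify the link, one option is to also require exactly one <em> child (in full XPath 1.0 that could be expressed by adding count(em)=1 to the predicate). The sketch below shows the difference using only the standard library; strict_match is a hypothetical helper, the wrapping <p> is added just to get a single root, and the loose query uses ElementTree's limited XPath subset (the [.='text'] form needs Python 3.7+), which cannot express the @class test on the child.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<p><a href="http://example.com">foo <em class="bar">baz</em>'
    '<em class="bar"></em>.</a>.</p>'
)


def strict_match(a):
    """Hypothetical stricter check: exactly one <em> child, and it must match."""
    ems = a.findall("em")
    return (
        "".join(a.itertext()) == "foo baz."
        and len(ems) == 1
        and ems[0].get("class") == "bar"
        and "".join(ems[0].itertext()) == "baz"
    )


# Loose: some <em> child has string-value "baz" (existential, like the XPath above)
loose = doc.findall(".//a[.='foo baz.'][em='baz']")
# Strict: the empty second <em> makes the link fail
strict = [a for a in doc.iter("a") if strict_match(a)]

print(len(loose), len(strict))
```

With the input above the loose query still finds the link, while the strict check rejects it because of the second, empty <em>.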

score 0 · Answer 3 · answered Jul 15 '15 at 02:34

This is not 100% accurate, because string() strips any other HTML tags that might be inside the link, but for my purposes it looks good enough:

//a[string() = 'foo baz.']/em[@class='bar' and .='baz']

chx


You could even write `//a[.="foo baz."][em[@class="bar"]="baz"]` (selecting the `<a>` node) – paul trmbrth Jul 15 '15 at 21:38
@paultrmbrth if that would be an answer I would accept it. – chx Jul 16 '15 at 03:38
But also why does this work, why the `[em[]=` doesn't need the dot? – chx Jul 16 '15 at 07:08
