Search both elements and attributes for string

Question

I'm trying to query some HTML to find links that contain the word "download" anywhere. It can be in:

the id
the class
the href
the text
any HTML within the a tag

So, using the Python lxml library, it should find all 7 links in the test HTML:

html = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""

from lxml import etree

tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//a[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)

print('FOUND ELEMENTS:', len(elements))
for i in elements:
    print(i.get('href'), i.text)

When this is run, however, it only finds the first five links. It seems that XPath only finds "download" in the text when the a tag contains no further HTML.

Is there a way to consider the contents of the a tag as a regular string and see if that contains "download"? All tips are welcome!

[EDIT]

Using the tips in heinst's answer below, I changed the code as follows. It now works, but it's not very elegant. Does anybody know a solution in pure XPath?

from lxml import etree
tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)

print('FOUND ELEMENTS:', len(elements))
for el in elements:
    href = el.get('href')
    if href:
        print(href, el.text)
    else:
        elparent = el
        for _ in range(10):  # climb at most 10 ancestors
            elparent = elparent.getparent()
            if elparent is None:  # reached the root without finding an href
                break
            href = elparent.get('href')
            if href:
                print(href, elparent.text)
                break

score 2 · Accepted Answer · edited May 23 '17 at 12:04

Pure XPath Solution

Change text() to . and search the descendant-or-self axis for the attributes:

//a[(.|.//@id|.//@class|.//@href)[contains(translate(.,'DOWNLOAD','download'),'download')]]

Explanation:

text() vs .: Here text() will match an immediate text node child of a; . will match the string value of the a element. In order to capture cases where there are child elements of a containing the target text, you want to match the string value of a.
descendant-or-self: In order to match attributes on both the a and any of its descendants, the descendant-or-self axis (.//) is used.

For more details on string values in XPath, see "Matching text nodes is different than matching string values".
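
As a rough check of the explanation above, the following sketch runs the question's original expression and the revised one against the same test HTML, written for Python 3. The names HTML, MATCH, ORIGINAL and REVISED are only illustrative and do not appear in the question; the sketch assumes lxml is installed and that the HTML parser keeps each div nested inside its a, as the question's own results suggest.

```python
from lxml import etree

# Test document copied from the question.
HTML = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""

# Shared predicate: lower-case the letters of "download", then do a substring test.
MATCH = "[contains(translate(., 'DOWNLOAD', 'download'), 'download')]"

# Original: attributes of <a> and its immediate text-node children only.
ORIGINAL = "//a[(@id|@class|@href|text())" + MATCH + "]"

# Revised: string value of <a>, plus attributes on <a> and on any descendant.
REVISED = "//a[(.|.//@id|.//@class|.//@href)" + MATCH + "]"

tree = etree.fromstring(HTML, etree.HTMLParser())

original_hrefs = [a.get("href") for a in tree.xpath(ORIGINAL)]
revised_hrefs = [a.get("href") for a in tree.xpath(REVISED)]

# Link 6 has no text-node child of its own, but its string value
# includes the text of the nested <div>.
link6 = tree.xpath("//a[@href='/test6']")[0]
link6_text_nodes = link6.xpath("text()")
link6_string_value = link6.xpath("string(.)")

print("original:", len(original_hrefs), original_hrefs)
print("revised: ", len(revised_hrefs), revised_hrefs)
print("link 6 text():", link6_text_nodes, "string(.):", repr(link6_string_value))
```

The original expression should miss links 6 and 7, which is what the question reports, while the revised one should return all seven a elements directly, so no parent-walking loop is needed to recover the href.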

score 1 · Answer 2 · answered Jan 05 '16 at 14:25

Changing your XPath selector from strictly matching a tags to a wildcard should do the trick: "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"

answered Jan 05 '16 at 14:25

heinst

Thanks for your suggestion, but when using the wildcard it will find the divs within the `a` element, instead of the `a` element itself. In the end I need the `href` from the `a`, so I really need to find the `a` element. Any other idea? – kramer65 Jan 05 '16 at 14:30
You could get the parent nodes in your for loop and get the `a` tags that way – heinst Jan 05 '16 at 14:44
Thank you, that was a good tip. I managed to get it to work (see added code in my question) but it's not very elegant. Is there no way of doing this using pure xpath? – kramer65 Jan 05 '16 at 15:04