I'm trying to query some HTML to find links that contain the word "download" anywhere. The word can appear in:

  1. the id
  2. the class
  3. the href
  4. the text
  5. any html within the a tag.

Using the Python lxml library, the query should find all 7 links in this test HTML:

html = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""

from lxml import etree

tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//a[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)

print('FOUND ELEMENTS:', len(elements))
for i in elements:
    print(i.get('href'), i.text)

When I run this, however, it only finds the first five elements. It seems that XPath only finds "download" in the text when the a tag contains no further HTML.

Is there a way to consider the contents of the a tag as a regular string and see if that contains "download"? All tips are welcome!

[EDIT]

Using the tips in heinst's answer below, I edited the code as follows. This now works, but it's not very elegant. Does anybody know a solution in pure XPath?

from lxml import etree
tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)

print('FOUND ELEMENTS:', len(elements))
for el in elements:
    href = el.get('href')
    if href:
        print(href, el.text)
    else:
        elparent = el
        for _ in range(10):  # walk up at most 10 ancestors
            elparent = elparent.getparent()
            if elparent is None:  # reached the root without finding a link
                break
            href = elparent.get('href')
            if href:
                print(href, elparent.text)
                break
kramer65

2 Answers

Pure XPath Solution

Change text() to . and search the descendant-or-self axis for the attributes:

//a[(.|.//@id|.//@class|.//@href)[contains(translate(.,'DOWNLOAD','download'),'download')]]

Explanation:

  • text() vs .: Here text() will match an immediate text node child of a; . will match the string value of the a element. In order to capture cases where there are child elements of a containing the target text, you want to match the string value of a.
  • descendant-or-self: In order to match attributes on both the a and any of its descendants, the descendant-or-self axis (.//) is used.
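The difference between text() and the string value can be seen in isolation. The following is a minimal sketch using only link 6 from the question's test HTML; the variable names (snippet, link, direct_text, string_value) are illustrative and not part of the original post.

```python
from lxml import etree

# Link 6 from the question's test HTML: the word sits inside a child <div>.
snippet = '<a href="/test6"><div id="test6">download</div></a>'
link = etree.fromstring(snippet, etree.HTMLParser()).xpath("//a")[0]

# text() selects only text nodes that are direct children of <a>.
# Here the only text node belongs to the <div>, so nothing is selected.
direct_text = link.xpath("text()")

# string(.) returns the string value of <a>: the concatenation of all
# descendant text nodes, including the text inside the <div>.
string_value = link.xpath("string(.)")

print("text():   ", direct_text)
print("string(.):", string_value)
```

Because contains() converts a node argument to its string value, applying the predicate to . rather than to text() lets it see the nested "download".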

For more details on string values in XPath, see Matching text nodes is different than matching string values.
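As a quick check, the expression above can be run against the test HTML from the question. This is a sketch assuming lxml is installed; the names pure_xpath, links, and hrefs are illustrative.

```python
from lxml import etree

# Test HTML copied from the question.
html = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""

tree = etree.fromstring(html, etree.HTMLParser())

# For each <a>, test its own string value plus any id, class, or href
# attribute on the <a> itself or on one of its descendants.
pure_xpath = (
    "//a[(.|.//@id|.//@class|.//@href)"
    "[contains(translate(.,'DOWNLOAD','download'),'download')]]"
)

links = tree.xpath(pure_xpath)
hrefs = [a.get("href") for a in links]

print("FOUND ELEMENTS:", len(links))
for a in links:
    print(a.get("href"))
```

Each result is the a element itself, so the href can be read directly without walking up through parent nodes.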

kjhughes

Changing your XPath selection from strictly matching a tags to a wildcard should do the trick: "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
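Note that the wildcard selects whichever element carries the match, which may be a descendant of the link rather than the link itself. One possible way to map those matches back to their enclosing a elements, sketched here as an assumption and not part of the original suggestion, is to append an ancestor-or-self::a step. The reduced test HTML and the variable names are illustrative.

```python
from lxml import etree

# Reduced test HTML: two links whose match sits on a nested <div>,
# plus one unrelated link that should not be returned.
html = """<html><body>
<a href="/test6"><div id="test6">download</div></a>
<a href="/test7"><div id="download">test7</div></a>
<a href="/other">unrelated</a>
</body></html>"""

tree = etree.fromstring(html, etree.HTMLParser())

wildcard = (
    "//*[(@id|@class|@href|text())"
    "[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
)

# Assumption: the extra step walks from each matching element up to the
# enclosing <a> (or keeps the element if it is already an <a>).
links = tree.xpath(wildcard + "/ancestor-or-self::a")
hrefs = [a.get("href") for a in links]

print(hrefs)
```

XPath node-sets contain no duplicates, so a link is returned once even if several of its descendants match.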

heinst
  • Thanks for your suggestion, but when using the wildcard it will find the divs within the `a` element, instead of the `a` element itself. In the end I need the `href` from the `a`, so I really need to find the `a` element. Any other idea? – kramer65 Jan 05 '16 at 14:30
  • You could get the parent nodes in your for loop and get the `a` tags that way – heinst Jan 05 '16 at 14:44
  • Thank you, that was a good tip. I managed to get it to work (see added code in my question) but it's not very elegant. Is there no way of doing this using pure xpath? – kramer65 Jan 05 '16 at 15:04