I'm trying to query some HTML to find links which somehow contain the word "download". So it can be in
- the id
- the class
- the href
- the text
- any html within the a tag.
So, using the Python lxml library, it should find all 7 links in this test HTML:
html = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""
from lxml import etree
tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//a[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)
print('FOUND ELEMENTS:', len(elements))
for i in elements:
    print(i.get('href'), i.text)
When I run this, however, it finds only the first five elements. It seems XPath only finds "download" in the text of a link when that link does not contain further HTML.
Is there a way to treat the contents of the a tag as a regular string and check whether that contains "download"? All tips are welcome!
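A minimal sketch of why the nested links seem to be missed, assuming the cause is that text() selects only the direct text-node children of the a element, whereas the string value of the element (.) concatenates the text of all its descendants. The two-link snippet is cut down from the test HTML above, and the variable names are only illustrative.

```python
from lxml import etree

# Two links reduced from the test HTML: one with plain text, one with nested markup.
snippet = """
<html><body>
<a href="/test4">DoWnLoAd</a>
<a href="/test6"><div id="test6">download</div></a>
</body></html>
"""
tree = etree.fromstring(snippet, etree.HTMLParser())

# text() yields only the text nodes that are direct children of each <a>.
direct = tree.xpath(
    "//a[text()[contains(translate(., 'DOWNLOAD', 'download'), 'download')]]"
)

# Here "." is the <a> itself, so its string value includes text inside the <div>.
whole = tree.xpath(
    "//a[contains(translate(., 'DOWNLOAD', 'download'), 'download')]"
)

direct_hrefs = [a.get("href") for a in direct]
whole_hrefs = [a.get("href") for a in whole]
print(direct_hrefs)  # only the link whose own text matches
print(whole_hrefs)   # both links
```

This only covers the text case; a "download" in an attribute of a nested element (like link 7) is still not caught by the string value.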
[EDIT]
Using the tips in heinst's answer below, I edited the code as shown below. This now works, but it's not very elegant. Does anybody know a solution in pure XPath?
from lxml import etree
tree = etree.fromstring(html, etree.HTMLParser())
downloadElementConditions = "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)
print('FOUND ELEMENTS:', len(elements))
for el in elements:
    href = el.get('href')
    if href:
        print(href, el.text)
    else:
        # No href on the match itself, so walk up to 10 ancestors to find the enclosing link
        elparent = el
        for _ in range(10):
            elparent = elparent.getparent()
            if elparent is None:  # reached the root without finding a link
                break
            href = elparent.get('href')
            if href:
                print(href, elparent.text)
                break
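One possible pure-XPath direction, offered as a sketch rather than a definitive answer: keep the a element as the thing being selected, but widen the union inside the predicate so it also looks at the id and class attributes of descendants (descendant-or-self::*) and at text nodes at any depth (descendant::text()). The names pure_xpath and matches_download are hypothetical, and the test HTML is repeated so the block runs on its own.

```python
from lxml import etree

html = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""
tree = etree.fromstring(html, etree.HTMLParser())

# Predicate applied to every attribute/text node collected in the union below.
matches_download = "[contains(translate(., 'DOWNLOAD', 'download'), 'download')]"

# Select <a> elements; look at their own href, at id/class on the link or any
# nested element, and at text nodes at any depth inside the link.
pure_xpath = (
    "//a[(@href"
    " | descendant-or-self::*/@id"
    " | descendant-or-self::*/@class"
    " | descendant::text())" + matches_download + "]"
)

elements = tree.xpath(pure_xpath)
hrefs = [el.get("href") for el in elements]
print("FOUND ELEMENTS:", len(elements))
for href in hrefs:
    print(href)
```

Because the expression still selects the a elements themselves, no walk up through parents is needed afterwards to recover the href.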