6

I'm using Selenium to do some web scraping and I now want to find all elements that the user can click on and that contain the word "download" (in any capitalization) in either the link text, the button text, the element id, the element class or the href. This can include links, buttons or any other element.

In this answer I found an XPath, written for somebody who wanted to search for buttons based on a certain text (or on non-case-sensitive and partial matches):

text = 'download'
driver.find_elements_by_xpath("//*[contains(text(), 'download')]")

but on this page that returns no results, even though the following link is in there:

<a id="downloadTop" class="navlink" href="javascript:__doPostBack('downloadTop','')">Download</a>

Does anybody know how I can find all elements which somehow contain the word "download" in a website?

[EDIT] This question was marked as a duplicate of a question whose answer suggests changing the expression to "//*[text()[contains(.,'download')]]". So I tried the following:

>>> from selenium import webdriver
>>> d = webdriver.Firefox()
>>> link = 'https://www.yourticketprovider.nl/LiveContent/tickets.aspx?x=492449&y=8687&px=92AD8EAA22C9223FBCA3102EE0AE2899510C03E398A8A08A222AFDACEBFF8BA95D656F01FB04A1437669EC46E93AB5776A33951830BBA97DD94DB1729BF42D76&rand=a17cafc7-26fe-42d9-a61a-894b43a28046&utm_source=PurchaseSuccess&utm_medium=Email&utm_campaign=SystemMails'
>>> d.get(link)
>>> d.find_elements_by_xpath("//*[text()[contains(.,'download')]]")
[]  # As you can see it still doesn't get any results..
>>>

Does anybody know how I can get all elements on which the user can click and which contain the word "download" in either the link text, the button text, the element id, the element class or the href? All tips are welcome!

kramer65

8 Answers

3

Since you need a case-insensitive match and XPath 1.0 has no built-in support for it, you'll have to use the translate() function. Since you also need a partial (wildcard) match, you need contains(). And since you want to check the id, class and href attributes as well as the text, you can build the expression like this:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.yourticketprovider.nl/LiveContent/tickets.aspx?x=492449&y=8687&px=92AD8EAA22C9223FBCA3102EE0AE2899510C03E398A8A08A222AFDACEBFF8BA95D656F01FB04A1437669EC46E93AB5776A33951830BBA97DD94DB1729BF42D76&rand=a17cafc7-26fe-42d9-a61a-894b43a28046&utm_source=PurchaseSuccess&utm_medium=Email&utm_campaign=SystemMails")

condition = "contains(translate(%s, 'DOWNLOAD', 'download'), 'download')"
things_to_check = ["text()", "@class", "@id", "@href"]
conditions = " or ".join(condition % thing for thing in things_to_check)

for elm in driver.find_elements_by_xpath("//*[%s]" % conditions):
    print(elm.text)

Here we are basically constructing the expression via string formatting and concatenation: a case-insensitive check for each of text() and the class, id and href attributes, with the conditions joined by or.
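The same construction can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name `build_contains_xpath` and its parameters are made up for illustration, and it translates the whole ASCII alphabet so that it works for any search word. It assumes the word contains no single quote, because the word is embedded in a quoted XPath literal.

```python
import string


def build_contains_xpath(word, things_to_check=("text()", "@class", "@id", "@href")):
    """Build an XPath 1.0 expression that matches elements where any of the
    given nodes contains `word`, ignoring ASCII case. (Hypothetical helper.)"""
    if "'" in word:
        raise ValueError("word must not contain a single quote")
    # translate() lowercases A-Z; contains() then does the partial match.
    condition = "contains(translate(%s, '%s', '%s'), '%s')"
    conditions = " or ".join(
        condition % (thing, string.ascii_uppercase, string.ascii_lowercase, word.lower())
        for thing in things_to_check
    )
    return "//*[%s]" % conditions


xpath = build_contains_xpath("Download", things_to_check=("text()", "@id"))
print(xpath)
# With a live driver (not run here): driver.find_elements_by_xpath(xpath),
# or driver.find_elements(By.XPATH, xpath) in newer Selenium releases.
```

Building the string separately from the driver call makes it easy to print and inspect the expression before running it against a page.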

alecxe
  • Note that [@Dimitre's answer](http://stackoverflow.com/a/33947276/771848) makes the dynamic XPath construction in this case completely unnecessary (You should probably accept his answer as the most simple and straightforward). – alecxe Nov 28 '15 at 03:31
3

Try this:

//*[(@id|@class|@href|text())
       [contains(translate(.,'DOWNLOAD','download'), 'download')]]

This XPath 1.0 expression selects all elements that have an id, class or href attribute, or a text-node child, whose string value contains the string "download" in any capitalization.

Here is a running proof. The XSLT transformation below is used to evaluate the XPath expression and to copy all selected nodes to the output:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <xsl:copy-of select=
    "//*[(@id|@class|@href|text())
       [contains(translate(.,'DOWNLOAD','download'), 'download')]]
    "/>
  </xsl:template>
</xsl:stylesheet>

When we apply the transformation to the following test-document:

<html>
  <a id="downloadTop" class="navlink" 
    href="javascript:__doPostBack('downloadTop','')">Download</a>
  <b id="y" class="x_downLoad"/>
  <p>Nothing to do_wnLoad</p>
  <a class="m" href="www.DownLoad.com">Get it!</a>
  <b>dOwnlOad</b>
</html>

The wanted elements are selected and then copied to the output:

<a id="downloadTop" class="navlink" href="javascript:__doPostBack('downloadTop','')">Download</a>
<b id="y" class="x_downLoad"/>
<a class="m" href="www.DownLoad.com">Get it!</a>
<b>dOwnlOad</b>
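For a quick sanity check without a browser or an XSLT processor, the same selection rule can be approximated in plain Python. Note that this is a different technique from the XPath above: a sketch using the standard-library html.parser, with a hypothetical class name `DownloadFinder`, that records elements whose id, class or href value, or whose own directly contained text, includes the word in any capitalization.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DownloadFinder(HTMLParser):
    """Collect elements whose id/class/href or own text contains `needle`."""

    def __init__(self, needle="download"):
        super().__init__()
        self.needle = needle.lower()
        self.stack = []    # currently open elements, innermost last
        self.matches = []  # matched elements, in order of first match

    def _record(self, element):
        if not element["matched"]:
            element["matched"] = True
            self.matches.append(element)

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "attrs": dict(attrs), "matched": False}
        for name in ("id", "class", "href"):
            value = element["attrs"].get(name) or ""
            if self.needle in value.lower():
                self._record(element)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text is attributed to the innermost open element only.
        if self.stack and self.needle in data.lower():
            self._record(self.stack[-1])


TEST_DOCUMENT = """<html>
  <a id="downloadTop" class="navlink"
    href="javascript:__doPostBack('downloadTop','')">Download</a>
  <b id="y" class="x_downLoad"/>
  <p>Nothing to do_wnLoad</p>
  <a class="m" href="www.DownLoad.com">Get it!</a>
  <b>dOwnlOad</b>
</html>"""

finder = DownloadFinder()
finder.feed(TEST_DOCUMENT)
finder.close()
for match in finder.matches:
    print(match["tag"], match["attrs"])
```

On the test document above this reports the same four elements as the transformation: the two a elements and the two b elements, but not the p element.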
Dimitre Novatchev
1

Well, the answer you found already tells you how to do what you want. The problem I see is that text = 'download' starts with lower case while the text in <a id="downloadTop" class="navlink" href="javascript:__doPostBack('downloadTop','')">Download</a> starts with upper case.

Start by changing your text to text = 'Download' and see if it finds your element now. If that was the problem then you can use a little trick like

text = 'ownload'

driver.find_elements_by_xpath("(//*[contains(text(), '" + text + "')] | //*[@value='" + text + "'])")

to ignore the first character.

EDIT: Yes you can make it case insensitive.

driver.find_elements_by_xpath("(//*[contains(translate(text(), 'DOWNLOAD', 'download'), 'download')])")
Pablo Miranda
  • The thing is that I want to define it case insensitive. So also elements containing an id="DOWNLOAD" or id="dOwNLoAd" and also containing wildcards, such as id="downloadthisstuff", or id="yourdownloadishere". Any ideas how I can do that? – kramer65 Nov 23 '15 at 13:27
0

You can use the translate function as below. With the full alphabet it makes the comparison case-insensitive for any word:

driver.find_elements_by_xpath("//*[translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = 'download']")

>>> driver.find_elements_by_xpath("//*[translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = 'download']")
[<selenium.webdriver.remote.webelement.WebElement (session="0b07fcba-86ee-3945-a0ae-85619e97ca31", element="{4278753b-8b59-bf45-ae3b-f60f40aed071}")>, <selenium.webdriver.remote.webelement.WebElement (session="0b07fcba-86ee-3945-a0ae-85619e97ca31", element="{8aed425c-063e-7846-915d-d8948219cc12}")>]
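To see what translate() is doing in that expression, its XPath 1.0 behaviour can be mimicked in Python. The helper `xpath_translate` below is a hypothetical stand-in written for illustration, not a Selenium or XPath API: each character found in the second argument is replaced by the character at the same position in the third, and characters with no counterpart are dropped.

```python
def xpath_translate(value, source, target):
    """Mimic XPath 1.0 translate(value, source, target). (Illustrative only.)"""
    mapping = {}
    for index, char in enumerate(source):
        if ord(char) in mapping:
            continue  # the first occurrence of a character in `source` wins
        # None tells str.translate to delete the character.
        mapping[ord(char)] = target[index] if index < len(target) else None
    return value.translate(mapping)


UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

for text in ("Download", "dOwNLoAd", "Download now"):
    lowered = xpath_translate(text, UPPER, LOWER)
    print(repr(text), "equals:", lowered == "download",
          "contains:", "download" in lowered)
```

The `=` comparison only matches when the whole text is exactly the word, so "Download now" fails the equality test but passes the contains test. For the partial matches asked for in the question, wrapping the translate() call in contains() is the closer fit.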
Mesut GUNES
0

If you want a more general XPath and do not want to use the translate function, you can use itertools.product to generate every case variant of the string download and test the node text and attributes against each of them, as below.

from itertools import product
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.yourticketprovider.nl/LiveContent/tickets.aspx?x=492449&y=8687&px=92AD8EAA22C9223FBCA3102EE0AE2899510C03E398A8A08A222AFDACEBFF8BA95D656F01FB04A1437669EC46E93AB5776A33951830BBA97DD94DB1729BF42D76&rand=a17cafc7-26fe-42d9-a61a-894b43a28046&utm_source=PurchaseSuccess&utm_medium=Email&utm_campaign=SystemMails")
txt = 'Download'  # text to be searched
# Generate the case variants of txt:
# pair each letter with its upper- and lower-case form; digits stay as they are
l = [(c, c.lower()) if not c.isdigit() else (c,) for c in txt.upper()]
variants = ["".join(item) for item in product(*l)]  # all case variants of the string
anchors = ["text()", "@class", "@id", "@href"]  # node text and attributes to be searched
# Generate the XPath
xpaths_or = " or ".join(["contains(%s,'%s')" % (i, j) for i in anchors for j in variants])
xpaths = "//*[%s]" % xpaths_or
for download_tag in driver.find_elements_by_xpath(xpaths):
    print(download_tag.text)
driver.quit()

Output-

Download
Download

N.B. The isdigit check avoids changing the case of digits, if there are any.
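One thing to keep in mind with this approach is how quickly the expression grows. The sketch below (the function name `case_variants` is illustrative) counts the variants and the resulting contains() clauses for the word and anchors used above, without starting a browser.

```python
from itertools import product


def case_variants(word):
    """Return every upper/lower-case combination of `word`. (Illustrative.)"""
    # Characters without distinct cases (digits, punctuation) get one option.
    pairs = [(c.upper(), c.lower()) if c.upper() != c.lower() else (c,)
             for c in word]
    return ["".join(chars) for chars in product(*pairs)]


variants = case_variants("Download")
anchors = ["text()", "@class", "@id", "@href"]
clauses = ["contains(%s,'%s')" % (a, v) for a in anchors for v in variants]

# Eight letters with two cases each: 2 ** 8 variants, times four anchors.
print(len(variants), len(clauses))
```

An eight-letter word already produces 256 variants and 1024 clauses, and every extra letter doubles that, which is the main trade-off against the translate()-based expressions.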

Learner
0

but on this page that returns no results, even though the following link is in there:

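It's because the text is different. Look:

Download
download

One letter is in uppercase. So you need a case-insensitive XPath for this. Note that lower-case() is an XPath 2.0 function, and browsers only implement XPath 1.0, so with Selenium use translate() instead:

driver.find_elements_by_xpath("//*[contains(translate(text(), 'DOWNLOAD', 'download'), 'download')]")

That should work well enough for you.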

Andrew_STOP_RU_WAR_IN_UA
-3

When using Selenium to find web elements, it's better to search by ID or class name first, since that is more reliable and easier than XPath. XPath is usually only needed when you can't find your element with those two methods.

In this case the download element on that website has a very clear id attribute.

Try using this instead:

downloadButton = driver.find_element_by_id('downloadTop')

And then you can use this to click it:

downloadButton.click()
Alvaro Bataller
  • The thing is that it is "downloadTop" this time. Since I'm building a scraper however, I want it to be more generic. So I want all elements containing the word "download" case insensitive. So also elements containing an `id="DOWNLOAD"` or `id="dOwNLoAd"` and also containing wildcards, such as `id="downloadthisstuff"`, or `id="yourdownloadishere"`. Any ideas how I can do that? – kramer65 Nov 23 '15 at 13:26
-3

Well, I don't know Selenium very well, but I can suggest a solution that should work. You can first scan the entire page source with a regular expression. For example, if you only need elements that have an attribute whose value contains the substring 'download', use this regexp:

<([a-zA-Z][\w-]*)[^>]*?\s([\w-]+)="([^"]*download[^"]*)"

Then find all matches with the re.finditer function, passing re.IGNORECASE so that any capitalization matches. Every match object contains the tag name (group(1)), the attribute name (group(2)) and the attribute value (group(3)). Only the first matching attribute of each tag is reported:

import re

# wd == webdriver
pattern = r'<([a-zA-Z][\w-]*)[^>]*?\s([\w-]+)="([^"]*download[^"]*)"'

for m in re.finditer(pattern, wd.page_source, re.IGNORECASE):
    tag, attr, val = m.group(1), m.group(2), m.group(3)

Then, inside that loop, you can use wd.find_elements_by_css_selector (or something else) to find the matching tags in Selenium's tree structure. The attribute value has to be quoted in the selector:

    wd.find_elements_by_css_selector('{0}[{1}="{2}"]'.format(tag, attr, val))
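A regex scan like this is easy to try on a literal HTML snippet before pointing it at page_source. The sketch below is only an illustration under stated assumptions: the pattern is one possible variant that expects double-quoted attribute values and reports only the first matching attribute per opening tag, and the sample markup is made up apart from the link quoted in the question.

```python
import re

# Groups: tag name, attribute name, attribute value containing "download".
PATTERN = re.compile(
    r'<([a-zA-Z][\w-]*)[^>]*?\s([\w-]+)="([^"]*download[^"]*)"',
    re.IGNORECASE,
)

sample = (
    '<a id="downloadTop" class="navlink" '
    'href="javascript:__doPostBack(\'downloadTop\',\'\')">Download</a>'
    '<div class="x">no matching attribute here</div>'
    '<b class="x_downLoad"></b>'
)

found = [m.groups() for m in PATTERN.finditer(sample)]
for tag, attr, val in found:
    # Quote the value so that the result is a valid CSS attribute selector.
    print('{0}[{1}="{2}"]'.format(tag, attr, val))
```

Text content such as the link text "Download" is not covered by this pattern, and markup that appears inside scripts or comments can still produce false positives, so it is best treated as a rough pre-filter.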
VadimK
  • 6
    [Aaargh ...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – reinierpost Nov 20 '15 at 17:48
  • Yes, you can't parse HTML and build a tree structure with regular expressions, because HTML is not a regular language. You can't even match an arbitrary number of opening and closing brackets with regular expressions. But I don't want to find closing tags in this case, I just want to find all opening tags in order, and that is totally feasible with regular expressions. – VadimK Nov 20 '15 at 18:06
  • It's feasible as long as none of these tags occur in other text, e.g. in JavaScript. – reinierpost Nov 20 '15 at 18:08