Getting multiple hrefs from an XPath text match

Question

Here is the deal: I have a website from which I want to extract some hrefs, specifically the ones whose link text is "LEIA ESTA EDIÇÃO", as in this HTML:

<a href="http://acervo.estadao.com.br/pagina/#!/20120824-43410-spo-1-pri-a1-not/busca/ministro+Minist%C3%A9rio" title="LEIA ESTA EDIÇÃO" style="" class="" xpath="1">LEIA ESTA EDIÇÃO</a>

This is the code I have. It's pretty wrong; I was running some tests to see if it works. By the way, it has to be Selenium.

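import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()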
x = 1


while True:

    try:

    link = ("http://acervo.estadao.com.br/procura/#!/ministro%3B minist%C3%A9rio|||/Acervo/capa//{}/2000|2010|2010///Primeira").format(x)
    driver.get(link)
    time.sleep(1)
    xpath = "//a[contains(text(),'LEIA ESTA EDIÇÃO')]"
    links = driver.find_elements_by_xpath(xpath)
    bw=('')
    for link in links:
        bw += link._element.get_attribute("href")
        print (bw)  

    x = x + 1

    time.sleep(1)

except NoSuchElementException:
    pass

print(x)
time.sleep(1)

score 3 · Answer 1 · answered Mar 21 '18 at 15:14

You can try the code below to get the required output:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(link)
links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.LINK_TEXT, "LEIA ESTA EDIÇÃO")))
references = [link.get_attribute("href") for link in links]

Andersson


Thank you, this was perfect! – Luís Henrique Martins Mar 21 '18 at 15:44
Just one question: is it possible to generate it in the form of a list? – Luís Henrique Martins Mar 21 '18 at 15:50
`references` is a list, but only for the current page. Do you want a single list of references for all pages? – Andersson Mar 21 '18 at 15:52
Yes, exactly, for all pages, in a txt document, without the commas. – Luís Henrique Martins Mar 21 '18 at 16:40
You can define `references = []` outside the loop, do `references.extend([link.get_attribute("href") for link in links])` on each iteration, and after the loop write `"\n".join(references)` to a text file – Andersson Mar 21 '18 at 16:43
I didn't quite get how to write all the results from the variable into one document. I did get how to put them on separate lines – Luís Henrique Martins Mar 21 '18 at 17:08
`f.write("\n".join(references))`, where `f` is the text file, e.g. `f = open("/some/file.text", "w")` – Andersson Mar 21 '18 at 17:09
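
The comments above describe collecting the hrefs from every page into one list and writing them to a text file, one per line. A minimal sketch of that idea follows. The helper names (`collect_references`, `save_references`, `make_selenium_fetcher`) are hypothetical and not from the thread, and the Selenium part assumes the same explicit wait used in the answer. The Selenium imports sit inside the fetcher so the two plain helpers can be tried without a browser.

```python
from typing import Callable, Iterable, List


def collect_references(page_numbers: Iterable[int],
                       fetch_hrefs: Callable[[int], List[str]]) -> List[str]:
    """Build one flat list of hrefs across all pages."""
    references = []  # defined once, outside the loop
    for page in page_numbers:
        references.extend(fetch_hrefs(page))
    return references


def save_references(references: List[str], path: str) -> None:
    """Write one href per line, with no commas or brackets."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(references))


def make_selenium_fetcher(driver, url_template: str, timeout: int = 10):
    """Return a fetch_hrefs(page) function backed by a Selenium driver.

    url_template is assumed to contain one "{}" for the page number.
    """
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def fetch_hrefs(page: int) -> List[str]:
        driver.get(url_template.format(page))
        try:
            links = WebDriverWait(driver, timeout).until(
                EC.presence_of_all_elements_located(
                    (By.LINK_TEXT, "LEIA ESTA EDIÇÃO")))
        except TimeoutException:
            return []  # no matching link appeared on this page
        return [link.get_attribute("href") for link in links]

    return fetch_hrefs
```

With a real driver this might be used as `save_references(collect_references(range(1, 58), make_selenium_fetcher(driver, url)), "links.txt")`, where the range assumes the 57 pages mentioned by the asker and `url` is the question's URL pattern.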

o-vexler · Answer 2 · 2018-03-21T15:40:59.010

1

I would really recommend reading the Selenium docs; the explanations there are easy and straightforward.

There are some places where your code can be improved:

1. You really do not need the `while True`. Just think about it: once you have extracted all of the links, you are done.
2. The try/except is not correctly indented.
3. You should get a list of links and extract the hrefs out of them. A simple one-liner (if there is at least one `a` tag with that text) can be:
```
[a_tag.get_attribute('href') for a_tag in driver.find_elements_by_link_text("LEIA ESTA EDIÇÃO")]
```
4. The `bw`: it will become one concatenated string of all of the hrefs. I am pretty sure that is not what you are looking for, but rather a list or other data structure. I would recommend reading this answer about string concatenation in Python.
5. Overall, it seems like you can improve your Python. I would really recommend getting more comfortable with the language and flow before jumping into Selenium :)
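
To illustrate the point about `bw`, here is a small sketch comparing string concatenation with a list. The two URLs are placeholders made up for the example, not links from the site.

```python
# Placeholder hrefs, standing in for what get_attribute("href") would return.
hrefs = ["http://example.com/a", "http://example.com/b"]

# What the question's loop does: every href is glued onto one string,
# so the individual links can no longer be told apart.
bw = ""
for href in hrefs:
    bw += href

# Keeping a list preserves each link as a separate item.
references = []
for href in hrefs:
    references.append(href)

# A separator can still be added later, only when one string is needed.
joined = "\n".join(references)

print(bw)
print(references)
print(joined)
```

If a single string per page is really what is wanted (for example one spreadsheet cell), joining the list with an explicit separator keeps the links recoverable.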

edited Mar 21 '18 at 15:40

answered Mar 21 '18 at 15:26

o-vexler


There are multiple pages, so I need the `while True`; there are exactly 57 pages of content. That's why there is the `x = x + 1`. The `bw` is there because I'll put each batch of links in one line of Excel, so it's good to have it. – Luís Henrique Martins Mar 21 '18 at 15:37
WebElement has no attribute `get()` – Andersson Mar 21 '18 at 15:39
Fixed to get_attribute, thank you. About the while True: it sounds like you should iterate over the list of your links. Your code didn't show that. – o-vexler Mar 21 '18 at 15:41
@LuísHenriqueMartins, you can get the total number of pages with `page_number = driver.find_element_by_class_name("page-ultima-qtd").text` and iterate with a for loop, `for i in range(int(page_number)):`, instead of the `while` loop. Note that your `while` loop doesn't have a `break`, so the loop will be endless – Andersson Mar 21 '18 at 15:44
I know about this problem, and thanks for the solution. But this code is only for getting the links; I don't mind if it just raises an error after it gets all the links. After that I'll have to pass the links through an OCR program for the rest of the code, so it's okay if it just breaks at the end. – Luís Henrique Martins Mar 21 '18 at 15:46
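
The suggestion above, reading the page count once and iterating with a `for` loop, could look roughly like the sketch below. The helper names are hypothetical. It assumes pages are numbered from 1 (as `x = 1` in the question), that the `page-ultima-qtd` element's text is a plain number, and it uses the Selenium 4 locator style (`find_element(By..., ...)`) rather than the older `find_element_by_*` methods shown in the thread. `scrape_all_pages` needs a live browser and is only defined here, not called.

```python
# URL pattern copied from the question; {} is the page number.
URL_TEMPLATE = (
    "http://acervo.estadao.com.br/procura/#!/ministro%3B minist%C3%A9rio"
    "|||/Acervo/capa//{}/2000|2010|2010///Primeira"
)


def parse_page_count(text):
    """Turn the page counter text (assumed to look like "57") into an int."""
    return int(text.strip())


def page_urls(page_count, first_page=1):
    """Yield one URL per page, a bounded replacement for `while True`."""
    for page in range(first_page, first_page + page_count):
        yield URL_TEMPLATE.format(page)


def scrape_all_pages(driver):
    """Hypothetical driver loop: visit every page and gather the hrefs."""
    from selenium.webdriver.common.by import By

    driver.get(URL_TEMPLATE.format(1))
    # Assumption: the counter is already rendered; an explicit wait
    # may be needed on a JavaScript-heavy page.
    counter = driver.find_element(By.CLASS_NAME, "page-ultima-qtd")
    references = []
    for url in page_urls(parse_page_count(counter.text)):
        driver.get(url)
        links = driver.find_elements(By.LINK_TEXT, "LEIA ESTA EDIÇÃO")
        references.extend(link.get_attribute("href") for link in links)
    return references
```

Because the loop is bounded by the page count, it ends on its own after the last page instead of relying on an error to stop it.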
