
Here is the link to the website I want to extract data from. I'm trying to get the text of every anchor (`<a>`) element that has an `href` attribute. Here is the sample HTML:

<div id="borderForGrid" class="border">
  <h5 class="">
    <a href="/products/product-details/?prod=30AD">A/D TC-55 SEALER</a>
  </h5>
</div>

<div id="borderForGrid" class="border">
  <h5 class="">
    <a href="/products/product-details/?prod=P380">Carbocrylic 3356-1</a>
  </h5>
</div>

I want to extract all text values like ['A/D TC-55 SEALER','Carbocrylic 3356-1'].
I tried with:

target = driver.find_element_by_class_name('border')
anchorElement = target.find_element_by_tag_name('a')
anchorElement.text

but it gives an empty string ('').

Any suggestions on how this can be achieved?

PS - Select the first value of the radio button under PRODUCT TYPE.

  • Your code works for me, it returned the first value `A/D FIREFILM III`. By the way, to get all the values you need to use `find_elements_by_class_name` and iterate over the result list. – Guy Jun 04 '19 at 07:57
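A minimal sketch of what that comment suggests, assuming the same pre-Selenium-4 `find_element_by_*` API used in the question and that `driver` has already loaded http://www.carboline.com/products/:

titles = []
for card in driver.find_elements_by_class_name('border'):  # every matching div, not just the first
    anchor = card.find_element_by_tag_name('a')
    if anchor.text:  # skip cards whose anchor text is not rendered
        titles.append(anchor.text)
print(titles)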

3 Answers


Looks like when the website is first loaded, all products are loaded as well. The pagination at the bottom does not actually switch to different pages, so you can extract all the products from the very first request to http://www.carboline.com/products/. I used Python requests to fetch the website's HTML and lxml.html to parse it.

I would stay away from Selenium etc. if possible (sometimes you have no choice), but if the website is as simple as the one in your question, I would recommend just making a request. That avoids the overhead of running a full browser, because you only request what you need.

I updated my answer to also show how you can extract the href and the text at the same time.

import requests
from lxml import html

BASE_URL = 'http://www.carboline.com'

def extract_data(tree):
    # Grab every anchor under div.border > h5 that actually has text
    elements = [
        e
        for e in tree.cssselect('div.border h5 a')
        if e.text is not None
    ]
    return elements

def build_data(data):
    # Build a list of dicts holding the absolute link and the anchor text
    dataset = []

    for d in data:
        link = BASE_URL + d.get('href')
        title = d.text

        dataset.append(
            {
                'link': link,
                'title': title
            }
        )

    return dataset

def request_website(url):
    # Fetch the raw HTML, sending a browser-like User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    return r.text

response = request_website('http://www.carboline.com/products/')
tree = html.fromstring(response)
data = extract_data(tree)
dataset = build_data(data)
print(dataset)
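If you only want the text values the question asks for (e.g. ['A/D TC-55 SEALER', 'Carbocrylic 3356-1']), a quick sketch pulling them out of the dataset built above:

titles = [item['title'] for item in dataset]  # just the anchor text values
print(titles)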
antfuentes87

If you need all link values you should be using `find_elements_...` functions, not `find_element_...` functions, as the latter returns only the first match.

Recommended update for your code:

driver.get("http://www.carboline.com/products/")
for link in driver.find_elements_by_xpath("//ul[@id='productList']/descendant::*/a"):
    if link.is_displayed():
        print(link.text)


Dmitri T
  • Your solution gives the text visible on the web page, but I need the text that is in the HTML.. please refer to the expected output. – Andre_k Jun 04 '19 at 09:22

To extract all the text values within the <a> tags, e.g. ['A/D TC-55 SEALER','Carbocrylic 3356-1'], you have to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following solutions:

  • Using CSS_SELECTOR:

    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.topLevel[data-types='Acrylics'] h5>a[href^='/products/product-details/?prod=']")))])
    
  • Using XPATH:

    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//li[@class='topLevel' and @data-types='Acrylics']//h5[@class]/a[starts-with(@href, '/products/product-details/?prod=')]")))])
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
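Putting the pieces together, a minimal end-to-end sketch (assuming a local ChromeDriver on PATH and the same CSS selector shown above):

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
    try:
        driver.get("http://www.carboline.com/products/")
        # Wait up to 5 seconds for all matching anchors to become visible
        elements = WebDriverWait(driver, 5).until(
            EC.visibility_of_all_elements_located((By.CSS_SELECTOR,
                "li.topLevel[data-types='Acrylics'] h5>a[href^='/products/product-details/?prod=']")))
        print([el.get_attribute("innerHTML") for el in elements])
    finally:
        driver.quit()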
undetected Selenium
  • When I use your solution of using XPath for another **PRODUCT TYPE**, i.e. `Alkyds`, why does it give a `TimeoutException`? The code I tried ---> `print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//li[@class='topLevel' and @data-types='Alkyds']//h5[@class]/a[starts-with(@href, '/products/product-details/?prod=')]")))])` – Andre_k Jun 04 '19 at 09:47
  • I am not sure about the other **PRODUCT TYPE**, i.e. `Alkyds`. Can you raise a new question for your new requirement please? – undetected Selenium Jun 04 '19 at 09:48
  • Added a new question: **https://stackoverflow.com/questions/56441689/unable-to-extract-all-href-text-python-selenium** – Andre_k Jun 04 '19 at 10:05
  • I'm still unable to solve the problem mentioned in the new post. Any suggestions on that? – Andre_k Jun 04 '19 at 11:32