
I am writing a web scraper that goes through a list of links from a CSV file and scrapes details from each of them. However, I'm having trouble pointing to the element that holds the email address I am trying to scrape. If you look at [https://reality.idnes.cz/rk/detail/m-m-reality-holding-a-s/5a85b582a26e3a321d4f2700/] you can see that there is a company name, address, phone number and an email. The email is the element I'm having a problem with. If you look into the code of the website you will quickly notice that both the phone number and the email anchors share the same class, "item-icon". In my code below you can see that I tried to narrow the selector down with nth-of-type, which for some reason didn't work either. The result doesn't get printed or written to the CSV file, so the element is evidently not found. This is the code I'm having trouble with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.chrome.options import Options
import time
import csv

with open('links.csv') as read:
    reader = csv.reader(read)
    link_list = list(reader)
    with open('ScrapedContent.csv', 'w+', newline='') as write:
        writer = csv.writer(write)
        options = Options()
        options.add_argument('--no-sandbox')
        path = "/home/kali/Desktop/SRealityContentScraper/chromedriver"
        driver = webdriver.Chrome(path)
        wait = WebDriverWait(driver, 10)
        for link in link_list:
            driver.get(', '.join(link))  # each row read from links.csv is a one-element list holding the URL
            time.sleep(2)
            information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "h1.b-annot__title.mb-5")))
            title = driver.find_element_by_css_selector("h1.b-annot__title.mb-5")
            information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
            offers = driver.find_element_by_css_selector("span.btn__text")
            information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "p.font-sm")))
            addresses = driver.find_element_by_css_selector("p.font-sm")
            # phone number: this selector works as expected
            try:
                information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "a.item-icon.measuring-data-layer")))
                phone_number = driver.find_element_by_css_selector("a.item-icon.measuring-data-layer")
            except Exception:
                pass
            # email: this is the selector that is never found
            try:
                information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.items:nth-of-type(2) span.items__item a.item-icon")))
                email = driver.find_element_by_css_selector("span.items:nth-of-type(2) span.items__item a.item-icon")
            except Exception:
                pass
            # fall back to " " when the element was missing on this page
            try:
                phone_number = phone_number.text
            except Exception:
                phone_number = " "
                pass
            try:
                email = email.text
            except Exception:
                email = " "
                pass
            print(title.text, " ", offers.text, " ", addresses.text, " ", phone_number, " ", email)
            writer.writerow([title.text, offers.text, addresses.text, phone_number, email])

        driver.quit()

The reason why there are try blocks in the code is that some of the pages in the link list are missing either the email or the phone number. So I made it so that if that happens, the missing value is filled with a " " placeholder string. However, the information doesn't get printed even when it is present on the page, which leads me to believe that the element isn't being found properly. I removed the try blocks to test the output and Selenium indeed confirms that the element could not be found. Without the nth-of-type part the scraper scrapes the phone number twice instead of a phone number and an email. To my understanding this is because Selenium always returns the first element on the page matching the CSS selector, which will always be the phone number.
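
To illustrate, here is a quick sketch of the behaviour I'm describing (this assumes the driver is already on one of the detail pages, and that both anchors share the item-icon class as mentioned above):

# find_element (singular) returns only the FIRST node matching the selector,
# which on these detail pages is the phone-number anchor, so the email never comes back
first = driver.find_element_by_css_selector("a.item-icon")
print(first.text)  # phone number

# find_elements (plural) returns every match, which shows that both anchors are present
for anchor in driver.find_elements_by_css_selector("a.item-icon"):
    print(anchor.text)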

My question is: how can I properly point to the element so that the emails get scraped? Thanks for any help with this! I am starting to feel hopeless...


2 Answers


I will try to help you; my solution uses an XPath as the selector:

How it works -> "//a[./span[contains(@class, 'icon icon--email')]]"

//a -> any a element

[./span] -> that has a child span

[contains(@class, 'icon icon--email')] -> whose class attribute contains that string.

    xpath_phone = "//a[./span[contains(@class, 'icon icon--phone')]]"

    xpath_email = "//a[./span[contains(@class, 'icon icon--email')]]"

    # example for email
    try:
        information_list = wait.until(ec.presence_of_element_located((By.XPATH, xpath_email)))
        email = driver.find_element_by_xpath(xpath_email)
    except Exception:
        pass
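
For completeness, here is a rough sketch of how both XPaths could slot into your existing loop, keeping your " " fallback for pages where one of the elements is missing (wait, driver, writer and the other variables are the objects from your own script):

    # defaults for pages that are missing one of the elements
    phone_number, email = " ", " "
    try:
        wait.until(ec.presence_of_element_located((By.XPATH, xpath_phone)))
        phone_number = driver.find_element_by_xpath(xpath_phone).text
    except Exception:
        pass
    try:
        wait.until(ec.presence_of_element_located((By.XPATH, xpath_email)))
        email = driver.find_element_by_xpath(xpath_email).text
    except Exception:
        pass
    writer.writerow([title.text, offers.text, addresses.text, phone_number, email])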
  • This solution worked! I know of XPath, but you seem to be using a more advanced form of it. And a special thank you for explaining how the XPath works in my case! – 541daw35d Nov 19 '20 at 11:29

To print the email address, i.e. agorniak@mmreality.cz, you have to induce WebDriverWait for visibility_of_element_located() and you can use either of the following Locator Strategies (a combined sketch follows at the end of this answer):

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.items__item>a[href*='@']"))).text)
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@class='items__item']/a[contains(@href, '@')]"))).text)
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
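
As an illustration only, here is a minimal sketch of how this wait could replace the email lookup in your loop, with a fallback for pages that publish no email address (driver, writer, title, offers, addresses and phone_number come from your script; the 20-second timeout is just the value used above, and TimeoutException is what until() raises when the wait times out):

    from selenium.common.exceptions import TimeoutException

    try:
        # grab the anchor whose href contains '@', i.e. the email link
        email = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.items__item>a[href*='@']"))).text
    except TimeoutException:
        email = " "   # no email published for this listing
    writer.writerow([title.text, offers.text, addresses.text, phone_number, email])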
