Hello, I am trying to scrape some data from a website that keeps its data in dl tags. Here is what the page structure looks like:

<div class="ThatsThem-record-overview col-md-5">
<h2><span itemprop="name">Donald Duck</span></h2>
<dl class="row">
<dt class="col-md-4">Email</dt>
<dd class="col-md-8">myemail.com</dd>
</dl>
</div>
<div class="ThatsThem-record-overview col-md-5">
<h2><span itemprop="name">Mickey mouse</span></h2>
<dl class="row">
<dt class="col-md-4">Email</dt>
<dd class="col-md-8">youremail.com</dd>
</dl>
</div>
... the data continues in the same pattern, with different values

To scrape this I am using Selenium. My scraping code:

for element in driver.find_elements_by_class_name('ThatsThem-record-overview'):  # here I'm scraping the name
    #print(Style.RESET_ALL)
    print(Fore.RED + element.text + Style.RESET_ALL)
    #print(Style.RESET_ALL)
    time.sleep(1)
    dl = driver.find_element_by_tag_name('dl')  # scraping data under the dl tag
    print(dl.text)
    print('-----------------------')  # separator

The problem is that whenever I run the program, it prints the same dl content for every name, like this:

donald duck
Email
myemail.com
-------------
mickey mouse
Email
myemail.com

I have already tried iterating over the dl elements in a for loop, the same way I do for the names, but that also prints other things that I don't want.

What can I do?

Muzzamil

2 Answers

driver.find_element_by_tag_name('dl') searches the whole page, so it will always return the first matching element. You need to call it on element instead, so that each <dl> is looked up inside its own record:

for element in driver.find_elements_by_class_name('ThatsThem-record-overview'):
    dl = element.find_element_by_tag_name('dl') # scraping data under dl tag 
    print(dl.text)

Or just locate those elements directly with a CSS selector:

for element in driver.find_elements_by_css_selector('.ThatsThem-record-overview dl'):
    print(element.text)
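
The difference between a page-wide and a scoped lookup can be reproduced without a browser. The sketch below is not Selenium code: it assumes a cleaned-up, well-formed version of the markup from the question and uses the standard library's xml.etree.ElementTree, whose find() likewise returns the first match under the node it is called on. The names SAMPLE, global_lookup and scoped_lookup are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the markup shown in the question.
SAMPLE = """
<body>
  <div class="ThatsThem-record-overview col-md-5">
    <h2><span itemprop="name">Donald Duck</span></h2>
    <dl class="row">
      <dt class="col-md-4">Email</dt>
      <dd class="col-md-8">myemail.com</dd>
    </dl>
  </div>
  <div class="ThatsThem-record-overview col-md-5">
    <h2><span itemprop="name">Mickey mouse</span></h2>
    <dl class="row">
      <dt class="col-md-4">Email</dt>
      <dd class="col-md-8">youremail.com</dd>
    </dl>
  </div>
</body>
""".strip()

root = ET.fromstring(SAMPLE)
records = root.findall("div")


def global_lookup():
    # Searching from the root on every iteration always hits the first <dl>.
    return [root.find(".//dl/dd").text for _ in records]


def scoped_lookup():
    # Searching from each record finds the <dl> that belongs to that record.
    return [record.find("dl/dd").text for record in records]


print(global_lookup())  # ['myemail.com', 'myemail.com']
print(scoped_lookup())  # ['myemail.com', 'youremail.com']
```

In Selenium the same distinction holds between calling the lookup on driver (whole page) and calling it on element (that record only).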
Guy

It seems you were close. Locating elements by the record-overview class should have fetched all the required data. However, it would be better to target the individual name and email by traversing to the child tags. Additionally, using WebDriverWait instead of a fixed time.sleep() makes the script more reliable, because it waits only until the elements are actually visible (up to the timeout).

Ideally, use WebDriverWait with visibility_of_all_elements_located(), together with either of the following locator strategies:

  • Using CSS_SELECTOR:

    # [class*='record-overview'] is a substring match, like contains(@class, ...) in XPath
    names = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[class*='record-overview']>h2>span")))]
    emails = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[class*='record-overview'] dl.row dd")))]
    for name, email in zip(names, emails):
        print("{} Email is {}".format(name, email))
    
  • Using XPATH:

    names = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'record-overview')]/h2/span")))]
    emails = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'record-overview')]//dl[@class='row']//dd")))]
    for name, email in zip(names, emails):
        print("{} Email is {}".format(name, email))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
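
If a record has several <dd> entries, or none, two separately collected lists can drift out of step once they are zipped. A minimal sketch of a per-record variant follows, assuming Selenium 4's find_element(by, value) / find_elements(by, value) methods and a class name containing record-overview, as in the XPath above. The function name scrape_records is hypothetical, and the two locator strings are the values that By.CSS_SELECTOR and By.TAG_NAME stand for, written out here so the snippet needs no imports.

```python
# String values of the selenium.webdriver.common.by.By constants used below.
CSS_SELECTOR = "css selector"  # By.CSS_SELECTOR
TAG_NAME = "tag name"          # By.TAG_NAME


def scrape_records(driver):
    """Return one (name, [dd texts]) tuple per record block on the page."""
    results = []
    # Substring match on the class attribute, like contains(@class, ...) in XPath.
    for record in driver.find_elements(CSS_SELECTOR, "div[class*='record-overview']"):
        name = record.find_element(CSS_SELECTOR, "h2 > span").text
        # Scoped to this record, so values cannot leak between records.
        values = [dd.text for dd in record.find_elements(TAG_NAME, "dd")]
        results.append((name, values))
    return results
```

With a live driver this could be used as `for name, values in scrape_records(driver): print(name, values)`. On pages that load records late, waiting with WebDriverWait before calling it is probably still a good idea.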
    
undetected Selenium