
I am trying to extract the search results count from an IEEE Xplore search, given the search results URL, using Selenium WebDriver. I'm not getting any errors from the code below, but I am unsure how to proceed from here.

Website Element of Interest: (screenshot)

Element Inspection Results: (screenshot)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located

url = 'https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=web%20scraping'
chrome_driver_path = r'\\xxxx\chromedriver.exe'
driver = webdriver.Chrome(chrome_driver_path)
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(presence_of_element_located((By.CLASS_NAME, "strong")))
#result = driver.??????
print(result)
driver.close()
  • Honestly, it would be better to try their API endpoint first to get this data. You can open devtools in your browser, go to the Network tab, and find a POST request to the `https://ieeexplore.ieee.org/rest/search` endpoint (see the sketch after these comments). – dukkee Jan 25 '21 at 18:14
  • @dukkee Thank you for the response. I've considered this, but this is in part a learning exercise in scraping for me, and since it's for personal use, I don't have a website or company affiliation for their API application form. – ArchMorlock Jan 26 '21 at 19:50
  • What do you mean by "their API application form"? Using the internal API is just as much scraping as scraping the layout. – dukkee Jan 26 '21 at 20:12
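
For reference, here is a minimal sketch of dukkee's suggestion using the `requests` library. The payload keys and the `totalRecords` response field are assumptions modeled on what the browser's Network tab shows for this endpoint; verify them in devtools before relying on this:

    import requests

    # NOTE: the payload shape, headers, and the "totalRecords" response field
    # are assumptions taken from inspecting the request in the browser's
    # Network tab; confirm them in devtools.
    response = requests.post(
        "https://ieeexplore.ieee.org/rest/search",
        json={"queryText": "web scraping", "newsearch": True},
        headers={
            "Referer": "https://ieeexplore.ieee.org/search/searchresult.jsp",
            "User-Agent": "Mozilla/5.0",  # the endpoint may reject non-browser clients
        },
    )
    print(response.json().get("totalRecords"))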

2 Answers


To print the number of search results, i.e. 184, you can use either of the following Locator Strategies:

  • Using css_selector and get_attribute("innerHTML"):

    driver.get("https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=web%20scraping")
    print(driver.find_element(By.CSS_SELECTOR, "div.Dashboard-header span span:nth-of-type(2)").get_attribute("innerHTML"))
    
  • Using xpath and text attribute:

    driver.get("https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=web%20scraping")
    print(driver.find_element(By.XPATH, "//div[contains(@class, 'Dashboard-header')]//span//following::span[2]").text)
    

Ideally, you should induce WebDriverWait for visibility_of_element_located() and use either of the following Locator Strategies:

  • Using CSS_SELECTOR and text attribute:

    driver.get("https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=web%20scraping")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.Dashboard-header span span:nth-of-type(2)"))).text)
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get("https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=web%20scraping")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'Dashboard-header')]//span//following::span[2]"))).get_attribute("innerHTML"))
    
  • Console Output:

    184
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python


undetected Selenium

As dukkee mentioned, check the API, but to answer your question, you can select it like this:

soup.select('div.Dashboard-header.col-12 > span span')[1].get_text()

Locate a parent div with a unique class and then go down to the span.

Example

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=web%20scraping'
driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
time.sleep(3)  # crude wait for the JavaScript-rendered results to load

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('div.Dashboard-header.col-12 > span span')[1].get_text())

driver.quit()
HedgeHog
  • Thanks for the process explanation. The example worked perfectly for me, though I am sometimes getting an "IndexError: list index out of range" error. It seems that the page doesn't always load, which causes this error. If I try the same search a bit later, it's working fine. – ArchMorlock Jan 26 '21 at 19:59
  • You can handle this even better with [Waits](https://selenium-python.readthedocs.io/waits.html#explicit-waits) instead of `sleep`, because you can define a wait for a certain condition, e.g. the element is loaded, and then proceed in the code (see the sketch after this comment). – HedgeHog Jan 26 '21 at 20:09
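
A minimal sketch of HedgeHog's suggestion, swapping `time.sleep` for an explicit wait. The selector is the one from the answer above; the chromedriver path and the 20-second timeout are placeholder choices:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup

    url = 'https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=web%20scraping'
    driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')  # placeholder path
    driver.get(url)

    # Wait until the results-count header is in the DOM instead of sleeping
    # for a fixed interval; this avoids the intermittent IndexError.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.Dashboard-header.col-12 > span span'))
    )

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.select('div.Dashboard-header.col-12 > span span')[1].get_text())

    driver.quit()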