I am trying to get the product name and the quantity from a nested HTML table using Selenium. My problem is that some of the divs don't have an id or class name. The table I am trying to access is the Critical Product list. Here is what I have so far, but I am lost on how to reach the nested divs. The site URL is in the code.

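import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()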
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)
page = driver.page_source
driver.quit()


html_soup = BeautifulSoup(page, 'html.parser')
item_containers = html_soup.find_all('div', class_='critical-products-title hide-mobile')

if item_containers:
    for item in item_containers:
        # need to loop the inner divs to reach the href, then get to the left and right classes to get title and quantity
        for link in item.find_all('a'):
            print(link)

Here is the image from the inspection. I want to be able to loop through all the divs and get the title and quantity.

[Image: screenshot from the browser inspector]

undetected Selenium
Saad

4 Answers

1

You don't need Beautiful Soup, nor do you need to save page_source. I used a CSS selector to select all the target rows in the table, then applied a list comprehension to pick out the left and right sides of each row. The result is a list of tuples.

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)

elements = driver.find_elements_by_css_selector('#app > div:nth-child(1) > div.header-wrapper > div.header-right > div.critical-product-table-container > div.table.shorten.hide-mobile > div > div > div > a > div')

targetted_values = [(element.find_element_by_css_selector('.line-item-left').text, element.find_element_by_css_selector('.line-item-right').text) for element in elements]

driver.quit()

Example output of targetted_values:

[('Surgical & Reusable Masks', '376,713,363 available'),
('Disposable Gloves', '66,962,093 available'),
('Gowns and Coveralls', '40,502,145 available'),
('Respirators', '22,189,273 available'),
('Surface Wipes', '20,650,831 available'),
('Face Shields', '16,535,686 available'),
('Hand Sanitizer', '11,152,890 L available'),
('Thermometers', '8,457,993 available'),
('Testing Kits', '2,110,815 available'),
('Surface Solutions', '107,452 L available'),
('Protective Barriers', '10,833 available'),
('Ventilators', '410 available')]
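
If you want numbers rather than display strings, the right-hand values can be post-processed. The following is only a sketch, assuming every quantity looks like the sample output above (digits with thousands separators, an optional unit such as L, then the word "available"). The names QUANTITY_RE and parse_quantity are hypothetical helpers, not part of Selenium.

```python
import re

# Matches strings such as '11,152,890 L available' or '410 available'.
# The optional unit ('L' in the sample output) is captured separately.
QUANTITY_RE = re.compile(r'^\s*(\d[\d,]*)(?:\s+([A-Za-z]+))?\s+available\s*$')


def parse_quantity(text):
    """Return (amount, unit) for a scraped quantity string; unit may be None."""
    match = QUANTITY_RE.match(text)
    if match is None:
        raise ValueError(f'unrecognised quantity string: {text!r}')
    amount = int(match.group(1).replace(',', ''))
    return amount, match.group(2)


# Two rows copied from the example output above.
rows = [('Hand Sanitizer', '11,152,890 L available'),
        ('Ventilators', '410 available')]
for name, quantity in rows:
    print(name, parse_quantity(quantity))
# Hand Sanitizer (11152890, 'L')
# Ventilators (410, None)
```

Raising on an unexpected format is a deliberate choice here: if the site changes its wording, you find out immediately instead of storing wrong numbers.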
1

To print the product name and the quantity, wait for the elements with WebDriverWait and visibility_of_all_elements_located(). You can use either of the following locator strategies:

  • Using CSS_SELECTOR and text attribute:

    driver.get('https://www.rrpcanada.org/#/')
    items =  [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-title")))]
    quantities =  [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-bold.available")))]
    for i,j in zip(items,quantities):
      print(i, j)
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get('https://www.rrpcanada.org/#/')
    items =  [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-title']")))]
    quantities =  [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-bold available']")))]
    for i,j in zip(items,quantities):
      print(i, j)
    
  • Console Output:

    Surgical & Reusable Masks  376,713,363 available
    Disposable Gloves  66,962,093 available
    Gowns and Coveralls  40,502,145 available
    Respirators  22,189,273 available
    Surface Wipes  20,650,831 available
    Face Shields  16,535,686 available
    Hand Sanitizer  11,152,890 L available
    Thermometers  8,457,993 available
    Testing Kits  2,110,815 available
    Surface Solutions  107,452 L available
    Protective Barriers  10,833 available
    Ventilators  410 available
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
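
For readers wondering why an explicit wait beats a fixed time.sleep(150): the wait returns as soon as the condition holds and only gives up after the timeout. Below is a rough illustration of that polling idea in plain Python. It is not Selenium's actual implementation, and wait_until is a hypothetical helper written just for this sketch.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Stand-in for "the table has rendered": succeeds on the third check.
checks = []

def table_ready():
    checks.append(1)
    return len(checks) >= 3

wait_until(table_ready, timeout=5, poll_interval=0.01)
print(len(checks))  # 3
```

With a fixed sleep you always pay the full delay, and you still fail if the page happens to take longer than you guessed.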


undetected Selenium
0

Find the elements with class="line-item-left" for the name of each item and the elements with class="line-item-right" for the number of available items. The snippet below locates them by class name rather than by XPath.

# Selenium 4.3+ removed these helpers; use driver.find_elements(By.CLASS_NAME, ...) there.
driver.find_elements_by_class_name("line-item-left")   # item names
driver.find_elements_by_class_name("line-item-right")  # number of items available

Note the 's' in find_elements: it returns a list of all matching elements rather than only the first one.

TheLegend42
0

This is the selector for product name:

div.critical-product-table-container div.line-item-left

And for total:

div.critical-product-table-container div.line-item-right

The approach below does not use BeautifulSoup.

time.sleep(...) is bad practice; use WebDriverWait instead.

To pair the two result lists and loop over them in parallel, I use the zip() function:

url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
wait = WebDriverWait(driver, 150)
product_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-left')))
totals = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-right')))

for product_name, total in zip(product_names, totals):
    print(product_name.text + '--' + total.text)
    
driver.quit()

You need the following imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
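
One caveat with pairing two separately located lists: zip() stops at the shorter list without complaint, so if one selector ever matches more elements than the other, the extra items are silently dropped. A small guard, sketched below with a hypothetical helper named pair_columns, makes that failure visible. It takes plain lists, so you would pass it the .text values of the located elements.

```python
def pair_columns(names, quantities):
    """Pair two parallel lists, refusing to silently drop unmatched items."""
    if len(names) != len(quantities):
        raise ValueError(
            f'{len(names)} names but {len(quantities)} quantities'
        )
    return list(zip(names, quantities))


# Sample values taken from the output shown in the other answers.
names = ['Respirators', 'Ventilators']
quantities = ['22,189,273 available', '410 available']
for name, quantity in pair_columns(names, quantities):
    print(name + '--' + quantity)
# Respirators--22,189,273 available
# Ventilators--410 available
```

On Python 3.10 or newer, zip(names, quantities, strict=True) performs the same length check for you. Locating each row first and reading both cells from that row avoids the problem altogether.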
frianH