Selenium Web scraping nested divs with no ids or class names

Question

I am trying to get the product name and the quantity from a nested HTML table using Selenium. My problem is that some of the divs don't have any id or class names. The table I am trying to access is the Critical Product list. Here is what I have tried so far, but I am lost on how to reach the nested divs. The site URL is in the code.

options = Options()
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)
page = driver.page_source
driver.quit()


html_soup = BeautifulSoup(page, 'html.parser')
item_containers = html_soup.find_all('div', class_='critical-products-title hide-mobile')

if item_containers:
    for item in item_containers:
        # Need to loop over the inner divs to reach the href, then read the
        # left and right classes to get the title and quantity.
        for link in item.find_all('a'):
            print(item)

Here is the image from the inspection. I want to be able to loop through all the divs and get the title and quantity.

score 1 · Answer 1 · answered Sep 02 '20 at 04:26

You don't need BeautifulSoup, nor do you need to save the page_source. I used a CSS selector to select all the target rows in the table, then a list comprehension to read the left and right sides of each row. The results are stored in a list of tuples.

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('start-maximized')

driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver/')
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
time.sleep(150)

elements = driver.find_elements_by_css_selector('#app > div:nth-child(1) > div.header-wrapper > div.header-right > div.critical-product-table-container > div.table.shorten.hide-mobile > div > div > div > a > div')

targetted_values = [(element.find_element_by_css_selector('.line-item-left').text, element.find_element_by_css_selector('.line-item-right').text) for element in elements]

driver.quit()

Example output of targetted_values:

[('Surgical & Reusable Masks', '376,713,363 available'),
('Disposable Gloves', '66,962,093 available'),
('Gowns and Coveralls', '40,502,145 available'),
('Respirators', '22,189,273 available'),
('Surface Wipes', '20,650,831 available'),
('Face Shields', '16,535,686 available'),
('Hand Sanitizer', '11,152,890 L available'),
('Thermometers', '8,457,993 available'),
('Testing Kits', '2,110,815 available'),
('Surface Solutions', '107,452 L available'),
('Protective Barriers', '10,833 available'),
('Ventilators', '410 available')]
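
The quantities above are plain strings with thousands separators, an optional unit such as L, and a trailing "available". If numeric values are needed, they could be parsed along these lines. This is only a minimal sketch: parse_quantity is a hypothetical helper name, and it assumes every string follows the format shown in the example output.

```python
def parse_quantity(text):
    """Split a string like '11,152,890 L available' into (amount, unit).

    The unit is an empty string when the site shows none.
    """
    parts = text.split()
    if len(parts) < 2 or parts[-1] != "available":
        raise ValueError(f"unrecognised quantity: {text!r}")
    # Drop the thousands separators before converting to an integer.
    amount = int(parts[0].replace(",", ""))
    # Anything between the number and 'available' is treated as the unit.
    unit = " ".join(parts[1:-1])
    return amount, unit


# Two rows copied from the example output above.
rows = [
    ("Hand Sanitizer", "11,152,890 L available"),
    ("Ventilators", "410 available"),
]
for name, quantity in rows:
    print(name, parse_quantity(quantity))
```

Raising on unexpected input, rather than returning a default, makes a change in the site's wording visible immediately.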

How are you getting the output? When I put print statement, I don't get anything in return — Saad, Sep 02 '20 at 18:03
@Saad Everything is stored in targetted_values. Try print(targetted_values) — Johnathan Irvin, Sep 02 '20 at 22:33

undetected Selenium · Answer 2 · 2020-09-02T08:16:49.370

To print the product name and the quantity, use WebDriverWait with visibility_of_all_elements_located(). Either of the following locator strategies works:

Using CSS_SELECTOR and text attribute:

driver.get('https://www.rrpcanada.org/#/')
items =  [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-title")))]
quantities =  [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-bold.available")))]
for i,j in zip(items,quantities):
  print(i, j)

Using XPATH and get_attribute("innerHTML"):

driver.get('https://www.rrpcanada.org/#/')
items =  [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-title']")))]
quantities =  [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-bold available']")))]
for i,j in zip(items,quantities):
  print(i, j)

Console Output:

Surgical &amp; Reusable Masks  376,713,363 available
Disposable Gloves  66,962,093 available
Gowns and Coveralls  40,502,145 available
Respirators  22,189,273 available
Surface Wipes  20,650,831 available
Face Shields  16,535,686 available
Hand Sanitizer  11,152,890 L available
Thermometers  8,457,993 available
Testing Kits  2,110,815 available
Surface Solutions  107,452 L available
Protective Barriers  10,833 available
Ventilators  410 available
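
With the get_attribute("innerHTML") variant, the value comes back as serialized markup, so an ampersand appears as &amp; (as in the first line of the output above). If plain text is wanted, the entity can be decoded with the standard library. A minimal sketch, assuming the innerHTML holds only text and character entities with no nested tags; clean_inner_html is a hypothetical helper name.

```python
from html import unescape


def clean_inner_html(value):
    """Decode HTML entities such as &amp; and trim surrounding whitespace."""
    return unescape(value).strip()


print(clean_inner_html("Surgical &amp; Reusable Masks"))
```

The text attribute does not need this step, because it returns the rendered text rather than the markup.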

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python

Outro

Link to useful documentation:

get_attribute() method: gets the given attribute or property of the element.
text attribute: returns the text of the element.
Difference between text and innerHTML using Selenium

@Saad IMO _reliability and scalability_ is related to framework. However, locating strategies should be canonical be it a _xpath_ or a _cssSelector_. Hence was my answer. — undetected Selenium, Sep 02 '20 at 19:20

score 0 · Answer 3 · answered Sep 02 '20 at 04:27

You can locate the elements by class name: the element with class="line-item-left" holds the name of each item, and the element with class="line-item-right" holds the number of available items.

driver.find_elements_by_class_name("line-item-left")  # Item names
driver.find_elements_by_class_name("line-item-right")  # Number of items available

Note the 's' in find_elements: it returns a list of all matching elements rather than only the first one.

score 0 · Answer 4 · answered Sep 02 '20 at 06:48

This is the selector for product name:

div.critical-product-table-container div.line-item-left

And for total:

div.critical-product-table-container div.line-item-right

The approach below uses Selenium only, without BeautifulSoup.

time.sleep(...) is bad practice; use WebDriverWait instead.
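
The reason is that a fixed sleep always waits the full duration, while an explicit wait polls a condition and returns as soon as it holds, failing only once the timeout is reached. The idea can be sketched in plain Python. This is a conceptual illustration, not Selenium's implementation; wait_until and its parameters are hypothetical names.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return immediately instead of sleeping out the full timeout.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the "rows" only become available after a short delay.
start = time.monotonic()
rows = wait_until(
    lambda: ["row"] if time.monotonic() - start > 0.2 else [],
    timeout=5,
    poll_interval=0.05,
)
print(rows)
```

With this pattern a generous timeout costs nothing when the page loads quickly, whereas time.sleep(150) always costs 150 seconds.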

To pair each product name with its total, loop over both lists in parallel with the zip() function:

url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
wait = WebDriverWait(driver, 150)
product_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-left')))
totals = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-right')))

for product_name, total in zip(product_names, totals):
    print(product_name.text + '--' + total.text)

driver.quit()

You need the following imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
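
One caveat with zip(): it stops at the shorter input, so if the two locators ever matched a different number of elements, the extra rows would be dropped silently. A guard could look roughly like the sketch below, which uses plain strings in place of the .text of each located element; pair_rows is a hypothetical helper name.

```python
def pair_rows(names, totals):
    """Pair names with totals, refusing to silently drop unmatched rows."""
    if len(names) != len(totals):
        raise ValueError(
            f"got {len(names)} names but {len(totals)} totals"
        )
    return list(zip(names, totals))


# Plain strings stand in for the text of each located element.
names = ["Respirators", "Ventilators"]
totals = ["22,189,273 available", "410 available"]
for name, total in pair_rows(names, totals):
    print(name + "--" + total)
```

On Python 3.10 and later, zip(names, totals, strict=True) raises ValueError on a length mismatch as well.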
