How to find the text based on preceding text or classname using Selenium and Python

Question

I'm new to web scraping, and I've been using Selenium for this particular project. In this example, I'm crawling through the listings on a website and they are structured as follows...

Listing 1:

<html>
     <div class="div_class">
          <i class="first_i_class" style="i_style"> ::before </i>
          First Category: 
          <span class="span_class">5</span>
          <br>
          <i class="second_i_class" style="i_style"> ::before </i>
          Second Category: 
          <span class="span_class">3</span>
          <br>
     </div>
</html>

As you can see, the values for the first and second categories are similar, so finding all elements and then using a regex won't work here. I need to be able to get the text (5 and 3, in this example) based on the preceding text, in this case "First Category: " or "Second Category: ". Some listings, however, might skip certain categories and look like this...

Listing 2:

<html>
     <div class="div_class">
          <i class="third_i_class" style="i_style"> ::before </i>
          Third Category: 
          <span class="span_class">7</span>
          <br>
     </div>
</html>

Because the categories change between listings, I don't think I can use something like:

cat_2_value = browser.find_element_by_xpath("/html/div/span[2][@class='span_class']")

because the xpath will also change. Is there a way that I can find the text in a given span based on either

The preceding text (like "First Category: ") or
The preceding class (like "first_i_class")?

Any help or clarifying questions are much appreciated!

One thought is that I could try to find all of the text associated with each tag? I think that would include both the category and the value? But I'm not sure if there is an easier way. — DRo, Jun 29 '20 at 10:04
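The idea in the comment above (read the whole text of the container and split it into category and value) could look roughly like the sketch below. It assumes the rendered text of the div comes back as one "Name: value" pair per line, which is what the <br> elements in the listings suggest; the function name parse_categories and the pattern are made up for illustration.

```python
import re

# One "Name: value" pair per line, e.g. "First Category: 5".
LINE_PATTERN = re.compile(r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")


def parse_categories(block_text):
    """Turn the text of one listing into a {category name: value} dict."""
    categories = {}
    for line in block_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:  # lines without a "name: value" shape are skipped
            categories[match.group("name")] = match.group("value")
    return categories


# Hypothetical usage with Selenium (assumes `browser` is an open WebDriver):
#   text = browser.find_element("class name", "div_class").text
#   values = parse_categories(text)
#   values.get("Second Category")  # None when the listing skips that category
```

Because the result is a dict, a category that a listing skips is simply absent, so .get() can be used instead of a position-dependent XPath.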

score 0 · Answer 1 · answered Jun 29 '20 at 13:29

To extract the texts 5, 3, etc. relative to the preceding class (first_i_class, second_i_class, etc.), wait for the element with WebDriverWait and visibility_of_element_located(), using the following XPath-based locator strategies:

Printing 5:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='div_class']//i[@class='first_i_class']/following::span[1]"))).text)

Printing 3:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='div_class']//i[@class='second_i_class']/following::span[1]"))).text)

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
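Since some listings skip a category, the wait above raises a TimeoutException when the class is not on the page. A minimal sketch of one way to handle that is shown below; the helper names span_after_icon_xpath and read_category are hypothetical, and returning None for a missing category is an assumption about what the caller wants.

```python
def span_after_icon_xpath(icon_class, container_class="div_class"):
    """Build the XPath for the first <span> following the <i> with this class."""
    return (
        f"//div[@class='{container_class}']"
        f"//i[@class='{icon_class}']/following::span[1]"
    )


def read_category(driver, icon_class, timeout=20):
    """Return the span text for one category, or None if the listing omits it."""
    # Imported here so the XPath helper above can be used without Selenium.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, span_after_icon_xpath(icon_class))
    try:
        element = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located(locator)
        )
    except TimeoutException:
        return None  # category not present in this listing
    return element.text
```

With an open driver, read_category(driver, "second_i_class") would then give the text of the second span on Listing 1 and None on Listing 2, at the cost of waiting out the full timeout when the category is absent.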

score 0 · Accepted Answer · answered Jul 02 '20 at 00:29

To complement @DebanjanB's answer, here are some other options, as you requested:

The preceding text (like "First Category: "):

//span[preceding::text()[1][normalize-space()="First Category:"]]

Output: 5

The preceding class (like "first_i_class"):

//span[preceding-sibling::i[1][@class="first_i_class"]]

or

(//span[preceding-sibling::i[1][contains(@class,"i_class")]])[1]

Output: 5

If you want to get the second span, replace "first_i_class" with "second_i_class" in the first expression, or change the last [1] to [2] in the second expression.

To get all the span elements directly, use:

//span[preceding-sibling::i[1][contains(@class,"i_class")]]

Output: 5 and 3 for Listing 1, 7 for Listing 2
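To use the text-based expression from Python, something along these lines should work. It is a sketch, not a tested scraper: span_after_label_xpath and value_for_label are made-up names, and the label is assumed to contain no double quotes. find_elements is used because it returns an empty list, rather than raising, when a listing skips the category.

```python
def span_after_label_xpath(label):
    """XPath for the <span> whose nearest preceding text node equals `label`."""
    # Assumption: `label` contains no double quotes, so it can be embedded as-is.
    return f'//span[preceding::text()[1][normalize-space()="{label}"]]'


def value_for_label(driver, label):
    """Return the span text that follows `label`, or None if it is missing."""
    # Imported here so the XPath helper above can be used without Selenium.
    from selenium.webdriver.common.by import By

    matches = driver.find_elements(By.XPATH, span_after_label_xpath(label))
    return matches[0].text if matches else None
```

For example, value_for_label(driver, "First Category:") would be expected to return "5" on Listing 1 and None on Listing 2.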
