I'm new to web scraping, and I've been using Selenium for this particular project. In this example, I'm crawling through the listings on a website and they are structured as follows...

Listing 1:

<html>
     <div class="div_class">
          <i class="first_i_class" style="i_style"> ::before </i>
          First Category: 
          <span class="span_class">5</span>
          <br>
          <i class="second_i_class" style="i_style"> ::before </i>
          Second Category: 
          <span class="span_class">3</span>
          <br>
     </div>
</html>

As you can see, the values for the first and second categories are similar, so finding all elements and then using a regex won't work here. I need to be able to get the text (5 and 3, in this example) based on the preceding text, in this case "First Category: " or "Second Category: ". Some listings, however, might skip certain categories and look like this...

Listing 2:

<html>
     <div class="div_class">
          <i class="third_i_class" style="i_style"> ::before </i>
          Third Category: 
          <span class="span_class">7</span>
          <br>
     </div>
</html>

Because the categories change between listings, I don't think I can use something like:

cat_2_value = browser.find_element_by_xpath("/html/div/span[2][@class='span_class']")

because the xpath will also change. Is there a way that I can find the text in a given span based on either

  1. The preceding text (like "First Category: ") or
  2. The preceding <i> class (like "first_i_class")?

Any help or clarifying questions are much appreciated!

undetected Selenium
DRo
  • One thought is that I could try to find all of the text associated with each tag? I think that would include both the category and the value? But I'm not sure if there is an easier way.
    – DRo Jun 29 '20 at 10:04
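
The idea in this comment can be sketched in plain Python. This is only an illustration, assuming the text Selenium returns for the div renders as one "Label: value" pair per line; that rendering, and the names LINE_RE and parse_categories, are assumptions and not part of the original page.

```python
import re

# Matches one "Label: value" line, e.g. "First Category: 5".
LINE_RE = re.compile(r"^\s*(?P<label>[^:\n]+?)\s*:\s*(?P<value>\S.*?)\s*$")


def parse_categories(block_text):
    """Map each category label to its value, skipping lines that do not match."""
    categories = {}
    for line in block_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            categories[match.group("label")] = match.group("value")
    return categories


# Assumed shape of the text returned for the div in Listing 1.
listing_1_text = "First Category: 5\nSecond Category: 3"
print(parse_categories(listing_1_text))
# prints {'First Category': '5', 'Second Category': '3'}
```

With a live browser, the input would presumably come from something like browser.find_element(By.CLASS_NAME, "div_class").text. A category that a listing skips is simply absent from the resulting dict.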

2 Answers

0

To extract the texts 5, 3, etc. relative to the preceding <i> class (first_i_class, second_i_class, etc.), induce WebDriverWait for visibility_of_element_located() and use the following XPath-based locator strategies:

  • Printing 5:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='div_class']//i[@class='first_i_class']/following::span[1]"))).text)
    
  • Printing 3:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='div_class']//i[@class='second_i_class']/following::span[1]"))).text)
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
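
Because some listings skip a category, waiting for a specific <i> class raises a TimeoutException when that category is absent. A possible workaround, sketched below under the assumption that the page has already loaded, is to use find_elements, which returns an empty list when nothing matches. The helper names span_xpath_for_i_class and category_value are hypothetical.

```python
def span_xpath_for_i_class(i_class):
    """XPath for the first <span> that follows the <i> with this class."""
    return (
        "//div[@class='div_class']"
        f"/i[@class='{i_class}']/following::span[1]"
    )


def category_value(driver, i_class):
    """Return the span text for a category, or None if the listing skips it."""
    # "xpath" is the string value behind Selenium's By.XPATH constant.
    # find_elements returns an empty list instead of raising when nothing matches.
    matches = driver.find_elements("xpath", span_xpath_for_i_class(i_class))
    return matches[0].text if matches else None
```

For Listing 2, category_value(driver, "first_i_class") would then return None rather than raising, while category_value(driver, "third_i_class") would return the span text.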
    
undetected Selenium
0

To complement @DebanjanB's answer, here are some other options, as you requested.

Based on the preceding text (like "First Category: "), where normalize-space() strips the whitespace around the text node before comparing:

//span[preceding::text()[1][normalize-space()="First Category:"]]

Output: <span class="span_class">5</span>

Based on the preceding <i> class (like "first_i_class"):

//span[preceding-sibling::i[1][@class="first_i_class"]]

or

(//span[preceding-sibling::i[1][contains(@class,"i_class")]])[1]

Output: <span class="span_class">5</span>

To get the second span, replace "first_i_class" with "second_i_class" in the first expression, or change the final [1] to [2] in the second expression.

To get all the span elements directly, use:

//span[preceding-sibling::i[1][contains(@class,"i_class")]]

Output (across Listing 1 and Listing 2): <span class="span_class">5</span> <span class="span_class">3</span> <span class="span_class">7</span>
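
To use the text-based expression from Python, one option is to build the XPath from the label and collect whatever categories a listing actually has. This is a rough sketch, not a tested Selenium recipe: the helper names span_xpath_for_label and values_by_label are hypothetical, and it assumes labels contain no double quotes.

```python
def span_xpath_for_label(label):
    """XPath for the <span> whose nearest preceding text node equals the label."""
    # Collapse whitespace the same way normalize-space() does on the XPath side.
    # Assumption: the label contains no double quote character.
    clean = " ".join(label.split())
    return f'//span[preceding::text()[1][normalize-space()="{clean}"]]'


def values_by_label(driver, labels):
    """Return {label: span text} for every label present in the current listing."""
    found = {}
    for label in labels:
        # "xpath" is the string value behind Selenium's By.XPATH constant.
        spans = driver.find_elements("xpath", span_xpath_for_label(label))
        if spans:
            found[label] = spans[0].text
    return found
```

Called as values_by_label(driver, ["First Category:", "Second Category:", "Third Category:"]), this would skip the labels a listing omits instead of failing on them.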

E.Wiest