I'm using the Selenium Chrome WebDriver in Python to extract all the text from the description section of a geocaching website (here's a sample page if anyone wants to take a look). The text is stored in several <p> elements inside one <span> element. I cannot figure out how to collect all the <p> elements and save their text as one string separated by spaces.

I tried both of the approaches below. The first one output only the text of the first <p> element; the second sometimes output only the first and sometimes more, but never all of them. I couldn't figure out why the second one returns an inconsistent number of elements.

# Attempt 1: collect the <p> elements inside the description <span>
desc_span = driver.find_element(By.XPATH, '/html/body/form[1]/main/div/div/div[2]/div[9]/span')
p_elements = desc_span.find_elements(By.TAG_NAME, 'p')
desc = ' '.join(p_element.text for p_element in p_elements)
print(desc)

# Attempt 2: collect every direct child of the description <div>
desc_div = driver.find_element(By.XPATH, '/html/body/form[1]/main/div/div/div[2]/div[9]')
all_elements = desc_div.find_elements(By.XPATH, '*')
desc = ' '.join(element.text for element in all_elements)
print(desc)

2 Answers

I think you should wait for all the elements matched by the selector to become visible, and probably change the selector as well.

Try the code below:

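from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for every matching <p> to be visible
wait = WebDriverWait(driver, 10)
p_elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.UserSuppliedContent p')))
desc = ' '.join(p_element.text for p_element in p_elements)
print(desc)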

Using your link (I am not logged in), the output is:

It's a "W" Thang. This is my first cache that I have submitted. Placed with permission. Magnetic

This corresponds to all the <p> tags in the description section. So you're on the right track; you just need to wait until all the elements are rendered.
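If the number of paragraphs still varies between runs, one option is to poll until the count of matched elements stops changing. The following is only a rough sketch of that idea: wait_until_stable is a hypothetical helper, not part of Selenium, and it assumes the lookup is passed in as a zero-argument callable.

```python
import time


def wait_until_stable(fetch, timeout=10.0, poll=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns the same non-zero number of items twice in a row.

    fetch is any zero-argument callable returning a list. With Selenium it
    could be (assumption, not tested against the site):
        lambda: driver.find_elements(By.CSS_SELECTOR, '.UserSuppliedContent p')
    """
    deadline = clock() + timeout
    previous = fetch()
    while clock() < deadline:
        sleep(poll)
        current = fetch()
        # Accept the result once the count is non-zero and unchanged since the last poll.
        if current and len(current) == len(previous):
            return current
        previous = current
    raise TimeoutError(f"element count did not settle within {timeout} seconds")
```

The returned list can then be joined exactly as in the code above. A stable count over one polling interval is only a heuristic, so a page that loads in slow bursts may need a longer poll value.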

Yaroslavm

The desired text is inside <p> tags that have a <div class="UserSuppliedContent"> ancestor.
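To make that ancestor relationship concrete, here is a small offline sketch that uses only the standard library's html.parser on a hand-written snippet. SAMPLE_HTML and the DescriptionParagraphs class are assumptions for illustration, modeled on the structure described in the question; they are not the site's real markup. The same parser could in principle be fed driver.page_source, but that is untested here.

```python
from html.parser import HTMLParser


class DescriptionParagraphs(HTMLParser):
    """Collect the text of every <p> nested inside <div class="UserSuppliedContent">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while inside the target div (nested divs are counted)
        self.in_p = False    # True while inside a <p> that belongs to the target div
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.div_depth or "UserSuppliedContent" in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


# Hypothetical markup: two paragraphs inside the description, one outside it.
SAMPLE_HTML = """
<div class="UserSuppliedContent">
  <span>
    <p>It's a "W" Thang.</p>
    <p>Magnetic</p>
  </span>
</div>
<p>Outside the description</p>
"""

parser = DescriptionParagraphs()
parser.feed(SAMPLE_HTML)
print(" ".join(parser.paragraphs))
```

Only the two paragraphs under the UserSuppliedContent div are collected, which is the same scoping that the CSS selector "div.UserSuppliedContent p" expresses.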


Solution

To extract all the text from the description section of the geocaching page into a list, you need to use WebDriverWait with visibility_of_all_elements_located(). You can use the following locator strategy:

driver.get(url='https://www.geocaching.com/geocache/GC4ZJ9R')
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.UserSuppliedContent p")))])

Console Output:

['It\'s a "W" Thang. This is my first cache that I have submitted. Placed with permission.', 'Magnetic']

Further, if you want to take all the <p> elements and save them as one string separated by spaces, you can use join() with a single space as the separator:

driver.get(url='https://www.geocaching.com/geocache/GC4ZJ9R')
print(" ".join(my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.UserSuppliedContent p")))))

Console Output:

It's a "W" Thang. This is my first cache that I have submitted. Placed with permission. Magnetic
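If some paragraphs are empty or contain line breaks, the joined string can end up with stray whitespace. Below is a minimal sketch of a cleanup step; join_paragraphs is a hypothetical helper name, and the SimpleNamespace objects merely stand in for the WebElement list so the snippet runs without a browser.

```python
from types import SimpleNamespace


def join_paragraphs(elements, sep=" "):
    """Join the .text of each element, skipping paragraphs that are blank."""
    # Collapse runs of whitespace (newlines, tabs) inside each paragraph to single spaces.
    texts = (" ".join(element.text.split()) for element in elements)
    return sep.join(text for text in texts if text)


# Stand-ins for the WebElements returned by the wait (assumption for illustration).
fake_elements = [
    SimpleNamespace(text='It\'s a "W" Thang.\nPlaced with permission. '),
    SimpleNamespace(text="   "),
    SimpleNamespace(text="Magnetic"),
]
print(join_paragraphs(fake_elements))
```

With real elements, the list returned by WebDriverWait(...).until(...) would be passed in place of fake_elements.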

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
undetected Selenium