I am taking my first steps with Selenium in Python and want to extract a certain value from a webpage. The value I need to find on the webpage is the ID (Melde-ID), which is 355460. In the HTML I found the two lines containing this information:

<h3 _ngcontent-wwf-c32="" class="title"> Melde-ID: 355460 </h3><span _ngcontent-wwf-c32="">
<div _ngcontent-wwf-c27="" class="label"> Melde-ID </div><div _ngcontent-wwf-c27="" class="value">

I have been searching for about two hours for the right command to use, but I don't know what to actually search for in the HTML. The website is an HTML page with .js modules. Opening the URL with Selenium works.

(At first I tried using BeautifulSoup but was not able to open the page because of some restriction. I verified that robots.txt does not disallow anything, but the error with BeautifulSoup was "Unfortunately, a problem occurred while forwarding your request to the backend server".)

I would be thankful for any advice and hope I explained my issue clearly. The code I tried in a Jupyter Notebook with Selenium installed is as follows:

from selenium import webdriver
import codecs
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

url = "https://...."
driver = webdriver.Chrome('./chromedriver')
driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
#launch URL
driver.get(url)
#print(driver.page_source)
#Try 2
#print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[normalize-space()='Melde-ID']")))])
#close browser
driver.quit()
Prophet
  • The error you mentioned is not a typical one for `BeautifulSoup`; providing the URL could clarify what is going on with it and the connection to the server. Also, clean up your example code; it does not need all these commented-out lines. Thanks – HedgeHog Aug 18 '22 at 09:39

2 Answers

From the information you shared, we can see that the element containing the desired information does not have a class attribute with the value Melde-ID.
Its class attribute has the value title, and its text contains Melde-ID.
Also, you should use WebDriverWait with an expected condition instead of driver.implicitly_wait(0.5).
With these changes, your code could look something like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://...."
driver = webdriver.Chrome('./chromedriver')

# explicit wait of up to 20 seconds, used instead of implicitly_wait
wait = WebDriverWait(driver, 20)

# maximize browser
driver.maximize_window()
# launch URL
driver.get(url)

# wait until the element with class "title" containing "Melde-ID:" is visible, then read its text
content = wait.until(EC.visibility_of_element_located((By.XPATH, "//*[contains(@class,'title') and contains(.,'Melde-ID:')]"))).text

I added .text to extract the text from that web element.
Now content should contain the string Melde-ID: 355460.
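
If only the number is needed, the string in content can be parsed afterwards. The following is a minimal sketch, assuming the text always has the form shown in the question; extract_melde_id is a hypothetical helper name and not part of Selenium:

```python
import re


def extract_melde_id(text):
    """Return the digits following 'Melde-ID:' as an int, or None if absent."""
    match = re.search(r"Melde-ID:\s*(\d+)", text)
    return int(match.group(1)) if match else None


# Example with the text of the <h3> element from the question
print(extract_melde_id(" Melde-ID: 355460 "))  # 355460
```

With the code above, this would be called as extract_melde_id(content). A regular expression is used here so that surrounding whitespace or extra text in the heading does not matter.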

Prophet

Given the HTML:

<h3 _ngcontent-wwf-c32="" class="title"> Melde-ID: 355460 </h3>
<span _ngcontent-wwf-c32="">
    <div _ngcontent-wwf-c27="" class="label"> Melde-ID </div>
    <div _ngcontent-wwf-c27="" class="value">

To extract the text 355460, use WebDriverWait with visibility_of_element_located() to wait for the element, then split its text on the : character and take the second part. You can use either of the following locator strategies:

  • Using CSS_SELECTOR and text attribute:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3.title"))).text.split(':')[1].strip())
    
  • Using XPATH and get_attribute("innerHTML"):

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[@class='title' and text()]"))).get_attribute("innerHTML").split(':')[1].strip())
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
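
For repeated use, the wait and the split can be wrapped in a small function. This is only a sketch under the assumption that the heading is the h3.title element from the HTML above; id_from_heading and read_melde_id are hypothetical names, and the Selenium imports are done inside the function so that the parsing helper can be tried without a browser:

```python
def id_from_heading(heading_text):
    """Split 'Melde-ID: 355460' on the first ':' and return the trimmed right-hand part."""
    _, _, value = heading_text.partition(":")
    return value.strip()


def read_melde_id(driver, timeout=20):
    """Wait for the <h3 class="title"> heading and return the ID as a string."""
    # Imported lazily so id_from_heading works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    heading = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "h3.title"))
    )
    return id_from_heading(heading.text)
```

After driver.get(url), calling read_melde_id(driver) would return the ID as a string. str.partition is used instead of split(':')[1] so that a heading without a colon yields an empty string rather than raising an IndexError.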

undetected Selenium