How can I use the XPath 'contains' function to extract information from this type of HTML structure? I am trying to scrape the text "H MATTHEWS".

HTML:

<p>
<strong>Date Published:</strong>
&nbsp; 20 APRIL 2020
<br>
<strong>Closing Date / Time:</strong>
&nbsp;TUESDAY, 05 MAY 2020
<br>
<strong>Enquiries:</strong>
<br>
Contact Person: H MATTHEWS
<br>
Email:&nbsp;
</p>

  • Stack Overflow is neither a forum nor a tutorial, code-writing, or homework service. This is a Q&A site where *specific* programming questions (usually, but not always, including some code) get *specific* answers. Please take the [tour] and carefully read through the [help] to learn more about the site, including [what is on-topic](https://stackoverflow.com/help/on-topic) and [what is not](https://stackoverflow.com/help/dont-ask), and how to [ask a good question](https://stackoverflow.com/help/how-to-ask). Please also follow the [question checklist](https://meta.stackoverflow.com/q/260648). – MattDMo Jul 21 '20 at 20:37
  • There's no way anyone can answer this question without knowing what the HTML actually *is*. – MattDMo Jul 21 '20 at 20:38
  • @MattDMo I am new to Stack Overflow, and I am still trying to upload the HTML structure. – Muntaaha Rahman Jul 21 '20 at 20:41
  • @MattDMo Can you help me now?

    Date Published:   20 APRIL 2020
    Closing Date / Time:  TUESDAY, 05 MAY 2020
    Enquiries:
    Contact Person: H MATTHEWS
    Email: 

    – Muntaaha Rahman Jul 21 '20 at 20:42
  • Please [edit] your question and post the code you have tried so far, the output you're getting (if any), and the **full text** of any errors or tracebacks. – MattDMo Jul 21 '20 at 20:50

1 Answer

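The text Contact Person: H MATTHEWS is in a text node that is a direct child of the <p> element, so it cannot be located as an element on its own. To print the text, use WebDriverWait with visibility_of_element_located() to wait for the parent <p>, then read the text node with either of the following Locator Strategies: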

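  • Using XPATH and childNodes (the index 9 depends on how many element and whitespace text nodes precede the target in the live page):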

    print(driver.execute_script('return arguments[0].childNodes[9].textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]")))).strip())
    
  • Using XPATH and splitlines():

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]"))).get_attribute("innerHTML").splitlines()[-3])
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

If your use case is to extract only the text H MATTHEWS, you can use either of the following solutions (both additionally require import re):

  • Using XPATH and childNodes:

    print(re.split('[:]', driver.execute_script('return arguments[0].childNodes[9].textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]")))).strip())[1].strip())
    
  • Using XPATH and splitlines():

    print(re.split('[:]', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]"))).get_attribute("innerHTML").splitlines()[-3])[1].strip())
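
Since the question asks specifically about contains, here is a minimal sketch of that variant. It assumes the label "Contact Person:" appears in exactly one <p> on the page. The helper names extract_contact_person and scrape_contact_person are hypothetical, not part of Selenium. Instead of indexing into childNodes, it locates the paragraph with contains(), reads its visible text, and pulls the name out with a regular expression:

```python
import re


def extract_contact_person(block_text):
    """Return the text after 'Contact Person:' on its line, or None if absent."""
    # '.' does not match newlines, so the capture stops at the end of the line.
    match = re.search(r"Contact Person:\s*(.+)", block_text)
    return match.group(1).strip() if match else None


def scrape_contact_person(driver, timeout=20):
    """Hypothetical wrapper: find the <p> via contains() and parse its text."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # contains(., '...') tests the combined text of the <p> and its descendants.
    locator = (By.XPATH, "//p[contains(., 'Contact Person:')]")
    paragraph = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located(locator)
    )
    return extract_contact_person(paragraph.text)
```

With an open driver on the page, print(scrape_contact_person(driver)) should print the name for the HTML shown in the question. Because the lookup keys on the label text rather than a node position, it does not depend on how many whitespace nodes the page contains.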
    

undetected Selenium