How to extract the text H MATTHEWS from the html using Selenium and Python

Question

Using the 'contains' function, how can I extract information from this type of HTML structure? I am trying to scrape the text "H MATTHEWS".

HTML:

<p>
<strong>Date Published:</strong>
&nbsp; 20 APRIL 2020
<br>
<strong>Closing Date / Time:</strong>
&nbsp;TUESDAY, 05 MAY 2020
<br>
<strong>Enquiries:</strong>
<br>
Contact Person: H MATTHEWS
<br>
Email:&nbsp;
</p>


Stack Overflow is neither a forum nor a tutorial, code-writing, or homework service. This is a Q&A site where *specific* programming questions (usually, but not always, including some code) get *specific* answers. Please take the [tour] and carefully read through the [help] to learn more about the site, including [what is on-topic](https://stackoverflow.com/help/on-topic) and [what is not](https://stackoverflow.com/help/dont-ask), and how to [ask a good question](https://stackoverflow.com/help/how-to-ask). Please also follow the [question checklist](https://meta.stackoverflow.com/q/260648). — MattDMo, Jul 21 '20 at 20:37
There's no way anyone can answer this question without knowing what the HTML actually *is*. — MattDMo, Jul 21 '20 at 20:38
@MattDMo I am new to Stack Overflow and am still trying to upload the HTML structure. — Muntaaha Rahman, Jul 21 '20 at 20:41
@MattDMo can you help me now?
Date Published: 20 APRIL 2020
Closing Date / Time: TUESDAY, 05 MAY 2020
Enquiries:
Contact Person: H MATTHEWS
Email: — Muntaaha Rahman, Jul 21 '20 at 20:42
Please [edit] your question and post the code you have tried so far, the output you're getting (if any), and the **full text** of any errors or tracebacks. — MattDMo, Jul 21 '20 at 20:50

undetected Selenium · Answer 1 · 2020-07-21T21:23:23.270

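The text Contact Person: H MATTHEWS is in a text node directly under the <p>, not inside an element of its own, so a locator cannot select it directly. To print the text, induce WebDriverWait for visibility_of_element_located() on the parent <p> and then read the text node using either of the following Locator Strategies: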

Using XPATH and childNodes:

print(driver.execute_script('return arguments[0].childNodes[9].textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]")))).strip())

Using XPATH and splitlines():

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]"))).get_attribute("innerHTML").splitlines()[-3])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
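Both snippets above pick the text by position (childNodes[9], splitlines()[-3]), so they stop working if the markup gains or loses a node or a line. Since the question asks about contains(), here is a minimal sketch that locates the <p> with an XPath contains() and then finds the line by its label rather than by index. It assumes the label text "Contact Person:" is stable; the names CONTACT_XPATH, find_labelled_line and contact_line_from_page are hypothetical, not part of Selenium.

```python
import html
import re

# XPath using contains(): matches the <p> whose text includes the label.
# Selenium locators must resolve to elements, not text nodes, so the <p> is
# located first and the wanted line is picked out of its innerHTML afterwards.
CONTACT_XPATH = "//p[contains(., 'Contact Person:')]"

# innerHTML of the <p> from the question, used here to try the helper offline.
SAMPLE_HTML = """
<strong>Date Published:</strong>
&nbsp; 20 APRIL 2020
<br>
<strong>Closing Date / Time:</strong>
&nbsp;TUESDAY, 05 MAY 2020
<br>
<strong>Enquiries:</strong>
<br>
Contact Person: H MATTHEWS
<br>
Email:&nbsp;
"""


def find_labelled_line(inner_html, label="Contact Person:"):
    """Return the first text line that starts with label, or None."""
    # Turn every tag (<br>, <strong>, ...) into a line break so each
    # text node ends up on its own line.
    text = re.sub(r"<[^>]+>", "\n", inner_html)
    # Decode entities such as &nbsp; into real characters.
    text = html.unescape(text)
    for line in text.splitlines():
        line = line.replace("\xa0", " ").strip()
        if line.startswith(label):
            return line
    return None


def contact_line_from_page(driver, timeout=20):
    """Wait for the <p> on a live page, then search its innerHTML."""
    # Imported here so the pure-Python helper above works without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    paragraph = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.XPATH, CONTACT_XPATH))
    )
    return find_labelled_line(paragraph.get_attribute("innerHTML"))


print(find_labelled_line(SAMPLE_HTML))
```

With a live page you would call contact_line_from_page(driver) instead; the offline call on SAMPLE_HTML only shows the label search working against the markup posted in the question.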

If your use case is to extract only the text H MATTHEWS, you can use either of the following solutions, both of which additionally need import re:

Using XPATH and childNodes:

print(re.split('[:]', driver.execute_script('return arguments[0].childNodes[9].textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]")))).strip())[1].strip())

Using XPATH and splitlines():

print(re.split('[:]', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//p[./strong[text()='Date Published:']]"))).get_attribute("innerHTML").splitlines()[-3])[1].strip())
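Splitting on ':' and taking index [1] leaves the space that follows the colon, and raises IndexError when the line has no colon at all. As an alternative sketch, str.partition handles both cases without a regular expression; the helper name value_after_label is hypothetical.

```python
def value_after_label(line, sep=":"):
    """Return the text after the first sep, stripped, or None if sep is absent."""
    # partition() always returns three parts; the middle one is empty
    # when the separator was not found.
    _, found, value = line.partition(sep)
    return value.strip() if found else None


print(value_after_label("Contact Person: H MATTHEWS"))
```

The input line can come from either approach above, for example the string returned by the childNodes or splitlines() expression.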

