How to extract texts from multiple textnodes within an element using Selenium and BeautifulSoup

Question

I have an element, found through BeautifulSoup, whose HTML looks like this:

  <div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>

I need Vendita Residenziale in one variable, New York in another, Appartamento in another, 100.000/200.000 in another (without the strong tag), and 130/170 in the last one.

I can extract the text of the span tag with:

x = ele.find('span', attrs = {'class': 'contract'}).get_text()

but I'm struggling to get the other information. I tried:

y = ele.find('div', attrs = {'class':'ListingData'}).get_text().replace("\n","").strip()

This gives me all of the div content, which is fine, but I need the individual pieces of information, like "result[1]" for New York, "result[2]" for Appartamento, and so on. Is there a way to do this?

score 0 · Answer 1 · answered Apr 12 '19 at 22:49

I used a mix of BeautifulSoup4 and regular expressions; you can tweak the regex as needed.

import re
import bs4

# txt holds the HTML of the div shown in the question
a = bs4.BeautifulSoup(txt, 'html.parser')
a.findAll(id="l_Contract")[0].text  # 'Vendita Residenziale'
p = re.split(r"<br />", txt)  # split the raw HTML on the <br /> tags
p[1].strip()  # 'New York'
p[2].strip()  # 'Appartamento'
re.search(r"&euro;\s+([0-9.]+/[0-9.]+)\s+-\s+<strong>", txt).group(1)  # '100.000/200.000'

Another way would be to simply do this:

a.findAll(class_="ListingData")[0].text
#Output
'\nVendita Residenziale\n    New York\n    Appartamento\nPrezzo:\n    € 100.000/200.000\n    - Metri quadri:\n    130/170\n    '

Which is easier to parse.

score 0 · Answer 2 · answered Apr 12 '19 at 23:47

Since all the text you want is inside the <div> tag, the easiest way seems to be to get the <div> text and split it on newlines ('\n') into a result list:

result = [e.strip() for e in ele.div.text.strip().split('\n')]

>>> result
['Vendita Residenziale', 'New York', 'Appartamento', 'Prezzo:', '€ 100.000/200.000', '- Metri quadri:', '130/170']

which can then be indexed as desired:

for n, res in enumerate(result):
    print(f'result[{n}] = {res}')

result[0] = Vendita Residenziale
result[1] = New York
result[2] = Appartamento
result[3] = Prezzo:
result[4] = € 100.000/200.000
result[5] = - Metri quadri:
result[6] = 130/170
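
As a possible follow-up, the label entries can be filtered out and the euro sign removed so that only the five wanted values remain. This is only a sketch: `pick_fields` is a made-up helper name, and it assumes that labels always end with a colon and that exactly five values are left over.

```python
# `result` is the list produced above, repeated here so the snippet runs on its own.
result = ['Vendita Residenziale', 'New York', 'Appartamento', 'Prezzo:',
          '\u20ac 100.000/200.000', '- Metri quadri:', '130/170']


def pick_fields(lines):
    """Hypothetical helper: return the five values, skipping the label entries."""
    # Assumption: label entries ('Prezzo:', '- Metri quadri:') end with a colon.
    values = [line for line in lines if not line.endswith(':')]
    contract, city, kind, price, area = values
    # Drop the euro sign and the whitespace around it.
    price = price.replace('\u20ac', '').strip()
    return contract, city, kind, price, area


contract, city, kind, price, area = pick_fields(result)
print(contract, city, kind, price, area, sep=' | ')
```

If the page ever adds or drops a line, the tuple unpacking raises a ValueError, which is arguably preferable to silently assigning the wrong value to a variable.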

QHarr · Answer 3 · 2019-04-13T10:57:08.077

You could use NavigableString and .contents:

from bs4 import BeautifulSoup, NavigableString

html = '''
<div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>
'''

soup = BeautifulSoup(html, 'lxml')
item1 = soup.select_one('#l_Contract').text
items = soup.select_one('.ListingData').contents
results = []
for item in items:
    if isinstance(item, NavigableString) and item.strip():
        results.append(item.strip())

item2 = results[0]
item3 = results[1]
item4 = results[2]

print(item1, ',', item2, ',', item3, ',', item4)
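
One thing to watch for: the text node after `<strong>Prezzo:</strong>` also holds the line break and the dash that precede the next label, so `results[2]` above still carries a trailing dash, and 130/170 ends up in `results[3]`. A minimal sketch of one way to clean this up follows. It assumes the values never legitimately start or end with a dash or a euro sign, and it uses `html.parser` only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='ListingData')

# Start with the span text, then walk the direct children of the div.
values = [div.find(id='l_Contract').get_text(strip=True)]
for node in div.children:
    # Keep bare text nodes only; <span>, <br> and <strong> are Tag objects.
    if isinstance(node, NavigableString):
        # Strip whitespace, the separating dash and the euro sign from both ends.
        text = node.strip(' \t\r\n-\u20ac')
        if text:
            values.append(text)

print(values)
```

Here `values` holds the five strings in document order, so they can be unpacked into separate variables.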

score 0 · Answer 4 · answered Apr 13 '19 at 02:28

This isn't really a bs4 issue. The other data you want isn't inside span tags, so extract it based on how the strings are laid out:

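for div in sp.find_all('div', class_='ListingData'):  # sp is the BeautifulSoup object
    text = div.text.strip()           # all the visible text in the div
    lines = text.split('\n')          # now you can split on newlines
    html = str(div)                   # or work on the raw markup instead
    contract = div.find('span').text  # get the stuff out of the span
    parts = html.split('<br/>')       # then split on the <br/> tags (bs4 serializes <br /> as <br/>)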

You're asking how to use bs4 to get data out of text that sits between tags or is separated by \n, so bs4 isn't really necessary here; plain string manipulation is enough.
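
To illustrate the string-only route, here is a rough sketch that uses nothing but the standard library. `split_listing` is a made-up name, and the regular expressions assume markup shaped exactly like the snippet in the question (labels in `<strong>` tags on a single line, values separated by `<br />` or newlines). Regexes on HTML are fragile, so treat it as a starting point rather than a general solution.

```python
import html
import re

raw = '''
<div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>
'''


def split_listing(markup):
    """Hypothetical helper: return the values of the listing as a list of strings."""
    text = re.sub(r'<br\s*/?>', '\n', markup)           # <br> variants become newlines
    text = re.sub(r'<strong>.*?</strong>', '\n', text)  # drop the labels, tags and content
    text = re.sub(r'<[^>]+>', '', text)                 # drop every remaining tag
    text = html.unescape(text)                          # &euro; becomes the euro sign
    # Trim whitespace, the separating dash and the euro sign from each line.
    lines = [line.strip(' \t-\u20ac') for line in text.split('\n')]
    return [line for line in lines if line]


print(split_listing(raw))
```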

score 0 · Answer 5 · answered Apr 13 '19 at 07:34

Selenium alone can extract all the required texts and you can use the following solution:

element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='ListingData']")))
text_Vendita_Residenziale = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='ListingData']/span[@class='contract' and contains(@id, 'Contract')]"))).text
text_NewYork = driver.execute_script('return arguments[0].childNodes[3].textContent;', element).strip()
text_Appartamento = driver.execute_script('return arguments[0].childNodes[5].textContent;', element).strip()
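text_100_200 = driver.execute_script('return arguments[0].childNodes[9].textContent;', element).split('-')[0].strip()  # childNodes[8] is the <strong>Prezzo:</strong> element; the price text node follows it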
text_130_170 = driver.execute_script('return arguments[0].lastChild.textContent;', element).strip()
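
Hard-coded `childNodes` indexes break as soon as the markup shifts by one node. A possibly more robust variant is to let the script return every direct text node and clean the list in Python. The sketch below is untested against the real page: `TEXT_NODES_JS`, `clean_text_nodes` and `listing_texts` are made-up names, and the cleaning step assumes values never start or end with a dash or a euro sign.

```python
# JavaScript that returns the text of every direct text-node child of an element.
TEXT_NODES_JS = """
return Array.from(arguments[0].childNodes)
    .filter(node => node.nodeType === Node.TEXT_NODE)
    .map(node => node.textContent);
"""


def clean_text_nodes(raw_nodes):
    """Strip whitespace, dashes and euro signs; drop nodes that end up empty."""
    cleaned = (raw.strip(' \t\r\n-\u20ac') for raw in raw_nodes)
    return [text for text in cleaned if text]


def listing_texts(driver, element):
    """Hypothetical helper: run the script through Selenium and clean the result."""
    return clean_text_nodes(driver.execute_script(TEXT_NODES_JS, element))


# What the script would be expected to return for the div in the question.
sample = ['\n    ', '\n    New York', '\n    Appartamento', '\n    ',
          '\n    \u20ac 100.000/200.000\n    - ', '\n    130/170\n    ']
print(clean_text_nodes(sample))
```

With a live session, `listing_texts(driver, element)` would be called with the `element` located in the first line of the answer; the span text still has to be read separately because it is not a direct text node of the div.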

Unfortunately, when I run this code it raises a TimeoutException on the first line (element = ...). I imported everything like this: `from selenium.webdriver.common.by import By; from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expected_conditions as EC; from selenium.common.exceptions import TimeoutException` — Filippo Foladore, Apr 13 '19 at 13:05
_...TimeoutException on the first line (element )..._ means the desired element wasn't uniquely identified even after _WebDriverWait_. Can you update the question with a bit more of the outerHTML? — undetected Selenium, Apr 13 '19 at 18:56
