I have an element, found through BeautifulSoup, whose HTML looks like this:

  <div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>

I need to get Vendita Residenziale in one variable, New York in another, Appartamento in another, 100.000/200.000 (without the strong tag) in another, and 130/170 in the last one.

I can extract the span tag's text by doing:

x = ele.find('span', attrs = {'class': 'contract'}).get_text()

but I'm struggling to get the other information. I tried:

y = ele.find('div', attrs = {'class':'ListingData'}).get_text().replace("\n","").strip()

This gives me all of the div's content, which is fine, but I need the individual pieces of information, e.g. result[1] for New York, result[2] for Appartamento, and so on. Is there a method for that?

undetected Selenium

5 Answers

I used a mix of BeautifulSoup4 and regular expressions; you can tweak the regex to suit.

import re
import bs4

# txt is assumed to hold the raw HTML of the ListingData div
a = bs4.BeautifulSoup(txt, 'html.parser')
a.find_all(id="l_Contract")[0].text  # "Vendita Residenziale"
p = re.split(r"<br />", txt)
p[1].strip()  # "New York"
p[2].strip()  # "Appartamento"
re.search(r"&euro;\s+([0-9.]+/[0-9.]+)\s+-\s+<strong>", txt).group(1)  # "100.000/200.000"

Another way would be to simply do this:

a.findAll(class_="ListingData")[0].text
#Output
'\nVendita Residenziale\n    New York\n    Appartamento\nPrezzo:\n    € 100.000/200.000\n    - Metri quadri:\n    130/170\n    '

Which is easier to parse.
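
As a rough sketch of what "parsing" that flattened text could look like, the snippet below starts from the output string shown above and pulls the five values out of it. The variable names (`contract`, `city`, `kind`, `price`, `area`) are my own and not part of the original answer, and it assumes the labels `Prezzo:` and `Metri quadri:` are always present.

```python
import re

# Text as returned by .text on the ListingData div (copied from the output above).
text = ('\nVendita Residenziale\n    New York\n    Appartamento\nPrezzo:\n'
        '    € 100.000/200.000\n    - Metri quadri:\n    130/170\n    ')

# Keep only the non-empty lines, stripped of surrounding whitespace.
lines = [line.strip() for line in text.splitlines() if line.strip()]

# The first three lines are the contract type, the city and the property type.
contract, city, kind = lines[:3]

# Find the two "min/max" ranges by their labels rather than by position.
price = re.search(r'Prezzo:\s*€\s*([\d.]+/[\d.]+)', text).group(1)
area = re.search(r'Metri quadri:\s*([\d.]+/[\d.]+)', text).group(1)

print(contract, city, kind, price, area, sep=' | ')
```

Matching on the labels should keep working if an extra line appears before the price, which a purely positional lookup would not.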

Since all the text you want is inside the <div> tag, the easiest way seems to be to get the <div> text and split it on newlines ('\n') into a result list:

result = [e.strip() for e in ele.div.text.strip().split('\n')]

>>> result
['Vendita Residenziale', 'New York', 'Appartamento', 'Prezzo:', '€ 100.000/200.000', '- Metri quadri:', '130/170']

which can then be indexed as desired:

for n, res in enumerate(result):
    print(f'result[{n}] = {res}')

result[0] = Vendita Residenziale
result[1] = New York
result[2] = Appartamento
result[3] = Prezzo:
result[4] = € 100.000/200.000
result[5] = - Metri quadri:
result[6] = 130/170
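
If separate variables are wanted rather than list indices, one possible follow-up is sketched below. It starts from the `result` list written out as a literal; the names `values` and `parse_range` are hypothetical helpers of mine, and treating `.` as a thousands separator is an assumption based on the Italian-style figures in the question.

```python
# The list produced by the split above, written out as a literal.
result = ['Vendita Residenziale', 'New York', 'Appartamento', 'Prezzo:',
          '€ 100.000/200.000', '- Metri quadri:', '130/170']

# Label lines end with ':'; everything else is a value we want to keep.
values = [r for r in result if not r.endswith(':')]
contract, city, kind, price, area = values

# Drop the currency sign and the space after it from the price range.
price = price.lstrip('€ ')


def parse_range(text):
    """Turn '100.000/200.000' into (100000, 200000).

    Assumes '.' is a thousands separator, not a decimal point.
    """
    low, high = text.split('/')
    return int(low.replace('.', '')), int(high.replace('.', ''))


price_min, price_max = parse_range(price)
area_min, area_max = parse_range(area)
```

The tuple unpacking raises a `ValueError` if a listing has more or fewer than five values, which is a cheap way to notice markup that differs from the sample.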

You could use NavigableString and .contents:

from bs4 import BeautifulSoup, NavigableString

html = '''
<div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>
'''

soup = BeautifulSoup(html, 'lxml')
item1 = soup.select_one('#l_Contract').text
items = soup.select_one('.ListingData').contents
results = []
for item in items:
    if isinstance(item, NavigableString) and item.strip():
        results.append(item.strip())

item2 = results[0]
item3 = results[1]
item4 = results[2]

print(item1, ',', item2, ',', item3, ',', item4)
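
With the markup above, the text node after `<strong>Prezzo:</strong>` also carries the euro sign and the trailing `-`, and the last value (130/170) ends up at `results[3]`. A possible variation, sketched below, remembers the most recent `<strong>` label and stores the following text under it. The names `plain`, `labelled` and `label` are mine, and it uses the built-in `html.parser` instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='ListingData')

plain = []      # bare text values that have no <strong> label
labelled = {}   # values keyed by the <strong> label preceding them
label = None

for node in div.contents:
    if isinstance(node, NavigableString):
        # Trim whitespace, then the currency sign and the '-' separator.
        value = node.strip().strip('-€').strip()
        if not value:
            continue
        if label is None:
            plain.append(value)
        else:
            labelled[label] = value
            label = None
    elif node.name == 'strong':
        label = node.get_text(strip=True).rstrip(':')

print(plain)
print(labelled)
```

Keying the values by label avoids depending on how many text nodes come before the price.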
QHarr

This is not really a bs4 issue: the other data you want isn't inside span tags, so extract it based on what you observe in the string.

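# 'onesiwant' is a placeholder id; select the div you actually need
div = sp.find('div', id='onesiwant')
all_text = div.text.strip()
# now you can split on '\n'
lines = [line.strip() for line in all_text.split('\n')]
# alternatively, work on the markup itself
html = str(div)
# get the stuff out of the span
span_text = div.find('span').text
# now split on the <br> tags (bs4 serializes them as '<br/>')
parts = html.split('<br/>')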

You're asking how to use bs4 to get data out of text that sits between tags or is separated by \n, so bs4 is not necessary here; plain string manipulation is enough.
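
To illustrate the string-only route, here is a minimal sketch that uses just the standard library on the raw markup from the question. The variable names `raw` and `fields` are mine. Stripping tags with a regex is fragile on arbitrary HTML, so this assumes the markup stays as simple as the sample.

```python
import html
import re

raw = '''<div class="ListingData">
    <span id="l_Contract" class="contract">Vendita Residenziale</span><br />
    New York<br />
    Appartamento<br />
    <strong>Prezzo:</strong>
    &euro; 100.000/200.000
    - <strong>Metri quadri:</strong>
    130/170
    </div>'''

# Turn every <br> variant into a newline so each value sits on its own line.
text = re.sub(r'<br\s*/?>', '\n', raw)
# Drop all remaining tags, keeping only the text between them.
text = re.sub(r'<[^>]+>', '', text)
# Decode entities such as &euro; into their characters.
text = html.unescape(text)

# Strip spaces and the '-' separator, and discard lines that end up empty.
fields = [line.strip(' -') for line in text.splitlines() if line.strip(' -')]

print(fields)
```

Here `fields` holds the same seven strings as the split-on-newline approach, with the leading `-` already removed from the `Metri quadri:` label.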

Edo Edo

Selenium alone can extract all the required texts and you can use the following solution:

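from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# assumes `driver` is an already-initialised WebDriver on the listing page
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='ListingData']")))
text_Vendita_Residenziale = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='ListingData']/span[@class='contract' and contains(@id, 'Contract')]"))).text
# childNodes indices include the whitespace text nodes of the markup shown in the question
text_NewYork = driver.execute_script('return arguments[0].childNodes[3].textContent;', element).strip()
text_Appartamento = driver.execute_script('return arguments[0].childNodes[5].textContent;', element).strip()
text_100_200 = driver.execute_script('return arguments[0].childNodes[9].textContent;', element).strip()  # text node after <strong>Prezzo:</strong>; still contains the euro sign and trailing '-'
text_130_170 = driver.execute_script('return arguments[0].lastChild.textContent;', element).strip()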
undetected Selenium
  • Unfortunately, when I run this code it raises a TimeoutException on the first line (element = ...). I imported everything like this: `from selenium.webdriver.common.by import By; from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expected_conditions as EC; from selenium.common.exceptions import TimeoutException` – Filippo Foladore Apr 13 '19 at 13:05
  • _...TimeoutException on the first line (element )..._ means the desired element wasn't uniquely identified even after _WebDriverWait_. Can you update the question with a bit more of the outerHTML? – undetected Selenium Apr 13 '19 at 18:56