1

I am using the following script to extract data from a real estate website:

for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'}):
    publish_value = publish.get_text().strip()
    publisher.append(publish_value)

I need to extract the agency that is offering the property (see image):

The code above extracts the data well. The problem is that another element also has class = re-offer-type. Please see the image below:

This is a link to the website page: https://www.imoti.net/bg/obiavi/r/prodava/bulgaria/?page=1&sid=iXMpXe

Now to the problem. See the image:

I need the text in red, but sometimes (and I stress, only sometimes) I get the text in purple. I therefore need to be more specific, but I don't know how to target the second span tag, or why I sometimes get the value I want and sometimes the purple value.

Any suggestions? This is the result I get based on code above:

['продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'частно лице', 'продава', 'ЦКБ АД', 'продава', 'Константинов Реал Естейт', 'продава', 'Вариант', 'продава', 'Luximmo Finest Estates', 'продава', 'Тийм Визия', 'продава', 'частно лице', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'частно лице', 'продава', 'Dekris', 'продава', 'частно лице', 'продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'Нерро недвижими имоти', 'продава', 'Имот Експрес 99', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'Premier Estates', 'продава', 'частно лице', 'продава', 'Premier Estates', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'Нерро недвижими имоти', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'BULGARIAN PROPERTIES']

The word 'продава' means 'selling' and should not be present. As you can see, there are a lot of 'продава' strings, but not every entry is wrong, which is very strange.

Full code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np

s = HTMLSession()
url = 'https://www.imoti.net/bg/obiavi/r/prodava/bulgaria/?page=1&sid=iXMpXe'

r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

prices = []
type_of_property = []
sqm_area = []
locations = []
publisher = []

def get_prices(urls):
    for price in soup.find('ul', {'class': 'list-view real-estates'}).find_all('strong', {'class': 'price'}):
        price_text = price.get_text()
        price_arr = re.findall('[0-9]+', price_text)
        final_price = ''
        for each_sub_price in price_arr:
            final_price += each_sub_price
        prices.append(final_price)
    for property_type in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
        property_type_value = ' '.join(property_type.get_text().split(',')[0].split()[1:3])
        type_of_property.append(property_type_value)
    for sqm in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
        sqm_value = sqm.get_text().split(',')[1].split()[0]
        sqm_area.append(sqm_value)
    for location in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
        location_value = location.get_text().split(',')[-1].strip()
        locations.append(location_value)
    for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'}):
        publish_value = publish.get_text().strip()
        publisher.append(publish_value)
    return prices, type_of_property, sqm_area, locations, publisher

print(get_prices(url))
maij
tsetsko
  • Is this [1::2] taken from BeautifulSoup, or is it plain Python? I haven't seen this before. – tsetsko Nov 04 '21 at 12:24
  • read [Understanding slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation), it is just a simple `slice`, it is how you get items in a list for example (and `.find_all` returns a list), just core python – Matiiss Nov 04 '21 at 12:27
  • Your comment is the answer to my question. Can you post it as a solution, so that I can vote on it and mark it as accepted? – tsetsko Nov 04 '21 at 13:13

2 Answers

1

Since you only need every second span tag, you can use slice notation to take every second element of the list: [1::2] starts at the second element and advances in steps of two. In your code it could look something like this (I also moved the first part of the lookup to a separate line so that the line is shorter and more readable):

real_estates = soup.find('ul', {'class': 'list-view real-estates'})
for publish in real_estates.find_all('span', {'class': 're-offer-type'})[1::2]:
    publish_value = publish.get_text().strip()
    publisher.append(publish_value)
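
To see what the slice does without touching the website, here is a minimal sketch in plain Python. The list `scraped` is a shortened copy of the output shown in the question, and the variable names are only illustrative. Note that this relies on the assumption that every listing contributes exactly two matching spans, in the same order.

```python
# A shortened copy of the alternating output shown in the question.
scraped = ['продава', 'ЦКБ АД', 'продава', 'частно лице', 'продава', 'Вариант']

# [0::2] starts at index 0 and takes every second item: the offer type.
offer_types = scraped[0::2]

# [1::2] starts at index 1 and takes every second item: the publisher.
publishers = scraped[1::2]

print(offer_types)
print(publishers)
```

If a listing ever lacks one of the two spans, the pairs shift and the slice silently returns wrong values, so it is worth checking that the result has the expected length.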

You can also define real_estates once, before the loops, and then reuse it instead of repeating soup.find('ul', ...) each time, as shown below:

def get_prices(urls):
    real_estates = soup.find('ul', {'class': 'list-view real-estates'})
    for price in real_estates.find_all('strong', {'class': 'price'}):
        ...
    for property_type in real_estates.find_all('div', {'class': 'inline-group'}):
        ...
    ...


Matiiss
0
all_spans_with_re_offer_class = soup.find_all('span', class_="re-offer-type")
for span_element in all_spans_with_re_offer_class:
    if span_element.parent.name == 'h3':
        print("this is the purple text")
    else:
        publisher.append(span_element.text)

The all_spans_with_re_offer_class list holds every span with the class re-offer-type (both the purple and the red text). In the HTML we can see that the purple element, the one you don't need, has an h3 parent, so you can filter it out in the for loop by checking span_element.parent.name.
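
The parent check can be tried offline with a small sketch like the one below. The markup in `SAMPLE_HTML` is hypothetical, modelled only on the description above (the purple span inside an h3, the red span outside it), and the real page may be structured differently. The helper name `extract_publishers` is also just illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the description in this thread;
# the real page structure may differ.
SAMPLE_HTML = """
<ul class="list-view real-estates">
  <li>
    <h3><span class="re-offer-type">продава</span> двустаен апартамент</h3>
    <span class="re-offer-type">ЦКБ АД</span>
  </li>
  <li>
    <h3><span class="re-offer-type">продава</span> къща</h3>
    <span class="re-offer-type">частно лице</span>
  </li>
</ul>
"""


def extract_publishers(html):
    """Return the text of every re-offer-type span that is not inside an h3."""
    soup = BeautifulSoup(html, "html.parser")
    publishers = []
    for span in soup.find_all("span", class_="re-offer-type"):
        # Skip the offer-type span, whose direct parent is the h3 heading.
        if span.parent.name == "h3":
            continue
        publishers.append(span.get_text().strip())
    return publishers


publishers = extract_publishers(SAMPLE_HTML)
print(publishers)
```

Unlike slicing, this does not depend on the two spans always appearing in pairs, only on the unwanted span being a direct child of an h3.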