I am using the following script to extract data from real estate website:
for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'}):
publish_value = publish.get_text().strip()
publisher.append(publish_value)
I need to extract the agency that is offering the property (look at image):
The using the above code extract the data very well, the problem is that another has a class = re-offer-type. Please look at image below:
This is a link to the website page: https://www.imoti.net/bg/obiavi/r/prodava/bulgaria/?page=1&sid=iXMpXe
To the problem. Look at image:
I need the text in red, but I somethimes, I will stress on that somethimes I get the text in purple. Therefore I need to me more speicific, however I don't know how to specify the second span tag and why I sometimes get the value that I want and sometimes I get the purple value.
Any suggestions? This is the result I get based on code above:
['продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'частно лице', 'продава', 'ЦКБ АД', 'продава', 'Константинов Реал Естейт', 'продава', 'Вариант', 'продава', 'Luximmo Finest Estates', 'продава', 'Тийм Визия', 'продава', 'частно лице', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'частно лице', 'продава', 'Dekris', 'продава', 'частно лице', 'продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'Нерро недвижими имоти', 'продава', 'Имот Експрес 99', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'Premier Estates', 'продава', 'частно лице', 'продава', 'Premier Estates', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'Нерро недвижими имоти', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'BULGARIAN PROPERTIES']
The word 'продава' means selling and should not be present. As you can see there are a lot of 'продава' strings, but not all of them are wrong which is very strange.
Full code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np
s = HTMLSession()
url = 'https://www.imoti.net/bg/obiavi/r/prodava/bulgaria/?page=1&sid=iXMpXe'
r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
prices = []
type_of_property = []
sqm_area = []
locations =[]
publisher = []
def get_prices(urls):
for price in soup.find('ul', {'class': 'list-view real-estates'}).find_all('strong', {'class': 'price'}):
price_text = price.get_text()
price_arr = re.findall('[0-9]+', price_text)
final_price = ''
for each_sub_price in price_arr:
final_price += each_sub_price
prices.append(final_price)
for property_type in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
property_type_value = ' '.join(property_type.get_text().split(',')[0].split()[1:3])
type_of_property.append(property_type_value)
for sqm in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
sqm_value = sqm.get_text().split(',')[1].split()[0]
sqm_area.append(sqm_value)
for location in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
location_value = location.get_text().split(',')[-1].strip()
locations.append(location_value)
for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'}):
publish_value = publish.get_text().strip()
publisher.append(publish_value)
return prices, type_of_property, sqm_area, locations, publisher
print(get_prices(url))