
I'm trying to scrape a real-estate website and save results in a dataframe. This is the code:

# Libraries
import requests
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd
from numpy import random


baseurl = 'https://www.engelvoelkers.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}

productlinks = []

for x in range(1, 36):
    r = requests.get(f'https://www.engelvoelkers.com/es/search/?q=&startIndex={x}&businessArea=residential&sortOrder=DESC&sortField=sortPrice&pageSize=18&facets=rgn%3Avalencia%3Bcntry%3Aspain%3Bbsnssr%3Aresidential%3B')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('div', class_="col-lg-4 col-md-4 col-sm-6 col-xs-12")
    for item in productlist:
        for link in item.find_all('a', class_='ev-property-container', href=True):
            productlinks.append(baseurl + link['href'])


houseslist = []

for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    sleep(random.uniform(1, 3))  # random delay between requests

    name = soup.find('h1', class_='ev-exposee-title ev-exposee-headline').text.strip()
    price = soup.find('div', class_='ev-key-fact-value').text.strip()
    characteristics = soup.find_all(class_='ev-key-fact-value')
    bathrooms = characteristics[2].text.strip()
    bedrooms = characteristics[1].text.strip()
    habitable_surface = characteristics[3].text.strip()
    charac2 = soup.find_all(class_='ev-exposee-detail-fact-value')
    construction_date = charac2[2].text.strip()
    location = soup.find('div', class_='ev-exposee-content ev-exposee-subtitle').text.strip()


    house = {
        'Description': name,
        'Location': location,
        'Price': price,
        'Bedrooms': bedrooms,
        'Bathrooms': bathrooms,
        'Habitable Surface': habitable_surface,
        'Building Date': construction_date
    }

    houseslist.append(house)
    print('Saving:', house['Description'])


df = pd.DataFrame(houseslist)
print(df.head(15))

And I got the following error:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.engelvoelkers.comhttps', port=443): Max retries exceeded with url: //www.engelvoelkers.com/es-es/propiedad/increible-atico-duplex-con-piscina-en-la-alameda-3546645.1132846_exp/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000019E9B30A9D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

I tried adding sleep time (random values, varied across loops), but I still get the same error.

Any idea?

PS: the structure of the code is from a John Watson Rooney video on YouTube, but I personalized it for my own case.

juanguit
  • The DNS record doesn't exist; from your stack trace it looks like you are trying to resolve "www.engelvoelkers.comhttps" instead of "www.engelvoelkers.com", which points to a problem in your code – KafKafOwn Jan 30 '22 at 20:38
  • This might help https://stackoverflow.com/questions/18478013/python-requests-exceptions-connectionerror-max-retries-exceeded-with-url – Tharu Jan 30 '22 at 20:56
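
As the first comment points out, the bad hostname in the traceback is a concatenation artifact. A minimal reproduction, using the href from the traceback (an absolute URL, which is apparently what the listing pages return):

baseurl = 'https://www.engelvoelkers.com'
href = 'https://www.engelvoelkers.com/es-es/propiedad/increible-atico-duplex-con-piscina-en-la-alameda-3546645.1132846_exp/'

# Concatenation glues the base onto an already-absolute URL, yielding
# the unresolvable host 'www.engelvoelkers.comhttps'.
print(baseurl + href)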

1 Answer


I think the problem is that you are concatenating the base URL onto an href that is already a full URL. That produces a hostname ending in comhttps (exactly what your traceback shows), which DNS cannot resolve. Try using urljoin instead:

from urllib.parse import urljoin

productlinks.append(urljoin(baseurl, link['href']))

instead of

productlinks.append(baseurl + link['href'])
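
A minimal sketch of why this works: urljoin returns the second argument unchanged when it is already an absolute URL, and resolves it against the base when it is relative (the relative path below is a hypothetical example):

from urllib.parse import urljoin

baseurl = 'https://www.engelvoelkers.com'

# href that is already absolute (as in the traceback) -> returned as-is
print(urljoin(baseurl, 'https://www.engelvoelkers.com/es-es/propiedad/increible-atico-duplex-con-piscina-en-la-alameda-3546645.1132846_exp/'))

# relative href (hypothetical) -> resolved against the base
print(urljoin(baseurl, '/es-es/propiedad/example_exp/'))
# -> https://www.engelvoelkers.com/es-es/propiedad/example_exp/

So the same append line handles both cases, whichever form the site's anchors use.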
Nathan Mills