I want to scrape the German real estate website immobilienscout24.de: download the HTML of a given URL and then work with it offline. This is not for commercial use or publication, and I do not intend to spam the site; it is purely for coding practice.

I would like to write a Python tool that automatically downloads the HTML of given immobilienscout24.de pages. I have tried BeautifulSoup, but the parsed HTML doesn't show the page content; instead it asks whether I am a robot, meaning my scraper was detected and blocked (I can access the site in Firefox just fine). I have already set a Referer header, a delay, and a User-Agent. I also tried using my phone's IP and got the same result.

What else can I do to avoid detection (e.g. rotating proxies, rotating user agents, random clicks, other scraping tools that don't get detected)? A GUI scraping tool is not an option, as I need to control it from Python. Please give some implementable code if possible.

Here is my code so far:
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy

url = "https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2#"

# Spoof a desktop browser User-Agent and a Google search Referer
req = urllib.request.Request(
    url,
    data=None,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'},
)
req.add_header('Referer', 'https://www.google.de/search?q=immoscout24')

# I want to implement randomized delays like this before each request
delays = [3, 2, 4, 6, 7, 10, 11, 17]
time.sleep(numpy.random.choice(delays))

page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
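Here is roughly what I was considering for rotating user agents and proxies, just a sketch: the User-Agent strings are real examples, but the proxy addresses are placeholders I made up and would have to be replaced with real ones from a proxy provider.

import random
import requests

# A small pool of desktop User-Agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]

# Placeholder proxies (TEST-NET addresses) -- not real, just to show the structure
PROXIES = [
    'http://203.0.113.1:8080',
    'http://203.0.113.2:8080',
]

def fetch(url):
    # Pick a random User-Agent and proxy for each request
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Referer': 'https://www.google.de/search?q=immoscout24',
    }
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy},
                        timeout=30)

Would something along these lines actually help against the bot check, or does the site fingerprint more than headers and IP?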
For reference, here is my system:

username:~/Desktop$ uname -a
Linux username 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
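Since the site works fine in Firefox, I also wondered whether driving a real browser from Python would help. Something like this minimal Selenium sketch (it requires geckodriver on the PATH, and I have not confirmed it gets past the bot check):

from selenium import webdriver

# Drive a real Firefox instance, which presents a normal browser fingerprint
options = webdriver.FirefoxOptions()
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2#")
    html = driver.page_source  # fully rendered HTML, usable offline with BeautifulSoup
finally:
    driver.quit()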
Thank you!