
I want to scrape the German real estate website immobilienscout24.de. I would like to download the HTML of a given URL and then work with it offline. This is not intended for commercial use or publication, and I do not intend to spam the site; it is merely for coding practice. I would like to write a Python tool that automatically downloads the HTML of given immobilienscout24.de pages.

I have tried to use BeautifulSoup for this; however, the parsed HTML doesn't show the content but instead asks whether I am a robot, meaning my web scraper was detected and blocked (I can access the site in Firefox just fine). I have already set a Referer header, a delay, and a User-Agent. I have also tried using my phone's IP, with the same result.

What else can I do to avoid being detected (e.g. rotating proxies, rotating user agents, random clicks, other web scraping tools that don't get detected)? A GUI web scraping tool is not an option, as I need to control it with Python. Please give some implementable code if possible. Here is my code so far:

import time

import urllib.request
import numpy
from bs4 import BeautifulSoup

url = "https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2#"
req = urllib.request.Request(
    url,
    data=None,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
)
req.add_header('Referer', 'https://www.google.de/search?q=immoscout24')

delays = [3, 2, 4, 6, 7, 10, 11, 17]
time.sleep(numpy.random.choice(delays))  # I want to implement delays like this

page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())

username:~/Desktop$ uname -a
Linux username 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Thank you!

Sahli9876
  • I have tried several more things suggested to me here, some of them work, but only for <3 tries. If I try to access the site more than a few times (even with delays) I get blocked. Any suggestions as to what I might be able to do? – Sahli9876 Nov 02 '20 at 23:09

4 Answers


I'm the developer of Fredy (https://github.com/orangecoding/fredy). I came across the same issue, and after digging into it, I found out how they check whether you're a robot.

First, they set a localStorage value:

localstorageAvailable: true

If localStorage is available, they set a test value:

testLocalStorage: 1

If both checks succeed, a cookie called reese84=xxx is set. This is the one you want: if you send this cookie with your request, it should work. I've tested it a few times.

Note: this is not yet implemented in Fredy, so Immoscout still doesn't work on the live source, as I'm currently rewriting the code.
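
As a rough, untested sketch of how sending that cookie could look with requests (the reese84 value below is a placeholder; copy the real one from your browser, e.g. Firefox DevTools → Storage → Cookies, after passing the check once):

import requests

# placeholder: paste the real reese84 cookie value from your browser here
cookies = {'reese84': '<value copied from your browser>'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
}

url = 'https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2#'
r = requests.get(url, headers=headers, cookies=cookies)
print(r.status_code)  # expect 200 with the real page content if the cookie is accepted

Keep in mind the cookie will expire eventually, so you may have to refresh it from the browser from time to time.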

Christian
  • This sounds like it could be the solution. But I don't know how to implement that in my code. Could you perhaps, if you have the time, show me how to implement the localstorage values and receiving/sending this cookie with my request? I am not certain how to implement this with python/beautifulsoup as I have not worked with cookies before. Thanks so much! – Sahli9876 Jan 11 '21 at 13:09
  • Well, a quick google search could help you ;) See https://stackoverflow.com/questions/51682341/how-to-send-cookies-with-urllib – Christian Jan 12 '21 at 07:49

Try setting the Accept-Language HTTP header (this worked for me to get a correct response from the server):

import requests
from bs4 import BeautifulSoup

url = "https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2#"

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for h5 in soup.select('h5'):
    print(h5.get_text(strip=True, separator=' '))

Prints:

NEU Albertstadt: Praktisch geschnitten und großer Balkon
NEU Sehr geräumige 3-Raum-Wohnung in der Eisenacher Weststadt
NEU Gepflegte 3-Zimmer-Wohnung am Rosenberg in Hofheim a.Taunus
NEU ERSTBEZUG: Wohnung neu renoviert
NEU Freundliche 3,5-Zimmer-Wohnung mit Balkon und EBK in Rheinfelden
NEU Für Singles und Studenten! 2 ZKB mit EBK und Balkon
NEU Schöne 3-Zimmer-Wohnung mit 2 Balkonen im Parkend
NEU Öffentlich geförderte 3-Zimmer-Neubau-Wohnung für die kleine Familie in Iserbrook!
NEU Komfortable, neuwertige Erdgeschosswohnung in gefragter Lage am Wall
NEU Möbliertes, freundliches Appartem. TOP LAGE, S-Balkon, EBK, ruhig, Schwabing Nord/, Milbertshofen
NEU Extravagant & frisch saniert! 2,5-Zimmer DG-Wohnung in Duisburg-Neumühl
NEU wunderschöne 3 Zimmer Dachgeschosswohnung mit Einbauküche. 2er WG-tauglich.
NEU Erstbezug nach Sanierung: Helle 3-Zimmer-Wohnung mit Balkon in Monheim am Rhein
NEU Morgen schon im neuen Zuhause mit der ganzen Familie! 3,5 Raum zur Miete in DUI-Overbruch
NEU Erstbezug: ansprechende 2-Zimmer-EG-Wohnung in Bad Düben
NEU CALENBERGER NEUSTADT | 3-Zimmer-Wohnung mit großem Süd-Balkon
NEU Wohnen und Arbeiten in Bestlage von HH-Lokstedt !
NEU Erstbezug: Wohlfühlwohnen in modernem Dachgeschoss nach kompletter Sanierung!
NEU CASACONCEPT Stilaltbau-Wohnung München-Bogenhausen nahe Prinzregentenplatz
NEU schöne Wohnung mit Balkon und Laminatboden
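
If the block comes back after a few requests (see the comments below), one untested variation on the above is to reuse a single requests.Session, so that any cookies the server sets are carried across requests, combined with random delays like the ones in the question — no guarantee it helps, but cheap to try:

import random
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

# example list of result pages to fetch
urls = ['https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2#']

with requests.Session() as session:
    session.headers.update(headers)  # headers are sent with every request
    for url in urls:
        soup = BeautifulSoup(session.get(url).content, 'html.parser')
        for h5 in soup.select('h5'):
            print(h5.get_text(strip=True, separator=' '))
        time.sleep(random.uniform(3, 11))  # random pause between requests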
Andrej Kesely
  • Works perfectly. Just checked it. – SIM Nov 02 '20 at 19:43
  • Like my answer below, this works for a few requests. After requesting a few URLs (or the same URL several times), it asks if you're a robot (I just tried it out). – PApostol Nov 03 '20 at 16:27

Coming back to this question after a while...

For your information, I brought back support for Immoscout in Fredy. Have a look here: https://github.com/orangecoding/fredy#immoscout

Christian

Maybe have a go with requests; the code below seems to work fine for me:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.immobilienscout24.de/')

soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())

Another approach is to use Selenium; it's powerful, but maybe a bit more complicated.

Edit:

A possible solution using Selenium (it seems to work for me for the link you provided in the comment):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('path/to/chromedriver') # it can also work with Firefox, Safari, etc
driver.get('some_url')
soup = BeautifulSoup(driver.page_source, 'html.parser')

If you haven't used selenium before, have a look here first on how to get started.
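
If plain Selenium gets flagged after a few runs (as the comments below suggest), a further untested idea is to hide some of Chrome's more obvious automation markers via ChromeOptions — no guarantee against server-side checks, but easy to test:

from selenium import webdriver
from bs4 import BeautifulSoup

# Hypothetical hardening sketch: these flags remove some obvious automation
# markers (the navigator.webdriver hint and the "Chrome is being controlled..."
# infobar), but sites can still detect automation in other ways.
options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = webdriver.Chrome('path/to/chromedriver', options=options)
driver.get('some_url')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()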

PApostol
  • It works for me as well with www.immobilienscout24.de/ but not with any other page on the domain, such as when I look for an apartment (https://www.immobilienscout24.de/Suche/de/baden-wuerttemberg/heidelberg/wohnung-kaufen?enteredFrom=one_step_search) - anything I can do there? Thanks! – Sahli9876 Nov 02 '20 at 15:50
  • I have added some code for selenium, it seems to work for the URL in your comment! – PApostol Nov 02 '20 at 17:25
  • Thanks for the code! I have tried to use it and it worked once, after which it stopped working. Do you perhaps have any other ideas? – Sahli9876 Nov 02 '20 at 18:11
  • Can you explain what you mean by "it worked once and then it stopped working"? If you run it a second time does it not work? – PApostol Nov 02 '20 at 19:55
  • Exactly that is what I mean... After running it once, every subsequent run results in the same blocking and anti-spam website content... Could you try scraping the content of the site, say, 10 times? Does it work for you? Thanks for your time! – Sahli9876 Nov 02 '20 at 22:59
  • Yep, you're right, it seems to eventually realize that the scraping is automated and it's asking if I'm a robot. Some websites have sophisticated algorithms to tell if you're using a bot (e.g. they can look at your mouse movement patterns) and tricking them is not trivial. I am not aware of any workarounds, but [this](https://stackoverflow.com/questions/55501524/how-does-recaptcha-3-know-im-using-selenium-chromedriver) post might assist you. Make sure to update your question if you find a solution, I'm also curious about this! – PApostol Nov 03 '20 at 16:19