
I'm trying to build a data science project using information from this site, but when I try to scrape it, the site blocks me because it thinks I'm a bot. I saw a couple of posts here, such as "Python webscraping blocked", but it seems that ImmoScout has already closed off the workaround described there. Does anybody know how I can get around this? Thanks!

My Code:

import requests
from bs4 import BeautifulSoup
import random

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 "
                  "(KHTML, like Gecko) Version/4.0 Safari/534.30",
    "Accept-Language": "en-US,en;q=0.5",
}


url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?enteredFrom=one_step_search"

# cookies maps each cookie name to its value; 'xxx' is a placeholder for the real value
response = requests.get(url, cookies={'reese84': 'xxx'}, headers=headers)
webpage = response.content
print(response.status_code)

soup = BeautifulSoup(webpage, "html.parser")
print(soup.prettify())

thanks :)

jpwitt13

1 Answer


The data is generated dynamically: the page loads it through an API call that returns a JSON response to a POST request, so you can extract it using only the requests module. For example:

import requests
headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
}

api_url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber=1"

jsonData = requests.post(api_url).json()

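# Each entry's first attribute holds the price, e.g. "4.350.000 €"
for item in jsonData['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']:
    # Drop the euro sign and swap the German thousands separator "." for ","
    value = item['attributes'][0]['attribute'][0]['value'].replace('€', '').replace('.', ',')
    print(value)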

Output:

4,350,000 
285,000 
620,000
590,000
535,000
972,500
579,000
1,399,900
325,000
749,000
290,000
189,900
361,825
199,900
299,000
195,000
1,225,000
199,000
825,000
315,000 
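
If the prices are needed as numbers rather than strings, the loop above could be adapted along the following lines. This is only a sketch: the helper names `parse_german_price` and `extract_prices` are made up for illustration, the key path is copied from the example above, and `sample` is a hand-made stand-in for the real response (it assumes the raw values look like "4.350.000 €", which is what the `replace` calls above suggest).

```python
from decimal import Decimal


def parse_german_price(text):
    """Convert a German-formatted price such as '4.350.000 €' to a Decimal."""
    cleaned = text.replace("€", "").strip()
    # "." is the thousands separator and "," the decimal separator
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def extract_prices(json_data):
    """Walk the same key path as the loop above and return numeric prices."""
    entries = json_data["searchResponseModel"]["resultlist.resultlist"][
        "resultlistEntries"
    ][0]["resultlistEntry"]
    return [
        parse_german_price(entry["attributes"][0]["attribute"][0]["value"])
        for entry in entries
    ]


# Hand-made sample that mimics the assumed response shape (not real API output)
sample = {
    "searchResponseModel": {
        "resultlist.resultlist": {
            "resultlistEntries": [
                {
                    "resultlistEntry": [
                        {"attributes": [{"attribute": [{"value": "4.350.000 €"}]}]},
                        {"attributes": [{"attribute": [{"value": "285.000 €"}]}]},
                    ]
                }
            ]
        }
    }
}

prices = extract_prices(sample)
print(prices)
```

With a live response, `extract_prices(jsonData)` would take the place of the loop, provided the response still has this shape.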
Md. Fazlul Hoque
  • That was really helpful. I still don't understand why it worked with POST and not GET. My limited understanding is that POST is for sending "secret/private" messages to a website and GET is for getting information. I'm going to google this more. It worked perfectly, thanks! – jpwitt13 Apr 13 '22 at 12:05
  • @jpwitt13 When we send data to the server and the server sends back a response, that is a POST request; a GET request just fetches the URL directly, with nothing else sent along. Try searching for the difference between GET and POST requests. Thanks – Md. Fazlul Hoque Apr 13 '22 at 12:12
  • One more question: I noticed you wrote the headers dictionary but didn't use it. Why? Just curious – jpwitt13 Apr 14 '22 at 14:13
  • Most of the time the headers are mandatory. In this case the request works without them; if it didn't, you would need to pass the headers above. Passing them anyway is the safer choice: requests.post(api_url, headers=headers).json(). You will also notice that no payload is sent, because the parameters are already in the URL; that may be why it works without the headers. – Md. Fazlul Hoque Apr 14 '22 at 14:23
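
To make the GET/POST and headers discussion in these comments a little more concrete, here is a minimal sketch that only builds the two kinds of request and prints what each would carry. It uses the standard library's `urllib.request.Request` instead of requests (a swap made so that nothing is sent over the network), and it makes no claim about how the site responds today.

```python
from urllib.request import Request

url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber=1"
headers = {
    "content-type": "application/json",
    "x-requested-with": "XMLHttpRequest",
}

# Build the requests only; urlopen() is never called, so nothing is sent.
get_request = Request(url, method="GET")
# Empty body: as noted in the comments, the parameters are already in the URL.
post_request = Request(url, data=b"", headers=headers, method="POST")

for req in (get_request, post_request):
    print(req.get_method(), req.full_url)
    print("  headers:", dict(req.header_items()))
    print("  body:", req.data)
```

Both requests target the same URL. They differ in the HTTP method, the optional body, and the headers. With requests, the header-passing variant is the call quoted in the last comment, `requests.post(api_url, headers=headers)`.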