
I am trying to scrape this web page: https://www.idealista.com/venta-viviendas/madrid-provincia/

I have tried several recommendations from other Stack Overflow questions (changing the proxy, changing the user agent, etc.), but I have not been able to get a 200 status code.

This is my latest code:

library(httr)

url <- "https://www.idealista.com/venta-viviendas/madrid-provincia/"
resp <- GET(url, add_headers(
  "accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
  "accept-encoding" = "gzip, deflate, br",
  "accept-language" = "es-ES,es;q=0.9,en;q=0.8",
  "sec-fetch-mode" = "navigate",
  "sec-fetch-site" = "none",
  "sec-fetch-user" = "?1",
  "upgrade-insecure-requests" = "1",
  "user-agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
))
status_code(resp)  # never 200

I added all the headers that the browser sends for this page, but I always get the same result.


1 Answer


I think I ran into a similar problem a while back. If I remember correctly, when I visited the site in my browser, it generated a few cookies. A couple of those cookies needed to be included in the header when I used urllib.request to scrape the site.

The problem in your case is that the site idealista.com generated a lot of cookies, as shown below in the Firefox Web Developer Storage Inspector. It may be difficult to determine which, if any, of the cookies need to be included in your header.

[Screenshot: list of cookies in the Firefox Web Developer Storage Inspector]

Below is the code in Python; you should be able to add cookies in R as well (see the R sketch after the Python example). You can probably find the cookies stored in a file under your browser's profile folder (somewhere under C:\Users\Danny\AppData\ on Windows), so you don't have to copy and paste all the cookies individually.

import urllib.request as req

# build an opener whose requests carry the cookies copied from the browser
# (note: cookie pairs in a Cookie header are separated by "; ", not "&")
opener = req.build_opener()
opener.addheaders.append(('Cookie', 'firstCookieName=chocolate; secondCookieName=oatmeal'))

# fetch the page through the opener
html = opener.open('https://www.idealista.com/venta-viviendas/madrid-provincia/').read()
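In R, a rough equivalent is to pass the cookies to httr's GET() via set_cookies(). This is only a minimal sketch: the cookie names and values below are placeholders, so substitute the ones your own browser session shows in the Storage Inspector.

library(httr)

# placeholder cookie names/values; replace with the ones from your browser session
resp <- GET("https://www.idealista.com/venta-viviendas/madrid-provincia/",
            set_cookies(firstCookieName = "chocolate",
                        secondCookieName = "oatmeal"))
status_code(resp)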

Note: this is how I solved the 403 problem when scraping another site. Hopefully it will work for the idealista.com site, but I have no way to be sure. I have limited experience working with cookies and headers in these situations. Maybe another user can provide additional expertise. Good luck!

  • I was thinking the same; let me try, and I will write to you in a couple of days. – Henry Navarro Dec 07 '19 at 17:59
  • @DannyHern Yeah, in my case it just took a lot of screwing around and trying different combos. Sites like that typically try to make it a PITA to scrape data. I am more comfortable using R, but I think Python packages may provide more options for getting around methods which sites use to prevent scraping. You may even be able to write a program in Python to *control your web browser* to automatically click through the pages of search results and download source code. – sam Dec 07 '19 at 18:31
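For the browser-automation idea mentioned in the last comment, here is a minimal sketch in R (rather than Python) using the RSelenium package. It assumes a working Selenium/Firefox setup on your machine, and there is no guarantee idealista.com will not detect and block the automated browser.

library(RSelenium)

# start a Selenium-driven Firefox session (assumes a local Selenium setup)
driver <- rsDriver(browser = "firefox")
remDr <- driver$client

# load the search results page and grab its rendered source
remDr$navigate("https://www.idealista.com/venta-viviendas/madrid-provincia/")
page_source <- remDr$getPageSource()[[1]]

# clean up the session when done
remDr$close()
driver$server$stop()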