
This is the website I am crawling. I had no problems at first, but then I started getting this error:

[scrapy] DEBUG: Redirecting (meta refresh) to <GET https://www.propertyguru.com.my/distil_r_captcha.html?requestId=9f8ba25c-3673-40d3-bfe2-6e01460be915&httpReferrer=%2Fproperty-for-rent%2F1> from <GET https://www.propertyguru.com.my/property-for-rent/1>
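As far as I understand, a meta refresh redirect doesn't come from a 3xx status code but from an HTML tag in the response body, roughly like this (reconstructed from the log above, not the site's exact markup):

<meta http-equiv="refresh" content="0; url=https://www.propertyguru.com.my/distil_r_captcha.html?requestId=9f8ba25c-3673-40d3-bfe2-6e01460be915&httpReferrer=%2Fproperty-for-rent%2F1">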

The website knows I am a bot and redirects me to a page with a captcha. I think handle_httpstatus_list and dont_redirect don't work because the redirection isn't done with HTTP status codes. This is my crawler's code. Is there any way to stop this redirection?

import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    name = 'myspider'

    start_urls = [
        'https://www.propertyguru.com.my/property-for-rent/1',
    ]

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    # Passed explicitly to each request below; note that the requests
    # generated from start_urls do not pick this up automatically.
    meta = {
        'dont_redirect': True
    }


    def parse(self, response):
        # Collect the listing links from the search results page.
        items = response.css('div.header-container h3.ellipsis a.nav-link::attr(href)').getall()

        if items:
            for item in items:
                if item.startswith('/property-listing/'):
                    yield scrapy.Request(
                        url='https://www.propertyguru.com.my{}'.format(item),
                        method='GET',
                        headers=self.headers,
                        meta=self.meta,
                        callback=self.parse_items
                    )

    def parse_items(self, response):
        from scrapy.shell import inspect_response
        inspect_response(response, self)

UPDATE: I tried these settings as well, but they didn't work either.

custom_settings = {
    'DOWNLOAD_DELAY': 5,
    'DOWNLOAD_TIMEOUT': 360,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    'CONCURRENT_ITEMS': 1,
    'REDIRECT_MAX_METAREFRESH_DELAY': 200,
    'REDIRECT_MAX_TIMES': 40,
}
– gunesevitan
  • Your problem is deeper than turning off redirects for 302s. They use one of the top anti-scraping technologies, Distil Networks. At one time I was able to get around it for a while with Selenium, but eventually that ended; I guess they figured me out somehow despite my randomization efforts. – ThePyGuy May 21 '20 at 20:16

3 Answers


This website is protected by Distil Networks. They use JavaScript to determine that you are a bot. Are they letting some requests through, or none at all? You may have some success with Selenium, but in my experience they will catch on eventually. The solution involves randomizing the entire browser fingerprint: screen size and everything else you can think of. If anybody else has additional info, I would be interested to hear about it. I'm not sure about Stack Overflow's ToS on stuff like this.

If you load up a proxy like Charles Proxy so that you can see everything that's going on, you can look at all the JS they are running on you.

If they are letting no requests through at all, I'd advise trying your luck with Selenium.

If they are letting some through and redirecting others, my experience is that over time they will eventually redirect them all. What I would do if they are letting some through is set RETRY_HTTP_CODES, as in the settings below.

Just to expand on this some more: I will link to a post about overriding your navigator object with Selenium, which is what contains much of your browser fingerprint. It must be done in JS and on every page load. I can't attest to its effectiveness against Distil. See this answer.
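For illustration, a minimal sketch of that idea with Selenium and Chrome; overriding navigator.webdriver is just one example property, and I can't promise it gets past Distil:

from selenium import webdriver

driver = webdriver.Chrome()

# Inject JS that runs before any page script, on every navigation,
# via the Chrome DevTools Protocol.
driver.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {
        'source': """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """
    },
)

driver.get('https://www.propertyguru.com.my/property-for-rent/1')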

The direct answer to your question (thanks to the other answer for completing mine):

# settings.py

RETRY_HTTP_CODES = [404, 303, 304]  # plus whichever codes you want retried
RETRY_TIMES = 20

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
}

And in the spider, set the meta attribute on a particular request:

meta={'dont_redirect': True}

It's also worth noting that in a downloader middleware's process_response method you can catch the redirect response and have it throw off another request. This, in combination with RETRY_HTTP_CODES, is a good way to brute-force it if you have a good UA list and IP source.
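A minimal sketch of that middleware idea (the class name is made up; you would add it to DOWNLOADER_MIDDLEWARES yourself):

class CatchRedirectMiddleware:

    def process_response(self, request, response, spider):
        # If the site answers with a redirect instead of content, send a
        # copy of the request back to the scheduler; dont_filter=True
        # stops the dupe filter from dropping the retry.
        if response.status == 302:
            return request.replace(dont_filter=True)
        return response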

I suggest you try https://scrapinghub.com/crawlera . They recently raised their prices, but they supply good IPs and detect bans. It really is worth it if you need to get at certain information. Their network is smart, unlike most of the much cheaper IP-rotation networks. They have a trial going on, so you can verify that it works, and it's made by the developers of Scrapy, so follow the documentation for an easy install with

pip install scrapy_crawlera

Then you can retry all of them until your rotator gives you a good IP, though I suspect that over a short period of time they will all get banned.
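For reference, a minimal sketch of enabling it, based on the scrapy-crawlera documentation (the API key is a placeholder):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'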

– ThePyGuy
  • At first, I received the responses without any problem. A couple of minutes later, the website detected that I'm a bot, even though I didn't spam requests like crazy. I was testing my selectors when the website started redirecting me. An hour has passed now and I'm getting the responses without any problem again, but I'm pretty sure this won't last long and I'll get redirected again. – gunesevitan Jul 03 '19 at 10:04
  • I'm interested; let me know if you find anything out. I once worked on a project for a long time and was able to get around 40k records with Selenium, but I was unable to complete the project once they identified my Selenium method. I try to stay away from this kind of work now, public data only. If they are using Distil, chances are they don't want you to have the data, and it will just be a headache for you, since the AI will eventually see a pattern in anything you develop. If anybody has beaten it, please contact me lol. – ThePyGuy Jul 04 '19 at 10:15
  • I have one last idea left. Any number of pieces of metadata could be used for the machine learning, but if I use random delays between downloads and requests, the classifier might think I'm a human. The time spent on the website probably won't be worth it, though. I'll let you know if I find anything about this :D – gunesevitan Jul 04 '19 at 10:26
  • Try it. I did, and along with fingerprint randomization and browser randomization I also spent a random amount of time on each page. – ThePyGuy Jul 04 '19 at 11:18
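(Scrapy has built-in support for the random-delay idea discussed in these comments; a minimal sketch of the relevant settings, where each wait becomes a random value between 0.5x and 1.5x of DOWNLOAD_DELAY:)

# settings.py
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True  # the default; waits 0.5x-1.5x of DOWNLOAD_DELAY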

To stop the meta refresh, disable the MetaRefreshMiddleware downloader middleware in the project settings by setting its value to None:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
}

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#activating-a-downloader-middleware

– Alexander C.
  • This answers the direct question, but it doesn't dig into the actual problem, which is Distil. – ThePyGuy May 21 '20 at 15:17

To stop the meta refresh, simply disable it in the crawler's settings.py file:

METAREFRESH_ENABLED = False

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#metarefreshmiddleware-settings

– merlin