How to prevent 301 redirect for web crawler

Question

I'm fairly new to web scraping and am just testing it out on a few web pages. I've successfully scraped several Amazon searches; however, in this case I get a 301 redirect, which causes a different page to be scraped.

I've tried adding a line (handle_httpstatus_list = [301]) to prevent the redirect, but this causes no data to be scraped at all.

After reading the Scrapy documentation, I thought that editing the middlewares might solve this problem, but I'm still unsure how to go about doing this.

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    handle_httpstatus_list = [301]

    start_urls = ['https://www.amazon.com/s?i=stripbooks&rh=n%3A2%2Cp_30%3AIndependently+published%2Cp_n_feature_browse-bin%3A2656022011&s=daterank&Adv-Srch-Books-Submit.x=50&Adv-Srch-Books-Submit.y=10&field-datemod=8&field-dateop=During&field-dateyear=2019&unfiltered=1&ref=sr_adv_b']

    def parse(self, response):
        # Each search result sits in an element with this class.
        SET_SELECTOR = '.s-result-item'
        for result in response.css(SET_SELECTOR):

            NAME = '.a-size-medium ::text'
            TITLE = './/h2/a/span/text()'
            LINK = './/h2/a/@href'
            yield {
                'name': result.css(NAME).extract(),
                'title': result.xpath(TITLE).extract(),
                'link': result.xpath(LINK).get()
            }

        # Check for a "next page" link before joining it: urljoin(None)
        # returns the current page URL, so the check would always pass.
        NEXT_PAGE_SELECTOR = '.a-last a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

The server is returning a 301 redirect; you either have to follow the redirect or do nothing. There is no way to "prevent" the redirect, because the server is not returning anything other than the redirect response — Iain Shelvington, Aug 04 '19 at 13:04
Does this mean it is impossible to scrape data from this web page? At least using Scrapy — Raj Gupta, Aug 04 '19 at 13:09
It means that however you are crawling that page will not work. A lot of sites will implement measures to prevent exactly the sort of thing you are doing — Iain Shelvington, Aug 04 '19 at 13:14
It means the page has moved, or is a placeholder that sends you to the real page. As said before, you are only getting back the information needed to go to the other page; the response contains no data for the page itself. — LhasaDad, Aug 04 '19 at 13:29

score 0 · Answer 1 · answered Aug 04 '19 at 14:18

I'm sorry for giving such a broad answer, but since you didn't provide much information or your crawler's stack trace, I'll try to cover what I think is the most likely reason you're having this problem and give you some pointers in that direction.

What's most likely happening is that the website redirects when some condition is not met (a wrong page, missing cookies, the user-agent, a referrer, or other request headers). If you are having a problem with session/cookie management, please refer to this post here about that topic.

Also, given that you have already identified a redirect, please take a look at handling redirects, and also check how middlewares can be used to handle behaviors in your scraper.
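As a rough sketch of what inspecting the redirect could look like: with handle_httpstatus_list = [301] set, the redirect is not followed and the 301 response itself is passed to parse, so the result selectors find nothing in it. The helper below is hypothetical (redirect_target is not part of Scrapy) and uses only the standard library. It assumes a headers object with a .get method and resolves the Location header to an absolute URL, so you can at least log where the server is sending you.

```python
from urllib.parse import urljoin

# Status codes that carry a Location header pointing somewhere else.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(status, headers, request_url):
    """Return the absolute URL a redirect response points to, or None.

    Hypothetical helper for illustration; `headers` is anything with a
    dict-like .get method.
    """
    if status not in REDIRECT_CODES:
        return None
    location = headers.get("Location")
    if location is None:
        return None
    # Scrapy exposes header values as bytes, so decode them when needed.
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # The Location value may be relative; resolve it against the request URL.
    return urljoin(request_url, location)
```

Inside the spider's parse method this could be called as redirect_target(response.status, response.headers, response.url), logging the result before deciding whether to request that URL yourself or drop the custom status handling and let Scrapy follow the redirect.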

If by any chance you're having issues with your request headers or the user-agent setting, here you can find more information about the user-agent and settings in general, or check the response object structure to create one that fits your scenario.
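For reference, a minimal sketch of the settings involved is shown below. The setting names are standard Scrapy settings, but every value here is an assumption chosen for illustration (including the user-agent string), and nothing guarantees that the site will stop redirecting once they are applied.

```python
# Illustrative values only; adjust them to match what a normal browser sends.
CUSTOM_SETTINGS = {
    # Identify as a regular browser instead of the default Scrapy user-agent.
    "USER_AGENT": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/75.0 Safari/537.36"
    ),
    # Headers added to every request unless a request overrides them.
    "DEFAULT_REQUEST_HEADERS": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    # Keep cookies between requests and log them to help with debugging.
    "COOKIES_ENABLED": True,
    "COOKIES_DEBUG": True,
    # Let the redirect middleware follow redirects, up to a small limit.
    "REDIRECT_ENABLED": True,
    "REDIRECT_MAX_TIMES": 5,
}
```

To apply these to a single spider, assign the dictionary to the spider's custom_settings class attribute (custom_settings = CUSTOM_SETTINGS); the same keys can also go in the project's settings.py.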

And of course, always check the official documentation for broader information on any package; it is very useful.
