I want to crawl this page from Amazon, but Scrapy always redirects me to another page whose images are smaller than those on the first page. So I want to stop the redirect and crawl the first page.

After some searching, I found this answer. But when I changed my code like this:

yield Request(item['link'],
              meta={'dont_redirect': True,
                    'handle_httpstatus_list': [301, 302]},
              callback=self.parse)

Though it stops redirecting, it doesn't parse the first page either! My log looks like this:

2015-10-20 14:56:40 [scrapy] INFO: Scrapy 1.0.3 started (bot: amazon)
2015-10-20 14:56:40 [scrapy] INFO: Optional features available: ssl, http11
2015-10-20 14:56:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'amazon.spiders', 'SPIDER_MODULES': ['amazon.spiders'], 'BOT_NAME': 'amazon'}
2015-10-20 14:56:40 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-20 14:56:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-20 14:56:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-20 14:56:40 [scrapy] INFO: Enabled item pipelines: AmazonPipeline
2015-10-20 14:56:40 [scrapy] INFO: Spider opened
2015-10-20 14:56:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-20 14:56:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-20 14:56:41 [scrapy] DEBUG: Crawled (301) <GET http://www.amazon.com/s/ref=sr_il_ti_computers?rh=n%3A172282%2Cn%3A!493964%2Cn%3A541966%2Cn%3A565108%2Cp_n_size_browse-bin%3A7817231011&ie=UTF8&qid=1445324149&lo=computers> (referer: None)
2015-10-20 14:56:41 [scrapy] INFO: Closing spider (finished)
2015-10-20 14:56:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 361,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 423,
 'downloader/response_count': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 20, 6, 56, 41, 743646),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 10, 20, 6, 56, 40, 868227)}

Does anyone have any idea about this? Thanks!

Demonedge
  • If the server decides to send you a redirect header without returning an actual body, then of course avoiding the redirect will make you not see anything. If you find Amazon redirecting your scraper all the time, then you should try to figure out what makes them do that. – poke Oct 20 '15 at 07:15
  • @poke Thanks, I really have no idea why the server keeps doing this, the first page can be viewed in the browser with no trouble. – Demonedge Oct 20 '15 at 07:17
  • Servers often look at the HTTP headers that are sent in the request. You can try sending the same stuff that your browser sends to emulate it completely; and then remove one header at a time until you figure out which header makes the servers react in that way. – poke Oct 20 '15 at 07:19
  • @poke I am sending the full url my browser sends to the server now, but it redirects to another page automatically. – Demonedge Oct 20 '15 at 07:21
  • I’m talking about the [HTTP header](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields), not the URL. – poke Oct 20 '15 at 07:23
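The header-elimination approach suggested in the comments above can be sketched roughly as follows. This is only an illustration: `find_triggering_headers` and the `is_redirected` probe are hypothetical names, and the fake server rule in the demo is invented, not Amazon's actual behaviour.

```python
from typing import Callable, Dict, List

Headers = Dict[str, str]


def find_triggering_headers(
    browser_headers: Headers,
    is_redirected: Callable[[Headers], bool],
) -> List[str]:
    """Return the names of headers whose removal causes a redirect.

    `is_redirected` is a hypothetical probe: it sends one request with the
    given headers and returns True if the server answered with 301/302.
    """
    # Baseline: the full browser header set must itself avoid the redirect,
    # otherwise removing headers one at a time cannot tell us anything.
    if is_redirected(browser_headers):
        raise ValueError("full browser header set is still redirected")

    required = []
    for name in browser_headers:
        # Send everything except this one header and see what happens.
        trial = {k: v for k, v in browser_headers.items() if k != name}
        if is_redirected(trial):
            required.append(name)
    return required


if __name__ == "__main__":
    # Stand-in for a real server: pretend it redirects unless a browser-like
    # User-Agent is present. This rule is made up for the demo.
    def fake_probe(headers: Headers) -> bool:
        return "Mozilla" not in headers.get("User-Agent", "")

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept": "text/html",
        "Accept-Language": "en-US,en;q=0.5",
    }
    print(find_triggering_headers(headers, fake_probe))
```

Once the relevant headers are known, they can be sent from the spider through the `headers` argument of `Request`, for example `Request(url, headers={...}, callback=self.parse)`.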

1 Answer

What function are you calling with callback=self.parse?

  • You should post your Spider type and parse function here. Some Spiders won't work if you overwrite their default parse function. –  Oct 26 '15 at 09:10
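Whichever callback is used, with `dont_redirect` and `handle_httpstatus_list` set it receives the 301 response itself, as the `Crawled (301)` line in the log suggests. A minimal sketch for checking what arrives there is shown below. `describe_response` is a hypothetical helper, and the demo uses a stand-in object rather than a real Scrapy response; it only assumes the `status`, `headers` and `body` attributes.

```python
from types import SimpleNamespace

REDIRECT_CODES = (301, 302)


def describe_response(response):
    """Summarise a response so the callback can log what it really got."""
    if response.status in REDIRECT_CODES:
        # Scrapy returns header values as bytes; the stand-in may use str.
        location = response.headers.get("Location")
        if isinstance(location, bytes):
            location = location.decode("latin-1")
        return "redirect %d -> %s (body: %d bytes)" % (
            response.status, location, len(response.body))
    return "page %d (body: %d bytes)" % (response.status, len(response.body))


if __name__ == "__main__":
    # Stand-in for a 301 response; the target URL is made up for the demo.
    fake = SimpleNamespace(
        status=301,
        headers={"Location": b"http://example.com/other"},
        body=b"",
    )
    print(describe_response(fake))
```

Inside a spider this could be called as `self.logger.info(describe_response(response))` at the top of `parse`. If it reports a redirect with a near-empty body, there is nothing on that response for the selectors to extract.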