
While crawling a website like https://www.netflix.com, I am getting Forbidden by robots.txt: <GET https://www.netflix.com/>

ERROR: No response downloaded for: https://www.netflix.com/

deepak kumar
  • robots.txt is just a text file that robots respect; it cannot forbid you from doing anything. Netflix probably has other obstacles to scraping. – Selcuk May 17 '16 at 12:40

3 Answers


In the new version (Scrapy 1.1), released 2016-05-11, the crawler first downloads robots.txt before crawling and obeys it by default. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

ROBOTSTXT_OBEY = False

Here are the release notes
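If you prefer not to change the project-wide setting, the same flag can also be overridden for a single spider via the custom_settings class attribute. A minimal sketch, using a hypothetical spider name and the URL from the question:

import scrapy

class NetflixSpider(scrapy.Spider):
    name = "netflix"  # hypothetical spider name for illustration
    start_urls = ["https://www.netflix.com/"]

    # Per-spider override: skip robots.txt for this spider only,
    # leaving the project-wide setting untouched
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Placeholder parse: just confirm the page was downloaded
        yield {"title": response.css("title::text").extract_first()}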

lmiguelvargasf
Rafael Almeida

Netflix's Terms of Use state:

You also agree not to circumvent, remove, alter, deactivate, degrade or thwart any of the content protections in the Netflix service; use any robot, spider, scraper or other automated means to access the Netflix service;

They have their robots.txt set up to block web scrapers. If you override the setting in settings.py with ROBOTSTXT_OBEY = False, you are violating their Terms of Use, which can result in a lawsuit.

CubeOfCheese

The first thing you need to ensure is that you change the user agent in your requests; otherwise the default user agent will almost certainly be blocked.
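For instance, you can send a browser-like User-Agent header directly on the request. A minimal sketch, assuming a hypothetical spider; the UA string below is only an illustrative example, and you can instead set the USER_AGENT option in settings.py for a project-wide change:

import scrapy

class NetflixSpider(scrapy.Spider):
    name = "netflix"  # hypothetical spider name for illustration

    def start_requests(self):
        # Example browser-like User-Agent string (an assumption, not a recommendation)
        ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36")
        yield scrapy.Request(
            "https://www.netflix.com/",
            headers={"User-Agent": ua},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Downloaded %s with status %s", response.url, response.status)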

Ketan Patel