
While crawling a website like https://www.netflix.com, I am getting Forbidden by robots.txt: <GET https://www.netflix.com/>

ERROR: No response downloaded for: https://www.netflix.com/

deepak kumar
  • robots.txt is just a text file that robots respect; it cannot forbid you from doing anything. Netflix probably has other obstacles to scraping. – Selcuk May 17 '16 at 12:40

3 Answers


In the new version (Scrapy 1.1), released 2016-05-11, the crawler first downloads robots.txt before crawling and obeys it by default. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

ROBOTSTXT_OBEY = False

Here are the release notes
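If you prefer not to change the project-wide setting, the same flag can also be overridden for a single spider via the custom_settings class attribute. A minimal sketch, using a hypothetical spider name and the URL from the question:

import scrapy

class NetflixSpider(scrapy.Spider):
    name = "netflix"  # hypothetical spider name for illustration
    start_urls = ["https://www.netflix.com/"]

    # Per-spider override: skip robots.txt for this spider only,
    # leaving the project-wide setting untouched
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Placeholder parse: just confirm the page was downloaded
        yield {"title": response.css("title::text").extract_first()}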

lmiguelvargasf
Rafael Almeida

Netflix's Terms of Use state:

You also agree not to circumvent, remove, alter, deactivate, degrade or thwart any of the content protections in the Netflix service; use any robot, spider, scraper or other automated means to access the Netflix service;

They have their robots.txt set up to block web scrapers. If you override the setting in settings.py with ROBOTSTXT_OBEY = False, you are violating their Terms of Use, which can result in a lawsuit.

CubeOfCheese

The first thing you need to ensure is that you change the user agent in your requests; otherwise the default user agent will almost certainly be blocked.
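For instance, you can send a browser-like User-Agent header directly on the request. A minimal sketch, assuming a hypothetical spider; the UA string below is only an illustrative example, and you can instead set the USER_AGENT option in settings.py for a project-wide change:

import scrapy

class NetflixSpider(scrapy.Spider):
    name = "netflix"  # hypothetical spider name for illustration

    def start_requests(self):
        # Example browser-like User-Agent string (an assumption, not a recommendation)
        ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36")
        yield scrapy.Request(
            "https://www.netflix.com/",
            headers={"User-Agent": ua},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Downloaded %s with status %s", response.url, response.status)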

Ketan Patel