I use the Scrapy shell without problems on several websites, but I run into trouble when a site's robots.txt does not allow access.
How can I make Scrapy ignore robots.txt (act as if it did not exist)?
Thank you in advance.
I'm not talking about a project created with Scrapy, but about the Scrapy shell command: `scrapy shell 'www.example.com'`

DARDAR SAAD
- Could you share the logs you are getting when executing the shell command? – eLRuLL Nov 26 '16 at 23:05
- Logs: http://pastebin.com/MASXrYb9 – DARDAR SAAD Nov 27 '16 at 13:47
- The logs show that you are definitely inside a Scrapy project, which means that a `settings.py` file is available. – eLRuLL Nov 27 '16 at 13:55
- Because of robots.txt I cannot access `response`: [s] response <200 http://azertyuiop.com>. You need to look at the log to understand. Also, I am working with the `scrapy shell` command and not with a Scrapy project. – DARDAR SAAD Nov 27 '16 at 14:00
2 Answers
In the `settings.py` file of your Scrapy project, look for `ROBOTSTXT_OBEY` and set it to `False`.
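As a minimal sketch, assuming the project's `settings.py` was generated by `scrapy startproject` (whose template in recent Scrapy versions ships with this setting enabled), the change is a single line:

```python
# settings.py of the Scrapy project the shell is started from.
# With this set to False, Scrapy does not fetch or honour robots.txt
# before requesting pages.
ROBOTSTXT_OBEY = False
```

Because `scrapy shell` picks up the project's settings when it is run from inside the project directory, no extra flag should be needed after this change.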

daniboy000
- I modified the `settings.py` file, then ran the command, and the Scrapy shell took the change into account for all the others. Thanks for your solution. – DARDAR SAAD Nov 27 '16 at 16:41
If you run Scrapy from a project directory, `scrapy shell` will use the project's `settings.py`. If you run it outside of a project, Scrapy will use the default settings. However, you can override and add settings via the `--set` flag.
So to turn off the `ROBOTSTXT_OBEY` setting you can simply run:
scrapy shell http://stackoverflow.com --set="ROBOTSTXT_OBEY=False"

Granitosaurus
- When I run this command, I get an error: http://pastebin.com/fwVsU4BB – DARDAR SAAD Nov 27 '16 at 13:39
- The `scrapy shell` command checks the current project's spiders, matching the URL against their `allowed_domains`, and uses the matching spider's attributes and custom settings for the current shell session. There could be a problem with one of those spiders. – eLRuLL Nov 27 '16 at 13:57