11

I use the Scrapy shell without problems on several websites, but I run into trouble when a site's robots.txt disallows access. How can I make Scrapy ignore robots.txt entirely? Thank you in advance. I'm not asking about a project created with Scrapy, but about the Scrapy shell command: scrapy shell 'www.example.com'

DARDAR SAAD
  • could you share the logs you are getting when executing the shell command? – eLRuLL Nov 26 '16 at 23:05
  • Logs : http://pastebin.com/MASXrYb9 – DARDAR SAAD Nov 27 '16 at 13:47
  • logs show that you are definitely inside a Scrapy project, which means that a `settings.py` file is available – eLRuLL Nov 27 '16 at 13:55
  • Because of robots.txt I cannot access "response": [s] response <200 http://azertyuiop.com>. You need to look at the log to understand. Also, I am working with the "scrapy shell" command, not with a Scrapy project. – DARDAR SAAD Nov 27 '16 at 14:00

2 Answers

16

In the settings.py file of your scrapy project, look for ROBOTSTXT_OBEY and set it to False.
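
As a minimal sketch, the relevant line in settings.py would look like the following. The comment is an assumption about a typical project: the template generated by `scrapy startproject` usually sets this value to True, while Scrapy's built-in default is False.

```python
# settings.py (excerpt from a Scrapy project)
# When this is False, Scrapy does not fetch or honor robots.txt
# before requesting pages.
ROBOTSTXT_OBEY = False
```

Because `scrapy shell` picks up the project's settings.py when run from inside the project directory, this change also applies to shell sessions started there.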

daniboy000
  • I modified the settings.py file, then ran the command, and the Scrapy shell took the change into account. Thanks for your solution. – DARDAR SAAD Nov 27 '16 at 16:41
10

If you run scrapy shell from a project directory, it uses the project's settings.py. If you run it outside a project, Scrapy uses its default settings. In either case you can override or add settings with the --set flag.
So to turn off the ROBOTSTXT_OBEY setting, you can simply run:

scrapy shell http://stackoverflow.com --set="ROBOTSTXT_OBEY=False"
Granitosaurus
  • When I run this command, I have an error: http://pastebin.com/fwVsU4BB – DARDAR SAAD Nov 27 '16 at 13:39
  • The scrapy shell command checks the project's spiders, matching the URL against their `allowed_domains`, and applies the matching spider's attributes and custom settings to the current shell session. There could be a problem with one of those spiders. – eLRuLL Nov 27 '16 at 13:57