
I want to scrape some data from a website whose robots.txt file contains the rules below. Doesn't this mean I can scrape anything except wp-admin? Also, is there any other way to find out whether a website allows scraping/crawling without blocking? For scraping I use the Python Scrapy framework.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
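
One way to sanity-check how these rules are read is to feed them to Python's standard-library `urllib.robotparser`. This is only a minimal sketch: the `example.com` base URL, the sample paths, and the `is_allowed` helper are hypothetical and not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, kept in the same order.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""


def is_allowed(path, user_agent="*", base="https://example.com"):
    """Return True if the rules above permit user_agent to fetch base + path."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, base + path)


print(is_allowed("/blog/some-post/"))       # a path outside /wp-admin/
print(is_allowed("/wp-admin/options.php"))  # a path inside /wp-admin/
```

Paths outside `/wp-admin/` are reported as allowed and paths inside it as disallowed. One caveat: the standard-library parser applies the first matching rule in file order, so with the rules in this order it reports `/wp-admin/admin-ajax.php` as disallowed, whereas crawlers that use longest-match precedence (as described in RFC 9309) would treat it as allowed.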
Mohib
  • Checking the `robots.txt` is always a good way to see if you're allowed to scrape. I would check the TOS and EULA as well. Yes, that is what the `robots.txt` means. – Morgan Thrapp Oct 04 '16 at 15:22
  • http://stackoverflow.com/questions/37274835/getting-forbidden-by-robots-txt-scrapy/37278895#37278895 – Rafael Almeida Oct 04 '16 at 16:54
  • I have no idea what TOS and EULA are. Would you please give a link or a few details? Thanks a lot! @MorganThrapp – Mohib Oct 04 '16 at 17:51
  • The terms of service and end user license agreement. It's going to vary from site to site. – Morgan Thrapp Oct 04 '16 at 17:53

1 Answer


Newer versions of Scrapy introduce a setting, `ROBOTSTXT_OBEY`. When it is enabled, Scrapy follows robots.txt strictly.

By default it has the value True.

As mentioned in the comments, the docs say the default value is False, but this behavior changed in the latest version of Scrapy, which now defaults to True.
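
For illustration, this is roughly what the relevant line looks like in a project's `settings.py`. Only the `ROBOTSTXT_OBEY` name and its True value come from the discussion here; the comments are my own summary, not a quote from the Scrapy docs.

```python
# settings.py (fragment) of a Scrapy project

# True:  Scrapy fetches each site's robots.txt and skips requests it disallows.
# False: robots.txt is ignored.
ROBOTSTXT_OBEY = True
```

If you rely on this behavior, it is worth opening the project's `settings.py` and checking the value explicitly rather than assuming a default.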

MrPandav
  • I didn't know that, and the value is set to True! I haven't changed it; it was already this way. Thanks. – Mohib Oct 04 '16 at 17:48
  • Yes, in recent Scrapy versions the value defaults to True for every new Scrapy project created via `scrapy startproject`. – Granitosaurus Oct 05 '16 at 10:27
  • Yes, it now defaults to `ROBOTSTXT_OBEY = True`, and the docs don't reflect the latest change. I have raised a PR on the GitHub project for this. – MrPandav Oct 06 '16 at 06:01