
I got this error, the first of its kind in several days of on-and-off scraping:

    mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

However, robots.txt of the site reads:

    User-agent: *
    Disallow:

According to this source, if the site were closed to this kind of access, robots.txt would contain Disallow: /.

Does the error still mean that I should stop scraping, or that there is another issue?

Should I try to appease the server (e.g. by making requests less frequently), or just circumvent the error by adding headers, etc.?
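(To make this concrete, by "appeasing the server" I mean something along these lines; the five-second delay and the example URLs are placeholders of my own, not anything the site specifies.)

    import time
    import mechanize

    br = mechanize.Browser()

    # Placeholder URLs standing in for the pages I actually scrape.
    urls = ['http://example.com/page1', 'http://example.com/page2']

    for url in urls:
        response = br.open(url)
        data = response.read()
        # ... process data ...
        time.sleep(5)  # pause between requests to reduce load on the server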

Finally, considering the 403, is it unethical to keep scraping?


1 Answer


You could ignore the robots.txt and see what happens (might not be ethical, even for testing purposes). If you still get a 403 after that, they could be blocking your IP specifically rather than restricting access through robots.txt.
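As far as I know, mechanize runs the robots.txt check on the client side whenever robot handling is enabled, so the "ignore it and see what happens" test would look roughly like this (the URL is a placeholder):

    import mechanize

    br = mechanize.Browser()

    # mechanize checks robots.txt itself before each request by default;
    # turning the handler off skips that check entirely.
    br.set_handle_robots(False)

    response = br.open('http://example.com/some/page')  # placeholder URL
    html = response.read()
    # If the server itself is blocking you, br.open() will still raise an
    # HTTP 403 here -- that would point to an IP- or header-based block.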

You could contact the owner of the site and see if you can get their permission to override the robots.txt if you're feeling legally pinned down.

Or, like you said, ignore the robots.txt. I can't comment on the ethical ramifications because I'm not adept in that area.

– jarcobi889
  • But robots.txt itself shows no restrictions whatsoever. – Tag Feb 21 '17 at 22:39
  • That's why you try overriding it once (or twice) and see if it lifts the 403 error. It's a diagnostic check. If the 403 is still there after you bypass the robots.txt, then it's possible they blocked your IP address. – jarcobi889 Feb 21 '17 at 22:40
  • What if I get no error? (Sorry for this hypothetical talk, but I haven't decided to go through with the test just yet.) – Tag Feb 21 '17 at 22:52
  • No worries! If you get no error after overriding the robots.txt, then you can try altering your user agent or just Google all the other ways people get past robots.txt. Again, can't speak to the legal ramifications because I don't know enough about them. – jarcobi889 Feb 21 '17 at 22:56
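
For reference, "altering your user agent" in mechanize would look roughly like this (the User-Agent string below is just an example value, not a recommendation):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # skip the client-side robots.txt check

    # Send a browser-like User-Agent instead of mechanize's default one.
    br.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')]

    response = br.open('http://example.com/some/page')  # placeholder URL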