Some servers have a robots.txt file to stop web crawlers from crawling their sites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python.
- If you do this, there are presumably legal issues. – David Heffernan Dec 05 '11 at 14:09
- Downvoting this is bad since it is a legit question. However, this is a bad idea. – Noufal Ibrahim Dec 05 '11 at 14:14
- [An interesting alternative view on ignoring robots.txt](http://www.archiveteam.org/index.php?title=Robots.txt) – Acorn Dec 05 '11 at 14:42
- While I agree ignoring robots.txt is a bad idea, what do you propose are the legal issues? – BlueVoid Dec 12 '11 at 15:09
2 Answers
The documentation for mechanize has this sample code:

```python
import mechanize

br = mechanize.Browser()
# ... (other browser setup elided in the original)
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)
```
That does exactly what you want.
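For context, here is a minimal sketch of how that flag fits into a Mechanize session. The URL and User-Agent string below are placeholders of my own, not from the documentation; sites that block crawlers often filter on the User-Agent header too, so it is common to set one alongside disabling the robots.txt handling.

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt

# Placeholder User-Agent; many sites also reject requests based on this header.
br.addheaders = [('User-Agent', 'my-crawler/0.1 (+http://example.com/bot)')]

response = br.open('http://example.com/')  # placeholder URL
html = response.read()
```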

David Heffernan
- I suggest raising your issue on [flagging this question](http://stackoverflow.com/questions/8373398/creating-replacement-tapplication-for-experimentation) on meta yet again. There seem to be different opinions on how suspected copyright violations should be handled, and a definitive answer would help. – NullUserException Dec 05 '11 at 18:33
- @NullUser will do. I'll try to collect together in one place all the conflicting advice I have had, and see if we can't all come to a common viewpoint! – David Heffernan Dec 05 '11 at 18:51