
Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth.

I'm using mechanize and BeautifulSoup on Python2.6.

I'm hoping for a work-around.

Diego

8 Answers


You need to tell mechanize to ignore robots.txt:

import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  # stop mechanize from honouring robots.txt
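
Putting that together with the BeautifulSoup setup mentioned in the question, here is a minimal sketch for Python 2.6 (the URL is just a placeholder, not a real page):

import mechanize
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import style used on Python 2.6

br = mechanize.Browser()
br.set_handle_robots(False)                             # skip the robots.txt check
response = br.open("http://www.example.com/some/page")  # placeholder URL
soup = BeautifulSoup(response.read())                   # parse the fetched page as usual
print soup.title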
Yuda Prawira

You can try lying about your user agent (e.g., by pretending to be a human being rather than a robot) if you want to risk possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by certain classes of robots, such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc., they may well be willing to make an exception for you.

A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?

Alex Martelli
  • Their robots.txt only disallows "/reviews/reviews.asp" - is this what you are scraping? – fmark May 17 '10 at 02:43
  • Thanks Alex, I agree... after reading more about robots.txt, this is the best approach. Cheers... @fmark i'm scraping off the video portion... http://video.barnesandnoble.com/robots.txt – Diego May 18 '10 at 00:38
  • robots.txt is not legally binding. (http://www.nytimes.com/2005/07/13/technology/13suit.html?ex=1278907200&en=377b4f3f0d459300&ei=5090&partner=rssuserland&emc=rss) – markwatson May 02 '11 at 00:54
  • In the US that may be right (the result of the law suit isn't given and the people giving their opinions may not be a representative sample anyway), but laws vary considerably across the world. In the UK it may well be a criminal offence to do what is being asked since it may well be contrary to s.1 of the Computer Misuse Act 1990. This may not be a problem for Diego, but I would counsel caution. – Francis Davey Jan 27 '14 at 20:07

The code to make a correct request:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # don't honour robots.txt
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
resp = br.open(url)           # url: the page you want to fetch
print resp.info()  # headers
print resp.read()  # content
Vlad

Mechanize automatically follows robots.txt, but this can be disabled, assuming you have permission or have thought the ethics through.

Set a flag in your browser:

browser.set_handle_robots(False) 

This ignores robots.txt.

Also, make sure you throttle your requests, so you don't put too much load on their site. (Note, this also makes it less likely that they will detect and ban you).
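
As an illustration of the throttling point, here is a minimal sketch that simply sleeps between requests (the delay and URLs are arbitrary examples):

import time
import mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)      # only with permission, as discussed above

urls = [
    "http://www.example.com/page1",   # placeholder URLs
    "http://www.example.com/page2",
]

for url in urls:
    response = browser.open(url)
    # ... process response.read() here ...
    time.sleep(2)                     # wait a couple of seconds between requests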

wisty

The error you're receiving is not related to the user agent. By default, mechanize checks robots.txt directives automatically when you use it to navigate to a site. Use the .set_handle_robots(False) method of mechanize.Browser to disable this behavior.

Tom

Set your User-Agent header to match some real IE/FF User-Agent.

Here's my IE8 user-agent string:

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; AskTB5.6)
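
With mechanize, that string could be attached to every request via addheaders; a minimal sketch (whether the site actually keys its behaviour off this header is an assumption):

import mechanize

ie8_ua = ("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; "
          "Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; "
          ".NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; AskTB5.6)")

br = mechanize.Browser()
br.addheaders = [('User-agent', ie8_ua)]   # present a browser-like User-Agent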
Stefan Kendall

Without debating the ethics of this, you could modify the headers to look like the Googlebot, for example. Or is the Googlebot blocked as well?

Steve Robillard
  • I don't see any _ethical_ problem but the _legal_ ones could get even worse (whoever you're impersonating could detect you and sue the expletive-deleted out of you, not just B&N and your ISP). "Do this illegal thing and just don't get caught" isn't prudent advice, even when no ethical issues pertain (and, I repeat, I don't see anything _immoral_ in breaking these particular laws -- it's just too risky for far too little potential gain;-). – Alex Martelli May 17 '10 at 00:51
  • A legal issue is an ethical issue in this case: do you follow it or not? – Steve Robillard May 17 '10 at 00:53

It seems you actually have to do less work to bypass robots.txt, at least according to this article. So you might have to remove some code to ignore the filter.

BrunoLM
  • That article is more about custom code to scrape websites. If you are using some library, the library might be already respecting robots.txt. – Niyaz Oct 16 '12 at 04:39