
I am trying to fetch a specific URL in Python using `raw_html = urlopen(url).read()`.

When I inspect `raw_html`, I find that the expected HTML/text has been replaced with a message that essentially tells me I cannot crawl the site.

However, when I pull the same URL using `curl -O` from the Unix shell, the page downloads just fine.

What is the reason for the discrepancy, and what method should I use in Python to get the HTML as I do with the curl command?

Thanks in advance for any thoughts!

user7289
    Have you considered honoring the site owner's wishes and not crawling their site? – Wooble Feb 12 '13 at 11:25
  • I would suggest reading the `robots.txt` of the website and honoring it. If it says don't crawl, then don't crawl – elssar Feb 12 '13 at 11:25
  • I am learning python and was learning how to download and parse a web page - just thought the discrepancy was curious so thought I would ask. – user7289 Feb 12 '13 at 13:25

1 Answer


When an HTTP client makes a request, it identifies itself to the server via the `User-Agent` header. In this case, the server checks whether the client looks like a bot, and if it does, it refuses access (though apparently it fails to detect curl).

You can get around this by setting the user-agent string to spoof a browser. See this question for how to do that with urllib. However, if the server's owner does not want you to crawl it, and it detects that you're doing so anyway (because you're requesting pages at too high a rate), you might find yourself blocked from accessing the site, so contacting the owner might be a better idea than spoofing.
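As a minimal sketch of the user-agent approach with Python's standard `urllib` (the URL and the exact `User-Agent` string here are placeholders, not from the original question):

```python
from urllib.request import Request, urlopen

def fetch(url, user_agent="Mozilla/5.0"):
    """Fetch a page while presenting a browser-like User-Agent.

    urllib identifies itself as "Python-urllib/x.y" by default, which some
    servers reject; curl sends its own default ("curl/x.y"), which this
    server's check apparently lets through -- hence the discrepancy.
    """
    req = Request(url, headers={"User-Agent": user_agent})
    return urlopen(req).read()

# raw_html = fetch("http://example.com/")  # substitute the actual URL
```

Note that this only changes how the client identifies itself; rate limits and IP-based blocks still apply.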

Fred Foo
  • Thanks for this - though still wondering how curl was able to do it, which is what I was thinking was interesting. – user7289 Feb 12 '13 at 13:37