I tried to access a Wikipedia page from Python:

import urllib2
a = urllib2.urlopen("http://en.wikipedia.org/wiki/LALR_parser")

This caused an error:

<urlopen error [Errno 101] Network is unreachable>

So I tried sending a User-Agent header:

url = "http://en.wikipedia.org/wiki/LALR_parser"
req = urllib2.Request(url, headers={'User-Agent': "MyBrowser"})
a = urllib2.urlopen(req)

I still get the same error.

Now I am also unable to view Wikipedia in Chrome or Firefox. It says 'chrome cannot find the page'.

But if I enter the Wikipedia URL in an anonymous proxy, the page is displayed without any problem.

What do you think the problem is? Is my IP blocked? I checked the firewall (on Ubuntu Lucid):

sudo ufw status

Status: inactive

I also tried:

sudo iptables -L -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Can somebody please help?

damon
  • Looks like you had lost internet. What did `ping google.com` produce? – pydsigner Nov 16 '12 at 03:05
  • I wouldn't scrape Wikipedia like that. Just use their API: http://www.mediawiki.org/wiki/API:Main_page – Blender Nov 16 '12 at 03:07
  • @pydsigner, ping shows the network is OK: `64 bytes from bom04s01-in-f6.1e100.net (173.194.36.6): icmp_seq=1 ttl=57 time=116 ms 64 bytes from bom04s01-in-f6.1e100.net (173.194.36.6): icmp_seq=2 ttl=57 time=149 ms...` – damon Nov 16 '12 at 03:10
  • I highly recommend using their API rather than scraping. If I had to guess, it's probably the user-agent header, along with a combination of other things. But seriously, their API is the way to go. – That1Guy Nov 16 '12 at 14:57

3 Answers

Is it possible Wikipedia is blocking it? Running your supplied code raises an Exception:

urllib2.HTTPError: HTTP Error 403: Forbidden

It seems possible that Wikipedia might be blocking (simple) programmatic access to push people to use their API.
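
For what the API route might look like, here is a minimal sketch in Python 3 (where `urllib.request` replaces `urllib2`). It only builds the request and does not send it. The `build_api_request` helper, the example User-Agent string and the choice of `prop=revisions` to get the page source are assumptions made for illustration, not something taken from the question.

```python
import urllib.parse
import urllib.request

# Standard Action API entry point for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_api_request(title, user_agent="MyScript/0.1 (contact: me@example.com)"):
    """Build (but do not send) a request for the wikitext of one page.

    The helper name and the default User-Agent are hypothetical.
    """
    params = {
        "action": "query",      # generic query module
        "prop": "revisions",    # ask for revision data ...
        "rvprop": "content",    # ... including the page content
        "rvslots": "main",      # main content slot of the revision
        "titles": title,
        "format": "json",
    }
    url = API_ENDPOINT + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_api_request("LALR_parser")
print(req.full_url)
# Sending it would be: urllib.request.urlopen(req).read()
```

A descriptive User-Agent with contact details is used here instead of a fake browser string, on the assumption that an identifiable client is less likely to be rejected.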

See Fetch a Wikipedia article with Python for more discussion about this problem.
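
Note that the two errors in this thread are of different kinds: a 403 is an answer from the server, while `[Errno 101] Network is unreachable` comes from the operating system before any HTTP request is sent, so a User-Agent header cannot affect it. A rough Python 3 sketch of telling them apart follows; the `classify` helper is hypothetical and the two exceptions are constructed by hand so that nothing touches the network.

```python
import errno
import urllib.error


def classify(exc):
    """Return a rough label for an exception raised by urllib.request.urlopen."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return "server answered with HTTP %d" % exc.code
    if isinstance(exc, urllib.error.URLError):
        reason = exc.reason
        # ENETUNREACH is the errno behind "Network is unreachable" (101 on Linux).
        if isinstance(reason, OSError) and reason.errno == errno.ENETUNREACH:
            return "no route to the host (local network or routing problem)"
        return "connection failed: %s" % reason
    return "unexpected error: %r" % exc


# Simulated errors, built by hand instead of making real requests.
unreachable = urllib.error.URLError(
    OSError(errno.ENETUNREACH, "Network is unreachable"))
forbidden = urllib.error.HTTPError(
    "http://example.invalid/", 403, "Forbidden", None, None)

print(classify(unreachable))
print(classify(forbidden))
```

In real use the exception would come from wrapping `urllib.request.urlopen(...)` in a `try`/`except urllib.error.URLError as exc:` block and passing `exc` to the helper.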

Moshe
  • I am wondering why it didn't work, since I am supplying a header so that the request passes as a browser – damon Nov 16 '12 at 03:36

Are you using a proxy? If so, try adding the following lines to your code:

import urllib2
proxy = urllib2.ProxyHandler({'http': 'user:password@your_proxy_server:proxy_port'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
urllib2.urlopen('http://www.python.org/')
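
Before hard-coding a proxy, it may be worth checking whether one is already being picked up. The sketch below uses Python 3, where `urllib.request.getproxies()` reports the proxies found in environment variables such as `http_proxy` (or in system settings on some platforms). The `http_proxy_in_use` helper name is made up for this example.

```python
import urllib.request


def http_proxy_in_use():
    """Return the proxy URL urllib would use for http:// URLs, or None."""
    return urllib.request.getproxies().get("http")


proxy = http_proxy_in_use()
if proxy is None:
    print("no HTTP proxy detected; urllib will connect directly")
else:
    print("urllib will send http:// requests through", proxy)
```
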
ymn

Your headers are not right. Try this:

import urllib2
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1'}
req = urllib2.Request("http://en.wikipedia.org/wiki/LALR_parser", headers=headers)
a = urllib2.urlopen(req)
print a.read()

Good luck!

ylsun
  • Still getting the same error. Are you sure about the header? I think you can pass any name as the User-Agent – damon Nov 16 '12 at 04:37
  • It is probable that there is something wrong with your network configuration, or you have been blocked by Wikipedia – ylsun Nov 16 '12 at 04:50
  • Or you can ping en.wikipedia.org to check whether something is wrong – ylsun Nov 16 '12 at 04:58
  • After waiting for 15 minutes, I got `64 bytes from wikipedia-lb.eqiad.wikimedia.org (208.80.154.225): icmp_seq=465 ttl=51 time=368 ms 64 bytes from wikipedia-lb.eqiad.wikimedia.org (208.80.154.225): icmp_seq=466 ttl=52 time=373 ms` – damon Nov 16 '12 at 06:12
  • Well, the "block" seems to have gone away now. I can see Wikipedia pages in the browser – damon Nov 16 '12 at 06:14
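
Since `ping` uses ICMP while a browser or `urlopen` needs a TCP connection (port 80 for plain HTTP), a successful ping does not prove the page can be fetched. A small sketch of checking TCP directly, in Python 3; the `can_connect` helper and its defaults are assumptions for illustration.

```python
import socket


def can_connect(host, port=80, timeout=5.0):
    """Return (True, None) if a TCP connection opens, else (False, error)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # OSError covers DNS failures, timeouts, refused connections
        # and "Network is unreachable".
        return False, exc


if __name__ == "__main__":
    # Compare a host that works with the one that fails.
    for host in ("google.com", "en.wikipedia.org"):
        ok, err = can_connect(host)
        print(host, "reachable" if ok else "failed: %s" % err)
```

If one host connects and the other fails with an unreachable-network or timeout error, the cause is more likely routing or the ISP than anything in the Python code.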