
I am writing a small program that fetches all hyperlinks from a webpage given its URL, but it seems the network I am on uses a proxy and the program cannot fetch the page. My code:

import sys
import urllib
import urlparse

from bs4 import BeautifulSoup


def process(url):
    # Download the page and parse it.
    page = urllib.urlopen(url)
    text = page.read()
    page.close()
    soup = BeautifulSoup(text, 'html.parser')
    # Write every link, resolved against the page URL, one per line.
    with open('s.txt', 'w') as out:
        for tag in soup.findAll('a', href=True):
            href = urlparse.urljoin(url, tag['href'])
            print href
            out.write(href + '\n')


def main():
    if len(sys.argv) == 1:
        print 'No url !!'
        sys.exit(1)
    for url in sys.argv[1:]:
        process(url)


if __name__ == '__main__':
    main()
blueteeth
  • Based on your question, your network may or may not have a proxy in use. Can you be a little more specific, or just stop by your admins and ask? – frlan Sep 22 '15 at 08:59
  • Yes, it has a proxy. I tried it at home and it worked fine, but when I took it to my department to show my teacher it didn't work. This is the error: `IOError: [Errno socket error] [Errno -2] Name or service not known` – Shailang Kharsati Sep 22 '15 at 11:05
  • This is the proxy I use to connect: "proxy4.nehu.ac.in:3128". How do I put it in the code of my program? Please help, I am stuck with it. – Shailang Kharsati Sep 22 '15 at 11:22
  • OK, I will check on this and come back to you if I run into a problem. At the moment I cannot test it because I have to try it at the university itself, since I don't have a proxy network to test on. Is that OK with you? – Shailang Kharsati Sep 22 '15 at 11:56
  • You can easily set up a proxy on your own. E.g. squid is quite popular. – frlan Sep 22 '15 at 11:57
  • OK, I didn't know. Thanks, I will surely take a look. Thanks for your time, cheers. – Shailang Kharsati Sep 22 '15 at 12:01

2 Answers

3

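You could use the requests module instead, which accepts a proxy mapping directly.

import requests

# Replace host:port with your proxy, e.g. proxy4.nehu.ac.in:3128.
# If the proxy requires authentication, use 'http://user:pass@host:port/' instead.
proxies = {'http': 'http://host:port/'}

r = requests.get(url, proxies=proxies)
text = r.text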
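
For the proxy mentioned in the comments on the question, the mapping could be built roughly as follows. This is only a sketch: `make_proxies` and `fetch_text` are hypothetical helper names, the code is written for Python 3, and it assumes the same proxy serves both HTTP and HTTPS traffic, which may not hold on every network.

```python
from urllib.parse import quote


def make_proxies(host, port, user=None, password=None):
    """Build the mapping that requests accepts for its proxies argument."""
    credentials = ""
    if user is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        credentials = quote(user, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    proxy_url = "http://{}{}:{}".format(credentials, host, port)
    # Assumption: one proxy handles both schemes.
    return {"http": proxy_url, "https": proxy_url}


def fetch_text(url, proxies):
    """Download url through the given proxies and return the decoded body."""
    # Imported here so make_proxies can be used without requests installed.
    import requests  # third-party: pip install requests

    response = requests.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Proxy address taken from the asker's comment; adjust for your network.
    proxies = make_proxies("proxy4.nehu.ac.in", 3128)
    print(proxies)
```

In the question's `process` function, the result of `fetch_text(url, proxies)` would take the place of the text read from `urllib.urlopen(url)`.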
blueteeth
1

The urllib library you are using for HTTP access does not support proxy authentication (it does support un-authenticated proxies). From the docs:

Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

I suggest you switch to urllib2 and use it as demonstrated in the answer to this post.
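
As a rough illustration of the proxy-handler approach, here is a minimal sketch. It assumes Python 3, where the contents of `urllib2` moved to `urllib.request`; under Python 2 the same class names (`ProxyHandler`, `build_opener`) live in `urllib2`. The helper names `build_proxy_opener` and `fetch` are made up for this example, and the proxy address is the one the asker gave in the comments, assumed here to need no authentication.

```python
import urllib.request

# Proxy address taken from the asker's comment; adjust for your network.
PROXY = "http://proxy4.nehu.ac.in:3128"


def build_proxy_opener(proxy_url=PROXY):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener=None):
    """Read the raw bytes of url through the proxy-aware opener."""
    opener = opener or build_proxy_opener()
    with opener.open(url, timeout=10) as page:
        return page.read()
```

The bytes returned by `fetch(url)` can be passed to `BeautifulSoup` in place of the text read from `urllib.urlopen(url)` in the question. If the proxy does require a username and password, this sketch is not enough on its own.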

shevron
  • I am new to Python, so it is hard for me to implement this. Just as a head start, can you show me how I should put it in my program? – Shailang Kharsati Sep 22 '15 at 11:12
  • I have read in the Python documentation that there is a ProxyHandler in urllib2 that can handle proxies. How do I use it so that requests go through the proxy I use to connect to the internet? Please help. – Shailang Kharsati Sep 22 '15 at 11:30