How to access webpages using Python via a proxy

Question

I am writing a small program to fetch all hyperlinks from a webpage given its URL, but it seems the network I am on uses a proxy, and the program is not able to fetch the page. My code:

import sys
import urllib
import urlparse

from bs4 import BeautifulSoup


def process(url):
    # Fetch the page (Python 2 urllib)
    page = urllib.urlopen(url)
    text = page.read()
    page.close()
    soup = BeautifulSoup(text, 'html.parser')
    with open('s.txt', 'w') as file:
        for tag in soup.findAll('a', href=True):
            # Resolve relative links against the page URL
            tag['href'] = urlparse.urljoin(url, tag['href'])
            print tag['href']
            file.write('\n')
            file.write(tag['href'])


def main():
    if len(sys.argv) == 1:
        print 'No url !!'
        sys.exit(1)
    for url in sys.argv[1:]:
        process(url)


if __name__ == '__main__':
    main()

Based on your question your network may or may not have a proxy in use. Can you be a little more specific or just pass by your admins and ask? – frlan, Sep 22 '15 at 08:59
Yes, it has a proxy. I tried it at home and it worked fine, but when I took it to my department to show my teacher it didn't work. This is the error: `IOError: [Errno socket error] [Errno -2] Name or service not known` – Shailang Kharsati, Sep 22 '15 at 11:05
This is the proxy I use to connect: "proxy4.nehu.ac.in:3128". How do I put it in the code of my program? Please help, I am so stuck with it. – Shailang Kharsati, Sep 22 '15 at 11:22
OK, I will check on this and come back to you if I encounter a problem. At the moment I cannot test it, because I have to try it at the university itself; I don't have a proxy network to test on. Is that OK with you? – Shailang Kharsati, Sep 22 '15 at 11:56
You can easily set up a proxy on your own. E.g. squid is quite popular. – frlan, Sep 22 '15 at 11:57
OK, I didn't know. Thanks, I will surely take a look. Thanks for your time, cheers. – Shailang Kharsati, Sep 22 '15 at 12:01

blueteeth · Accepted Answer · 2021-02-21T02:31:39.487

3

You could use the requests module instead.

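import requests

url = 'http://www.example.com/'  # the page you want to fetch

# Keys are URL schemes. Without an 'https' entry, https:// URLs
# are not sent through this proxy.
proxies = {'http': 'http://host/', 'https': 'http://host/'}
# or if it requires authentication 'http://user:pass@host/' instead

r = requests.get(url, proxies=proxies)
text = r.text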
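
As a rough sketch of how the proxies mapping above could be built from a host and port, here is a small helper. The name `build_proxies` and its parameters are hypothetical (not part of requests); it only assembles the dictionary, assumes an HTTP proxy, and percent-encodes optional credentials so special characters do not break the URL.

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def build_proxies(host, port, user=None, password=None):
    """Return a scheme-to-proxy mapping suitable for requests' proxies= argument."""
    credentials = ''
    if user is not None:
        # Percent-encode so characters such as '@' or ':' cannot break the URL.
        credentials = '%s:%s@' % (quote(user, safe=''), quote(password or '', safe=''))
    proxy_url = 'http://%s%s:%d' % (credentials, host, port)
    # Same proxy for both schemes, so https:// pages are proxied too.
    return {'http': proxy_url, 'https': proxy_url}
```

It would then be used as `r = requests.get(url, proxies=build_proxies('proxy4.nehu.ac.in', 3128))`, with the host and port taken from the question's comments.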

edited Feb 21 '21 at 02:31

answered Sep 22 '15 at 11:49

blueteeth


Should I put it this way: `proxies = { 'http': 'http://proxya4.nehu.ac.in }`? – Shailang Kharsati Sep 22 '15 at 11:52
You need the port and closing quote. So it would be `proxies = { 'http': 'http://proxya4.nehu.ac.in:3128' }` – blueteeth Sep 22 '15 at 11:59
Can I come back to you later? I will try it first and let you know how it goes. I really want this to work, I'm crying inside so bad. – Shailang Kharsati Sep 22 '15 at 12:03
Hi, I tried your suggestion. I got 'response 200' when I print `r=requests.get("http://www.dota2.com",proxies=proxies)`. What does it mean? – Shailang Kharsati Sep 23 '15 at 09:04
200 is the status code for the response. It is saying the response was OK (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). To get the HTML from the page, you need to print `r.text`. – blueteeth Sep 29 '15 at 21:45
It's working, thanks a lot. Sorry for the late reply. – Shailang Kharsati Sep 30 '15 at 05:45
I thought this was working for me, but when I passed random values in proxies, data was still retrieved each time (as long as https was used). – ballade4op52 Aug 31 '16 at 20:51

score 1 · Answer 2 · edited May 23 '17 at 12:09

The urllib library you are using for HTTP access does not support proxy authentication (it does support un-authenticated proxies). From the docs:

Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

I suggest you switch to urllib2 and use it as demonstrated in the answer to this post.
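
For illustration, a minimal sketch of the urllib2 route, assuming the proxy uses HTTP Basic authentication when it asks for credentials. The helper name `build_proxy_opener` is hypothetical; the proxy address is the one given in the question's comments. The import fallback is only there so the same code also runs on Python 3, where these classes live in `urllib.request`.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3: same classes live here


def build_proxy_opener(proxy_url, user=None, password=None):
    """Return an opener that sends http and https requests through proxy_url."""
    handlers = [urllib2.ProxyHandler({'http': proxy_url, 'https': proxy_url})]
    if user is not None:
        # Answer the proxy's Basic auth challenge (407) with these credentials.
        password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        password_mgr.add_password(None, proxy_url, user, password)
        handlers.append(urllib2.ProxyBasicAuthHandler(password_mgr))
    return urllib2.build_opener(*handlers)


# Building the opener does not touch the network.
opener = build_proxy_opener('http://proxy4.nehu.ac.in:3128')
# page = opener.open(url)   # the actual request goes through the proxy
# text = page.read()
```

In the question's `process` function, `opener.open(url)` would take the place of `urllib.urlopen(url)`.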


answered Sep 22 '15 at 08:56

shevron


I am new to Python, so it's hard for me to implement. Just for a head start, can you show me how I should put it in my program? – Shailang Kharsati Sep 22 '15 at 11:12
I have read in the Python documentation that there is a ProxyHandler in urllib2 that can handle proxies. How do I set it up so that requests go through the proxy I use to connect to the internet? Please help. – Shailang Kharsati Sep 22 '15 at 11:30
