
I've written a script in Python using urllib.request that applies an HTTPS proxy. I've tried the approaches below, but each runs into issues such as `urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed>`. The script is supposed to grab the IP address from that site. The IP address used in the script is a placeholder. I've complied with the suggestion made here.

First attempt:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://whatismyipaddress.com/proxy-check'

headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
proxy_host = '60.191.11.246:3128'

req = urllib.request.Request(url,headers=headers)
req.set_proxy(proxy_host, 'https')
resp = urllib.request.urlopen(req).read()
soup = BeautifulSoup(resp,"html5lib")
ip_addr = soup.select_one("td:contains('IP')").find_next('td').text
print(ip_addr)

Another way (using os.environ):

import os

headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
proxy = '60.191.11.246:3128'

os.environ["https_proxy"] = f'http://{proxy}'
req = urllib.request.Request(url,headers=headers)
resp = urllib.request.urlopen(req).read()
soup = BeautifulSoup(resp,"html5lib")
ip_addr = soup.select_one("td:contains('IP')").find_next('td').text
print(ip_addr)

One more approach I've tried:

agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
proxy_host = '205.158.57.2:53281'
proxy = {'https': f'http://{proxy_host}'}

proxy_support = urllib.request.ProxyHandler(proxy)
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
opener.addheaders = [('User-agent', agent)]
res = opener.open(url).read()

soup = BeautifulSoup(res,"html5lib")
ip_addr = soup.select_one("td:contains('IP')").find_next('td').text
print(ip_addr)

How can I use an HTTPS proxy with urllib.request in the right way?

MITHU
  • Are you forced to use `urllib`? – AMC Jan 04 '20 at 22:39
  • Yes @AMC. I usually use proxies with requests. However, I got stuck when it came to implementing the same with urllib.request. – MITHU Jan 04 '20 at 22:59
  • Is that the full error message? Where does the error occur? It would probably be difficult to provide a reproducible example for this, eh? – AMC Jan 04 '20 at 23:06
  • I can't find the relevant docs, but I think you have to use 'http' instead of 'https' as the protocol in `.set_proxy()` (eg: `req.set_proxy(proxy_host, 'http')`). With 'http' I have about the same success as with the `requests` lib, but when I use 'https' I'm only getting HTTP and Connection errors. – t.m.adam Jan 06 '20 at 16:55
  • Thanks for your suggestion @t.m.adam. I got what you meant. However, you perhaps forgot to mention whether I should use an `http` proxy or an `https` proxy within the `proxy_host` variable. Btw, the way I used the proxy, as in `req.set_proxy(proxy_host, 'http')`, is what I saw in the linked post. I'm always ready to comply with any better way as long as it is within `urllib.request`. – MITHU Jan 06 '20 at 18:02
  • Does the problem persist if you change the protocol? This is how I tested: I took a bunch of free proxies and I tested each one, first with `requests` and then with your `urllib` code. It seems the response is more or less the same with both libraries, if the protocol is set to 'http' - I mean the second parameter in `.set_proxy()`. I didn't modify the format of the `proxy_host` string at all, and I think its format should be `host:port`, without any protocol. – t.m.adam Jan 06 '20 at 20:41
  • Hi @t.m.adam, what I meant is that some proxies support only the `http` protocol whereas others support both `http` and `https`. So, which type of proxy should I supply within the `proxy_host` variable? – MITHU Jan 07 '20 at 04:55
  • The `build_opener` approach is working with the `{'https': 'https://51.91.137.248:3128'}` proxy – Sers Jan 07 '20 at 15:52
  • To be clearer - [this](https://filebin.net/b9sjh9z75cr9sr4w) is what I meant @t.m.adam. – MITHU Jan 07 '20 at 15:54
  • It really is not working in my case @Sers. Did you change something before execution to make it work other than the proxy? – MITHU Jan 07 '20 at 16:10
  • So, I ran some tests with free SSL proxies from this list: https://www.sslproxies.org/. I couldn't make it work with the `.set_proxy()` method, but Sers's suggestion seems to work. Here: https://stackoverflow.com/a/36881923/7811673 you can find an example of `build_opener` with a `ProxyHandler`. – t.m.adam Jan 07 '20 at 16:15
  • @MITHU no, I did not. The problem is in the proxy you're using. Set the proxy (118.70.144.77:3128) manually in your browser and you'll see a security issue on whatismyipaddress.com/proxy-check and Google. This is the reason for the response error you can get on some sites. But it does work; I tried with `url = 'https://stackoverflow.com/'` – Sers Jan 07 '20 at 16:45
  • Yep, I came across that answer as well. I appended the rest of the portion to add the headers @t.m.adam. Am I right with the way I defined headers in there? Thanks. – MITHU Jan 07 '20 at 16:50
  • This very site `https://stackoverflow.com/` works weirdly when there is any proxy involved @Sers. I'm using rotation of proxies where there are around 100 proxies, so I should not encounter that security issue. – MITHU Jan 07 '20 at 16:51
  • Yes, you can add headers in the opener. BTW you can test your headers and other parts of the request with https://httpbin.org/anything – t.m.adam Jan 07 '20 at 17:37
  • @MITHU it depends on the sites you visit. Do you want to check whether a proxy is working before using it? – Sers Jan 07 '20 at 17:46
  • Finally it appears to be working @t.m.adam. Thanks Sers. Btw, is `build_opener()` the only way along with `{'https': 'https://51.91.137.248:3128'}`? – MITHU Jan 07 '20 at 18:02
  • I'm not sure, but I couldn't make `.set_proxy()` work with SSL proxies. – t.m.adam Jan 07 '20 at 18:16
  • As I mentioned before, `118.70.144.77` and `51.91.137.248` had a security issue with `https://whatismyipaddress.com/proxy-check`, but now they do not. The problem was too many requests from the same IP to Google's servers, and `https://whatismyipaddress.com/proxy-check` uses Google's captcha. This was the reason for the error on `whatismyipaddress`, but it does not affect other sites. You can choose @t.m.adam's answer or mine as the accepted answer if you think the issue is solved. – Sers Jan 07 '20 at 18:40
  • @Sers It was your idea to use an opener, I just ran some tests. If one of us should take the credit it is you. – t.m.adam Jan 07 '20 at 18:50

1 Answer


While we were testing the proxies, there was unusual traffic from your network to Google's services, and that was the reason for the error response, because whatismyipaddress.com uses Google's services. The issue did not affect other sites such as stackoverflow.com.

from urllib import request
from bs4 import BeautifulSoup

url = 'https://whatismyipaddress.com/proxy-check'

proxies = {
    # 'https': 'https://167.172.229.86:8080',
    # 'https': 'https://51.91.137.248:3128',
    'https': 'https://118.70.144.77:3128',
}

user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
headers = {
    'User-Agent': user_agent,
    'accept-language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7'
}

proxy_support = request.ProxyHandler(proxies)
opener = request.build_opener(proxy_support)
# opener.addheaders = [('User-Agent', user_agent)]
request.install_opener(opener)

req = request.Request(url, headers=headers)
try:
    response = request.urlopen(req).read()
    soup = BeautifulSoup(response, "html5lib")
    ip_addr = soup.select_one("td:contains('IP')").find_next('td').text
    print(ip_addr)
except Exception as e:
    print(e)
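As suggested in the comments, you can sanity-check the opener against https://httpbin.org/anything, which echoes the request back as JSON: the `origin` field shows the IP the target site actually sees, and `headers` shows what was sent. This is a minimal sketch along those lines; the proxy address is a placeholder (free proxies die quickly), so replace it with a live one before running:

```python
import json
from urllib import request

def make_opener(proxies):
    # Route matching URL schemes through the given proxy mapping.
    return request.build_opener(request.ProxyHandler(proxies))

def echo_request(opener, timeout=10):
    # httpbin echoes the request back; 'origin' is the IP it saw.
    req = request.Request(
        'https://httpbin.org/anything',
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'},
    )
    with opener.open(req, timeout=timeout) as resp:
        return json.loads(resp.read())

# Usage (needs a live proxy):
# opener = make_opener({'https': 'https://118.70.144.77:3128'})
# data = echo_request(opener)
# print(data['origin'], data['headers'].get('User-Agent'))
```

If `origin` matches the proxy's IP rather than your own, the tunnel is working end to end.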
Sers
  • Thanks for your answer @Sers. Let us wait while the bounty is on, if anything new comes along. – MITHU Jan 08 '20 at 06:25
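A side note on the proxy rotation mentioned in the comments: since free proxies die quickly, it can help to weed out dead entries before handing them to urllib. The sketch below only checks that the proxy's port accepts a TCP connection; it does not verify that the proxy actually tunnels HTTPS, so treat it as a cheap first-pass filter:

```python
import socket

def proxy_alive(proxy_host, timeout=5):
    # Cheap liveness check: can we open a TCP connection to host:port?
    host, _, port = proxy_host.partition(':')
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

# Example: keep only the proxies in a rotation list that at least answer.
# live = [p for p in proxy_list if proxy_alive(p)]
```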