
I am trying to scrape a website using requests in Python.

url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
# set the headers like we are a browser,
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers )

This works fine when I use my personal Wi-Fi. However, when I connect to my company's VPN, I get the following error.

ConnectionError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/23013220/max-retries-exceeded-with-url (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x...>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))

Now, I need this to work over my company's VPN, because I need to access a website that is reachable only on that network. How can I resolve this?

– user1690356
4 Answers


In my case, the problem was related to IPv6.

Our VPN used split tunneling, and it seems the VPN configuration does not support IPv6.

So for example this would hang forever:

requests.get('https://pokeapi.co/api/v2/pokemon')

But if you add a timeout, the request succeeds:

requests.get('https://pokeapi.co/api/v2/pokemon', timeout=1)

But not all machines had this problem, so I compared the output of this snippet on two different machines:

import socket

# print every address (IPv4 and IPv6) the hostname resolves to
for line in socket.getaddrinfo('pokeapi.co', 443):
    print(line)

The working one only returned IPv4 addresses. The non-working machine returned both IPv4 and IPv6 addresses.

So with the timeout specified, my theory is that Python fails quickly over IPv6 and then falls back to IPv4, where the request succeeds.
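To test that theory directly, here is a small probe (a sketch; the host and the 1-second timeout are just illustrative) that tries to open a TCP connection to each resolved address and reports which ones are reachable:

import socket

for family, type_, proto, _, sockaddr in socket.getaddrinfo('pokeapi.co', 443, proto=socket.IPPROTO_TCP):
    probe = socket.socket(family, type_, proto)
    probe.settimeout(1)  # fail fast instead of hanging on an unreachable route
    try:
        probe.connect(sockaddr)
        print(family.name, sockaddr, '-> reachable')
    except OSError as exc:
        print(family.name, sockaddr, '-> failed:', exc)
    finally:
        probe.close()

On a machine with the broken IPv6 route, the AF_INET6 entries should time out while the AF_INET entries connect.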

Ultimately we resolved this by disabling IPv6 on the machine (macOS):

networksetup -setv6off "Wi-Fi"

But I assume that this could instead be resolved through VPN configuration.
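If changing the machine or VPN configuration is not an option, a narrower workaround is to force IPv4 at the Python level. This sketch monkey-patches a urllib3 internal (allowed_gai_family), not an official API, so it may break with future urllib3 versions:

import socket

import requests
import urllib3.util.connection

# make urllib3 resolve only IPv4 addresses, so the hanging IPv6
# connection attempts described above are never tried
urllib3.util.connection.allowed_gai_family = lambda: socket.AF_INET

response = requests.get('https://pokeapi.co/api/v2/pokemon', timeout=5)
print(response.status_code)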

– dstandish

How about trying it like this:

import requests
from fake_useragent import UserAgent

url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
ua = UserAgent()
headers = {"User-Agent": ua.random}  # use a randomly chosen browser User-Agent

# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)

The error seems to be caused by a difference in User-Agent settings.

– jihan1008

Try setting trust_env to None:

s = requests.Session()
s.trust_env = None  # Trust environment settings for proxy configuration, default authentication and similar.

Or you can disable proxies for a particular domain:

import os
os.environ['NO_PROXY'] = 'stackoverflow.com'  # bypass any proxy for this host
– KC.
  • @user1690356 Can you access other websites (over HTTPS) with your code? – KC. Oct 10 '18 at 13:51
  • @user1690356 Try setting the header {'Connection': 'close'}, or set requests.adapters.DEFAULT_RETRIES = 5. Accessing the site too frequently is another reason this error occurs. – KC. Oct 10 '18 at 13:58
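For the retry suggestion in the last comment, a per-session setup (a sketch; the backoff factor and status codes are illustrative) gives finer control than the global requests.adapters.DEFAULT_RETRIES:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry failed requests a few times with exponential backoff
retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

response = session.get('https://stackoverflow.com', headers={'Connection': 'close'}, timeout=10)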

In my organization, I have to run my program under a VPN for different geo locations, so we have multiple proxy configurations.

I found it simpler to use a package called PyPAC to pick up my proxy details automatically:

from pypac import PACSession
from requests.auth import HTTPProxyAuth

session = PACSession()
# when a username and password are required:
# session = PACSession(proxy_auth=HTTPProxyAuth(name, password))

r = session.get('http://example.org')

How does this work?

The package locates the PAC (proxy auto-config) file configured by the organization. This file consists of the proxy configuration details.
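To see what the PAC file decides for a given URL, a sketch like this (assuming pypac's get_pac helper can discover a PAC file on your network) can help with debugging:

from pypac import get_pac

pac = get_pac()  # discovers the organization's PAC file; returns None if none is found
if pac:
    # ask the PAC file which proxy (if any) it would use for this URL
    print(pac.find_proxy_for_url('https://stackoverflow.com', 'stackoverflow.com'))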

– naxfury