
I am trying to screen-scrape PyPI packages using the requests library and Beautiful Soup, but am met with an indefinite hang. I am able to retrieve HTML from a number of sites with:

import requests

session = requests.Session()
session.trust_env = False   # ignore any proxy settings from the environment
response = session.get("http://google.com")
print(response.status_code)

i.e. without providing headers. I read from Python request.get fails to get an answer for a url I can open on my browser that the indefinite hang is likely caused by incorrect headers. So, using the developer tools in Edge, I grabbed my request headers from the Network tab (with the "Doc" filter selected) for the pypi.org request/response, and copy-pasted them into the headers dictionary that is passed to the get method:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'session_id=<long string>',
    'dnt': '1',
    'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Microsoft Edge";v="108"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.54',
}

(and changing the get call to response = session.get("http://pypi.org", headers=headers))

But I get the same hang. So I think something is wrong with my headers, but I'm not sure what. I'm aware that the requests Session() "handles" cookies, so I tried removing the cookie key/value pair from my request header dictionary, but got the same result.
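For reference, a minimal sketch of what I mean by the Session handling cookies (using httpbin.org purely as an illustrative endpoint, not the site I'm scraping):

import requests

session = requests.Session()
session.trust_env = False
# The server sets a cookie; the Session stores it automatically...
session.get("https://httpbin.org/cookies/set?k=v", timeout=10)
print(session.cookies.get_dict())   # {'k': 'v'}
# ...and sends it back on later requests, which is why a hand-copied
# 'cookie' header should be unnecessary.
print(session.get("https://httpbin.org/cookies", timeout=10).text)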

How can I determine the problem with my headers and/or why do my current headers not work (assuming this is even the problem)?

Sterling Butters
  • I don't think `pypi.org` blocks clients based on user agent or other cookies; at least this is not reproducible for me. Try to use something like [Wireshark](https://www.wireshark.org/) to investigate what's happening when you make a request from Python. – Vader Jan 24 '23 at 23:11
  • @Vader You mean you are able to access PyPI html from python? I will not be able to use Wireshark on my system since I don't have software install rights – Sterling Butters Jan 25 '23 at 15:29
  • Yes, I'm able to download html content from PyPI. Since you're not able to install software on your machine I'd assume that you're running this code in quite restricted environment where you might also have antiviruses, corporate proxies, etc. and they are likely a source of the problem – Vader Jan 25 '23 at 16:03
  • @Vader I do have a corporate proxy that I have been able to "bypass" in the past with `session = requests.Session()` `session.trust_env = False`. How can I confirm that the proxy is indeed the issue? – Sterling Butters Jan 25 '23 at 18:43
  • By setting this flag you might bypass the proxy, but it doesn't mean that you have access to the internet without the proxy, since direct access might be blocked by your corporate firewall. Btw, why do you want to bypass it? – Vader Jan 25 '23 at 22:08
  • This is amazing: you can see it with your browser, but the *identical* request from Python fails. According to the UA you're using Windows. Can you use PowerShell? (Simply search for `powershell` and hit `Enter`, or WinKey+R, `powershell`.) What happens when you run `Invoke-WebRequest http://pypi.org`? – Yarin_007 Jan 25 '23 at 23:57
  • I am able to retrieve pypi without headers. Where are you located? Are you just trying to get the main page? According to the robots.txt, subpages are not allowed. Also, have you tried `response = session.get("https://pypi.org")` instead of http? Maybe you have a config that disallows redirects. – Lukas Hestermeyer Jan 26 '23 at 10:58
  • @Yarin_007 I actually get an output `StatusCode : 200 StatusDescription : OK Content :` – Sterling Butters Jan 26 '23 at 23:33
  • @LukasHestermeyer I have also tried without headers, to no avail. I have tried `https` instead of `http` (in fact I think I started with `https`, since that's what my browser was using). – Sterling Butters Jan 26 '23 at 23:34
  • Absolutely remarkable. Any luck with [urllib3](https://urllib3.readthedocs.io/en/stable/)? – Yarin_007 Jan 27 '23 at 01:51
  • @Vader see I wouldn't disagree with that except I can access the site from my browser fine AND I can access other sites (only when I bypass the proxy though) with the python request – Sterling Butters Jan 27 '23 at 05:22
  • @SterlingButters what version of python and requests are you using? – Lukas Hestermeyer Jan 27 '23 at 09:56
  • I would suggest to create a new venv or conda env or whatever you are using, to avoid that we have a side effect from other packages. – Lukas Hestermeyer Jan 27 '23 at 09:57
  • @LukasHestermeyer Python 3.9.12; Requests 2.27.1 – Sterling Butters Jan 27 '23 at 16:25
  • @Yarin_007 Same hang with `urllib3` – Sterling Butters Jan 27 '23 at 16:28

4 Answers


HTTP headers are a possible issue, but not a likely one. A more probable cause is a proxy or firewall. I'll start by recapping the information I think is relevant from the comments:

  • You are using a system, on which you do not have admin privileges.
  • The system is configured to use a corporate proxy server.
  • http://pypi.org works from your browser.
  • http://pypi.org works from a PowerShell on your system.
  • http://pypi.org hangs with your python code.
  • Your system is running Windows. (probably irrelevant, but might be worth noting)

As both your browser and PowerShell seem to work fine, and assuming you didn't change their settings, why are you trying to circumvent the proxy using Python? (@Vader asked this in the comments; I didn't see a relevant response.)
If circumventing the proxy is material to your goal, skip to the next section. If it isn't, since other programs seem to work fine, I suggest first trying with the proxy, using the system's original configuration:

  1. Remove the session.trust_env = False statement from the code (a minimal sketch follows this list).
  2. Test the code now. If it works, our job is done. Otherwise, keep reading.
  3. Revert all system changes you've made trying to make it work.
  4. Reboot your system.
    I myself hate it when someone suggests that to me, but I have found two good reasons to do it: first, there might be something stuck in the O/S which a reboot will release; second, I might not remember all the things I tinkered with, and a reboot might revert them for me.
  5. Test again. Test the script, with a browser, and with PowerShell (as per @Yarin_007's comment).
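As a reference for step 1, roughly what the test looks like with the proxy settings left in effect (trust_env defaults to True); the explicit timeout is my own addition, so a hang turns into an exception instead:

import requests

# No trust_env override: environment/system proxy settings are honored.
session = requests.Session()
response = session.get("https://pypi.org", timeout=(4, 7))
print(response.status_code)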

If the script still hangs on requests to pypi, further analysis is required. In order to narrow down the options, I suggest the following:

  1. Disable redirects by setting allow_redirects=False. While requests should raise a TooManyRedirects exception if there is a redirect loop, this would help identify a case where a redirect target is hanging. pypi should redirect http to https regardless of user-agent, or most other headers, which makes for a consistent, reliable request, limiting other possible factors.
  2. Set a request timeout. The type of exception raised on timeout expiration can help identify the cause.

The following code provides a good example. For your code, don't use the port numbers; the defaults should work. I added the port numbers explicitly, as each one demonstrates a different possible scenario:

#!/usr/bin/env python
import socket
import timeit
import requests

TIMEOUT = (4, 7)    # ConnectT/O (per-IP), ReadT/O

def get_url(url, timeout=TIMEOUT):
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=False)
        print(f"Status code: {response.status_code}", end="")
        if response.status_code in (301, 302):
            print(f", Location: {response.headers.get('location')}", end="")
        print(".")
    except Exception as e:
        print(f"Exception caught: {e!r}")
    finally:
        print(f"Fetching url '{url}' done", end="")

def time_url(url):
    print(f"Trying url '{url}'")
    total = timeit.timeit(f"get_url('{url}')", number=1, globals=globals())
    print(f" in: {str(total)[:4]} seconds")
    print("=============")

def print_expected_conntimeout(server):
    r = socket.getaddrinfo(server, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
    print(f"IP addresses of {server}:\n" + "\n".join(addr[-1][0] for addr in r))
    print(f"Got {len(r)} addresses, so expecting a a total ConnectTimeout of {len(r) * TIMEOUT[0]}")

def main():
    scheme = "http://"
    server = "pypi.org"
    uri = f"{scheme}{server}:{{port}}".format

    print_expected_conntimeout(server)
    # OK/redirect (301)
    time_url(uri(port=80))
    # READ TIMEOUT after 7s
    time_url(uri(port=8080))
    # CONNECTION TIMEOUT after 4 * ip_addresses
    time_url(uri(port=8082))
    # REJECT
    time_url('http://localhost:80')

if __name__ == "__main__":
    main()

For me, this outputs:

$ ./testnet.py
IP addresses of pypi.org:
151.101.128.223
151.101.0.223
151.101.64.223
151.101.192.223
Got 4 addresses, so expecting a total ConnectTimeout of 16
Trying url 'http://pypi.org:80'
Status code: 301, Location: https://pypi.org/.
Fetching url 'http://pypi.org:80' done in: 0.66 seconds
=============
Trying url 'http://pypi.org:8080'
Exception caught: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='pypi.org', port=8080): Read timed out. (read timeout=7)"))
Fetching url 'http://pypi.org:8080' done in: 7.21 seconds
=============
Trying url 'http://pypi.org:8082'
Exception caught: ConnectTimeout(MaxRetryError("HTTPConnectionPool(host='pypi.org', port=8082): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x103ec4730>, 'Connection to pypi.org timed out. (connect timeout=4)'))"))
Fetching url 'http://pypi.org:8082' done in: 16.0 seconds
=============
Trying url 'http://localhost:80'
Exception caught: ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x103ec44c0>: Failed to establish a new connection: [Errno 61] Connection refused'))"))
Fetching url 'http://localhost:80' done in: 0.00 seconds
=============

Now to explain the four cases:

  1. A successful request to http://pypi.org returns a 301 redirect - to use https.
    This is what you should get. If this is what you do get after adding allow_redirects=False, then the prime suspect is the redirect chain, and I suggest similarly checking each location header's value for every redirect response you receive, until you find the URL that hangs (see the sketch after this list).
  2. Connection to port 8080 is successful (a completed 3-way handshake), but the server does not return a proper response and "hangs"; requests raises a ReadTimeout exception.
    If your script raises this exception, it is likely that you are connecting to some sort of proxy which does not properly relay (or actively blocks) the request or the response. There might be some system setting other than trust_env controlling this, or some appliance attached to the network's infrastructure.
  3. Connection to port 8082 is not successful; a 3-way handshake could not be established, and requests raises a ConnectTimeout exception. Note that a connection is attempted to each IP address found, so the timeout of 4 seconds is multiplied by the number of addresses overall.
    If this is what you see, it is likely that there is some firewall between your machine and pypi, which either prevents your SYN packets from getting to their destination, or prevents the SYN+ACK packet from getting back from the server to your machine.
  4. The fourth case is provided as an example which I don't believe you'll encounter, but in case you do, it is worth explaining. Here, the SYN packet either reached a server which does not listen on the desired port (which would be weird, possibly meaning you didn't really reach pypi), or a firewall REJECTed your SYN packet (vs. simply DROPping it).
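For case 1, a minimal sketch of walking the redirect chain by hand; the hop cap and timeout values are arbitrary choices of mine:

import requests

url = "http://pypi.org"
for hop in range(10):                 # cap the hops to avoid a redirect loop
    response = requests.get(url, timeout=(4, 7), allow_redirects=False)
    print(f"{url} -> {response.status_code}")
    if response.status_code not in (301, 302, 303, 307, 308):
        break
    # Location may be relative, so resolve it against the current URL.
    url = requests.compat.urljoin(url, response.headers["location"])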

Another thing worth paying attention to is pypi's IP addresses, as they are printed by the provided script. While IPv4 addresses are not guaranteed to keep their assignment, in this case if you find they are significantly different, that would suggest you are not actually connecting to the real pypi servers, so the responses are unpredictable (including hangs). Following are pypi's IPv4 and IPv6 addresses:

pypi.org has address 151.101.0.223
pypi.org has address 151.101.64.223
pypi.org has address 151.101.128.223
pypi.org has address 151.101.192.223
pypi.org has IPv6 address 2a04:4e42::223
pypi.org has IPv6 address 2a04:4e42:200::223
pypi.org has IPv6 address 2a04:4e42:400::223
pypi.org has IPv6 address 2a04:4e42:600::223

Finally, as we've touched on the different IP protocol versions: it is also possible that, when initiating a connection, your system attempts to use a protocol which has a faulty route to the destination (e.g. trying IPv6, but one of the gateways mishandles that traffic). Usually a router would reply with an ICMP failure message, but I've seen cases where that doesn't happen (or isn't properly relayed back). I wasn't able to determine the root cause, as the route was out of my control, but forcing a specific protocol solved that specific issue for me.
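If you want to try forcing IPv4, one common trick is monkeypatching urllib3's allowed_gai_family (this assumes your requests version resolves addresses through urllib3 this way, as current versions do):

import socket
import requests
import urllib3.util.connection as urllib3_cn

def allowed_gai_family():
    # Restrict name resolution to IPv4; return socket.AF_INET6 to force IPv6.
    return socket.AF_INET

urllib3_cn.allowed_gai_family = allowed_gai_family

response = requests.get("https://pypi.org", timeout=(4, 7))
print(response.status_code)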

Hoping this provides some good debugging vectors. If this helps, please add a comment, as I'm curious what you find.

micromoses
  • Thank you for such a detailed answer! In fact, I do get a 301 response. I'm trying now to understand the redirect chain. If I simply `allow_redirects` (=True), then I get a `ProxyError(MaxRetryError("HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url`. I.e. I was expecting to be able to view at least the first iteration in the chain but that appears to be 'http://pypi.org' itself. Regarding the `trust_env`, that was just a hack that had worked for me in the past. I think I tried forcing ipv4 in the past but it's possible I did something wrong. – Sterling Butters Jan 30 '23 at 16:01
  • Noooooooooo! I meant to award my bounty to your answer! – Sterling Butters Jan 30 '23 at 16:03
  • I guess I can start a new bounty and award you 200. Can you upvote my question to help me with the lost rep? I'll award you in 23 hours. Would still appreciate some help to actually figure out how to get the PyPI html (so that I can accept your answer) – Sterling Butters Jan 30 '23 at 16:12

I tried sending a simple HTTP request to see if this server requires any headers for a normal response.

So I opened a TCP socket and connected to the PyPI server, to see how requests would be handled by the server without the intervention of frameworks. In addition, we wrap that socket with the ssl library to send encrypted traffic (HTTPS):

import socket
import ssl

hostname = 'pypi.org'
context = ssl.create_default_context()

# A minimal HTTP/1.1 request; Host is the only mandatory header.
payload = ("GET / HTTP/1.1\r\n"
           f"Host: {hostname}\r\n\r\n")
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        ssock.sendall(payload.encode())
        print(ssock.recv(40))

OUTPUT (only the first 40 bytes of the response, but we can see the status code, which is 200 OK):

b'HTTP/1.1 200 OK\r\nConnection: keep-alive\r'

As a result, we can conclude that the server answers a bare request just fine, so the headers are not the problem.

I recommend that you try this code.

  • If it works: upgrade the version of the requests library, then try again.
  • If it does not work: I'm guessing it's a network or SSL verification issue (a quick probe of the SSL part follows).
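A quick, insecure, test-only probe of the SSL hypothesis: if the following succeeds where the normal call hangs or fails, a TLS-intercepting proxy whose CA is not in your trust store is a likely culprit.

import requests
import urllib3

# Suppress the warning that verify=False triggers; never do this in production.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get("https://pypi.org", timeout=10, verify=False)
print(response.status_code)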
Karen Petrosyan

Got it!

I just had to set the proxies argument in the get method:

import requests

headers = {'User-Agent': 'Chrome'}

proxies = {
  'http': 'xxxxxx:80',
  'https': 'xxxxxx:80',
}

def get_url(url):
    try:
        response = requests.get(url, timeout=10, allow_redirects=True, headers=headers, proxies=proxies)
        print(response.headers)
        print(response.text)
        print(response.history)
        print(f"Status code: {response.status_code}")
        if response.status_code in (301, 302):
            print(f"Location: {response.headers.get('location')}")

    except Exception as e:
        print(f"Exception caught: {e!r}")
    finally:
        print(f"Fetching url '{url}' done")
        

url = "http://pypi.org"
get_url(url)
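For what it's worth, requests also honors the standard HTTP_PROXY/HTTPS_PROXY environment variables when trust_env is left at its default of True, so an equivalent sketch (keeping the redacted placeholder from above) would be:

import os
import requests

# Same redacted proxy placeholder as above.
os.environ["HTTP_PROXY"] = "http://xxxxxx:80"
os.environ["HTTPS_PROXY"] = "http://xxxxxx:80"

response = requests.get("http://pypi.org", timeout=10)   # trust_env is True by default
print(response.status_code)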
Sterling Butters

Are you sure? Only the homepage of PyPI raises that error, and you cannot scrape it in any case. Do you have a firewall, or an HTTPS or SOCKS proxy?
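If a SOCKS proxy is in play, note that requests can go through one via the optional requests[socks] extra (PySocks); a minimal sketch with a placeholder host:

import requests

# Placeholder host/port; requires: pip install requests[socks]
proxies = {
    "http": "socks5://proxy.example:1080",
    "https": "socks5://proxy.example:1080",
}
response = requests.get("https://pypi.org", proxies=proxies, timeout=10)
print(response.status_code)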

I have taken the URL for all Python 3 packages, and this code works just fine:

import requests
from bs4 import BeautifulSoup

hdr = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
       "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
       "Accept-Language": "en-US,en;q=0.5"}
session = requests.Session()
url = "https://pypi.org/search/?c=Programming+Language+%3A%3A+Python+%3A%3A+3"  # all Python 3 libraries
response = session.get(url, headers=hdr, allow_redirects=True)
print(response.status_code)

> 200

Just to make sure, let's scrape the package names to verify:

soup = BeautifulSoup(response.content, "lxml")
pkgs = soup.find_all('span', attrs={'class': 'package-snippet__name'})
for pkg in pkgs:
    print(pkg.text)
>

yingyu-yueyueyue-201812-201909
github-actions-cicd-example
xurl
unkey
fluvio
LogicCircuit
knarrow
riyu-zhuanye-kaoyan-202203-202206
permutation
aliases
sangsangjun-202011-202101
resultify
subnuker
keke-yingyu-202101-202104
xuezhaofeng-beida-jingjixue
jingtong-jiaoben-heike
mrbenn-toolbar-plugin
liuwei-yasi-pindao-201811-201908
mypy-boto3-service-quotas
trender

The names of all Python 3 packages on page 1.
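To go beyond page 1, the search endpoint appears to accept a page query parameter (an assumption based on the site's pagination links); a sketch continuing the idea above:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = "https://pypi.org/search/?c=Programming+Language+%3A%3A+Python+%3A%3A+3"
for page in range(1, 4):   # first three result pages
    response = session.get(url, params={"page": page}, allow_redirects=True)
    soup = BeautifulSoup(response.content, "lxml")
    for pkg in soup.find_all('span', attrs={'class': 'package-snippet__name'}):
        print(pkg.text)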

geekay