
I'm running a script to verify a batch of links from a database. Some of the reported failures are actually valid links, but I can't work out why they fail or whether there's a way to get them to work.

I'm new to this and have tried a few things: different headers, no headers, a longer timeout, no timeout. I'm not sure what to try next. I'm running this on a Windows 10 machine behind a proxy server; the proxy has been set up in the user settings file.
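
One thing I'm not sure about is whether requests actually reads the proxy from that settings file (as far as I can tell it only looks at the HTTP_PROXY/HTTPS_PROXY environment variables and the Windows system proxy settings). Would passing the proxy explicitly make a difference? Something like this rough sketch, where the proxy address is just a placeholder, not my real proxy:

# Rough sketch: pass the proxy to requests explicitly instead of relying on a settings file.
# The proxy address below is only a placeholder.
import requests

proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}
response = requests.get('https://www.abs.gov.au/websitedbs/D3310114.nsf/home/census',
                        proxies=proxies, timeout=30, verify=False)
print(response.status_code)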

Here's some test code, the first URL fails, the second works.

# For handling the requests to the webpages
import requests
from requests_negotiate_sspi import HttpNegotiateAuth


# Test results, 1 record per URL to test
w = open(r'C:\Temp\URL_Test_Results.txt', 'w')

# For errors only
err = open(r'C:\Temp\URL_Test_Error_Log.txt', 'w')

print('Starting process')


def test_url(url):
    # Test the URL and write the results out to the log files.

    # Suppress the urllib3 warnings: with verify=False the website certificates are not checked,
    # so every request raises an InsecureRequestWarning and the results could be "bad".
    # Without this, the main site throws a warning into the log for every test.
    requests.packages.urllib3.disable_warnings()
    headers={'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
    print('Testing ' + url)
    # Try the website link, check for errors.
    try:
        response = requests.get(url, auth=HttpNegotiateAuth(), verify=False, headers=headers, timeout=5)
    except requests.exceptions.HTTPError as e:
        print('HTTP Error')
        print(e)
        w.write('HTTP Error, check error log' + '\n')
        err.write('HTTP Error' + '\n' + url + '\n' + str(e) + '\n' + '***********' + '\n' + '\n')
    except requests.exceptions.ConnectionError as e:
        # some external sites come through this, even though the links work through the browser
        # I suspect that there's some blocking in place to prevent scraping...
        # I could probably work around this somehow.
        print('Connection error')
        print(e)
        w.write('Connection error, check error log' + '\n')
        err.write('Connection Error' + '\n' + url + '\n' + str(e) + '\n' + '***********' + '\n' + '\n')
    except requests.exceptions.RequestException as e:
        # Any other error types
        print('Other error')
        print(e)
        w.write('Unknown Error' + '\n')
        err.write('Unknown Error' + '\n' + url + '\n' + str(e) + '\n' + '***********' + '\n' + '\n')
    else:
        # Note that a 404 is still 'successful' here: we got a valid response back, so it falls
        # through to this branch rather than one of the exceptions above (see the status sketch
        # after the script). The response from the try block is reused here.
        print(response.status_code)
        w.write(str(response.status_code) + '\n')
        print('Success! Response code:', response.status_code)
    print('========================')


test_url('https://www.abs.gov.au/websitedbs/D3310114.nsf/home/census')
test_url('https://www.statista.com/')

print('Done!')
w.close()
err.close()
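
On the status sketch mentioned in the comment above: since a 404 goes through the success path, I'm thinking of flagging non-2xx codes explicitly at some point. A standalone sketch of that idea (response.ok is False for 4xx/5xx codes):

# Standalone sketch: flag anything with a status code of 400 or above as a failure,
# rather than writing the raw status code out as a 'success'.
import requests

def check_status(url):
    resp = requests.get(url, timeout=5)
    if resp.ok:   # .ok is True for status codes below 400
        print(url, 'OK', resp.status_code)
    else:
        print(url, 'Failed', resp.status_code)

check_status('https://www.statista.com/')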

The first URL fails to connect and the second one returns a 200 code, yet both links work for me through a browser.
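
One thing I haven't been able to rule out is whether the script is going through the same proxy as the browser. Is there a way to check that? The best I can think of is printing what requests/urllib resolve as the proxy settings, something like:

# Sanity check (sketch): print the proxy settings that requests will actually use.
# On Windows, urllib's getproxies() reads the environment variables and the registry.
import urllib.request
import requests.utils

print(urllib.request.getproxies())
print(requests.utils.get_environ_proxies('https://www.abs.gov.au/websitedbs/D3310114.nsf/home/census'))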

My log looks like this:

Starting process
Testing https://www.abs.gov.au/websitedbs/D3310114.nsf/home/census
Connection error
HTTPSConnectionPool(host='www.abs.gov.au', port=443): Max retries exceeded with url: /websitedbs/D3310114.nsf/home/census (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x000001310B30B4E0>, 'Connection to www.abs.gov.au timed out. (connect timeout=5)'))
========================
Testing https://www.statista.com/
200
Success! Response code: 200
========================
Done!
  • Both work for me. Your IP address may have been blacklisted by the first site. – Selcuk Aug 29 '19 at 23:54
  • But why do they both work from my browser, but only 1 works using the code above? Thanks for the verification, it's good to know that the code CAN work... – 9Squirrels Aug 30 '19 at 00:30
  • What is the error code returned? It is possible that they don't like the `User-Agent`, expect a cookie, or something similar. A semi-decent WAF is expected now that we spent $400+ million for the census. – Selcuk Aug 30 '19 at 00:35
  • Sorry, I put that in a previous version of this question, forgot to add it here: HTTPSConnectionPool(host='www.abs.gov.au', port=443): Max retries exceeded with url: /websitedbs/D3310114.nsf/home/census (Caused by ConnectTimeoutError(, 'Connection to www.abs.gov.au timed out. (connect timeout=5)')) – 9Squirrels Aug 30 '19 at 00:37
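
If it does turn out to be a User-Agent or cookie issue as suggested in the comments, a rough sketch of what I could try: a Session (so any cookies the site sets are kept across requests) with more browser-like headers. The header values here are just illustrative:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',   # illustrative value only
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-AU,en;q=0.9',
})
resp = session.get('https://www.abs.gov.au/websitedbs/D3310114.nsf/home/census',
                   timeout=30, verify=False)
print(resp.status_code)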

0 Answers