I want to crawl a list of URLs taken from the Alexa Top 1 Million to check which websites still offer access via http:// and do not redirect to https://. If a page does not redirect to an https:// domain, its URL should be written to a CSV file.
The problem occurs when I add a larger batch of URLs. Then I get one of two errors:
- ssl.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1056
or
- requests.exceptions.ConnectionError: HTTPConnectionPool(host='17ok.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11001] getaddrinfo failed')
I have tried the approaches mentioned in the following threads and documentation:
- https://2.python-requests.org//en/latest/user/advanced/#ssl-cert-verification (Edit: the sample URL https://requestb.in currently returns a 404 error, so it probably no longer exists)
- Python Requests throwing SSLError
- Python Requests: NewConnectionError
- requests.exceptions.SSLError: HTTPSConnectionPool: (Caused by SSLError(SSLError(336445449, '[SSL] PEM lib (_ssl.c:3816)')))
and some other suggested solutions.
The option to set verify=False helps when using it for a few URLs, but not for a list of more than about 10 URLs; then the program breaks. I tried my program on a Windows 10 machine as well as on Ubuntu 16.04, and as expected it is the same issue on both. I also tried the option of using a Session and installed the certificate library (certifi) as suggested (see the sketch below).
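For clarity, this is roughly what I mean by the Session attempt (a simplified sketch, not my exact code; the single URL is just a placeholder):

import certifi
import requests

# Sketch of the Session variant: reuse one Session and point it at the certifi CA bundle
session = requests.Session()
session.verify = certifi.where()

r = session.get('http://www.example.com', stream=True)
print(r.url, r.status_code)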
If I just call three pages like 'http://www.example.com', 'https://www.github.com' and 'http://www.python.org', it is not a big deal and the suggested solutions work. The headache starts when I use a larger batch of URLs from the Alexa list.
Here is my code, which works when using it for only 3-4 URLs:
import requests
from requests.utils import urlparse
urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/']
with open('G:\\Request_HEADER_Suite/dummy/http.csv', 'w') as f:
    for url in urls:
        r = requests.get(url, stream=True, verify=False)
        parsed_url = urlparse(r.url)
        print("URL: ", url)
        print("Redirected to: ", r.url)
        print("Status Code: ", r.status_code)
        print("Scheme: ", parsed_url.scheme)
        if parsed_url.scheme == 'http':
            f.write(url + '\n')
I expect to crawl a list of at least 100 URLs. The code should write the URLs that are accessible via http:// and do not redirect to https:// into a CSV file (or a complementary database) and ignore all URLs that end up on https://.
Because it works for a few URLs, I would expect a stable approach for a larger scan as well.
But the two errors above arise and break the program. Is it worth trying a workaround using pytest? Any other suggestions? Thanks in advance.
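One direction I have considered, but not yet verified, is to wrap each request in try/except so that a single failing host does not abort the whole loop. Roughly like this (a sketch; the output path and the 10-second timeout are placeholders I picked, not values from my real setup):

import requests
from requests.utils import urlparse

# urls is the list shown above
with open('http.csv', 'w') as f:
    for url in urls:
        try:
            r = requests.get(url, stream=True, verify=False, timeout=10)
        except requests.exceptions.RequestException as e:
            # SSLError and ConnectionError both inherit from RequestException,
            # so the loop just skips the broken host and carries on
            print("Skipping", url, "->", e)
            continue
        if urlparse(r.url).scheme == 'http':
            f.write(url + '\n')

Would something like this be a reasonable way to make the scan survive a list of 100+ URLs, or is there a better pattern?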
EDIT: This is a list that will raise the errors. Just for clarification: this list comes from a study based on the Alexa Top 1 Million.
urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/',
        'http://51sole.com',
        'http://58.com',
        'http://9gag.com',
        'http://abs-cbn.com',
        'http://academia.edu',
        'http://accuweather.com',
        'http://addroplet.com',
        'http://addthis.com',
        'http://adf.ly',
        'http://adhoc2.net',
        'http://adobe.com',
        'http://1688.com',
        'http://17ok.com',
        'http://17track.net',
        'http://1and1.com',
        'http://1tv.ru',
        'http://2ch.net',
        'http://360.cn',
        'http://39.net',
        'http://4chan.org',
        'http://4pda.ru']
I double-checked: the last time, the errors started with the URL 17ok.com. But I have also tried different lists of URLs. Thanks for your support.
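For reference, the ConnectionError can be triggered in isolation with just that host, roughly like this (a sketch; the timeout value is arbitrary):

import requests

# Minimal reproduction attempt for the failing host
try:
    r = requests.get('http://17ok.com', timeout=10)
    print(r.status_code, r.url)
except requests.exceptions.ConnectionError as e:
    print("ConnectionError:", e)  # e.g. getaddrinfo failed, as in the traceback above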