I'm running into trouble scraping a website after they changed from http to https and don't know how to solve this. The website I'm trying to scrape from is https://www.boldsystems.org. Two days ago it was still http://www.boldsystems.org and my scraper worked perfectly.

Example code:

import requests
requests.get('https://www.boldsystems.org')

The error I get back:

Traceback (most recent call last):
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\contrib\pyopenssl.py", line 488, in wrap_socket
    cnx.do_handshake()
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\OpenSSL\SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\OpenSSL\SSL.py", line 1671, in _raise_ssl_error
    _raise_current_error()
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\OpenSSL\_util.py", line 54, in exception_from_error_queue
    raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 976, in _validate_conn
    conn.connect()
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connection.py", line 370, in connect
    ssl_context=context,
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\util\ssl_.py", line 377, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\contrib\pyopenssl.py", line 494, in wrap_socket
    raise ssl.SSLError("bad handshake: %r" % e)
ssl.SSLError: ("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 725, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\util\retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='boldsystems.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\dommi\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='boldsystems.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

I found some solutions suggesting disabling the verification like:

requests.get('https://boldsystems.org', verify=False)

But I think this is a bad way to do it since there is a reason for the SSL verification.

I already updated certifi, requests and urllib3. I also tried saving the SSL certificate to a .pem file and passing it to the request function, but I'm not sure what that actually does, and it did not help.

I can reproduce the issue on Windows and Ubuntu as well as on different computers, so I think the problem lies with the website I'm trying to request.

I'd really appreciate a solution to my problem or an explanation of what is happening here.

2 Answers

I tested boldsystems.org on my own system (Ubuntu 18.04, Python 3.6.9) and got identical results. Regular browsers work fine, though. SSL Labs' free ssltest tool reports that "This server's certificate chain is incomplete...".

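An incomplete certificate chain means that the server is not sending the intermediate certificates that link its own certificate to a trusted root. The browser probably has the missing intermediate cached, or fetches it on demand, and so works fine. Python's requests does neither: it verifies only against the certificates the server sends plus its own CA bundle, so verification fails.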

The fix is to give requests a certificate bundle to verify against (through the verify argument), so that it is able to evaluate the entire chain. Ugly, but it should work. You're going to need to download all the certificates in the chain, concatenate them into one .pem file and pass that file to requests. This is explained at https://blogs.gnome.org/danni/2015/11/26/using-an-ssl-intermediate-as-your-ca-cert-with-python-requests/.
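
As a rough illustration of the concatenate-and-pass step, here is a minimal sketch. It assumes the chain certificates have already been downloaded as PEM files; the file names in the usage comment and the helper names build_ca_bundle and fetch are hypothetical, not part of requests.

```python
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def build_ca_bundle(cert_paths, bundle_path):
    """Concatenate PEM certificate files into one bundle file."""
    parts = []
    for path in cert_paths:
        text = Path(path).read_text(encoding="utf-8").strip()
        # Cheap sanity check so a DER file or an HTML error page is not bundled.
        if PEM_HEADER not in text:
            raise ValueError(f"{path} does not look like a PEM certificate")
        parts.append(text)
    Path(bundle_path).write_text("\n".join(parts) + "\n", encoding="utf-8")
    return str(bundle_path)


def fetch(url, bundle_path):
    """GET url, verifying the server against the given bundle only."""
    # Imported here so build_ca_bundle stays usable without requests installed.
    import requests

    return requests.get(url, verify=bundle_path, timeout=30)


# Hypothetical usage (file names are assumptions):
# bundle = build_ca_bundle(["intermediate.pem", "root.pem"], "boldsystems_chain.pem")
# print(fetch("https://www.boldsystems.org", bundle).status_code)
```

Note that passing a path to verify replaces the default CA bundle for that request, so a bundle built this way should only be used for the host whose chain it contains.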

sevenr
  • Yes, this seems to work. Is it possible to ship the certificate bundle to the end user as a .pem data file and pass it with every request, or is the SSL certificate unique to each user? – Dominik Buchner Apr 17 '20 at 20:05
  • The SSL certificate (chain) is unique to a domain. It will remain the same for all users, so you can ship it. – sevenr Apr 17 '20 at 20:06
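
For the shipping question in the comments above, one possible arrangement is to place the .pem file inside the package and set it once on a requests.Session, so it does not have to be passed with every call. This is only a sketch: the file name boldsystems_chain.pem and the helpers shipped_bundle_path and make_session are hypothetical.

```python
from pathlib import Path

# Hypothetical name of the bundle file shipped inside the package.
BUNDLE_NAME = "boldsystems_chain.pem"


def shipped_bundle_path(package_dir, name=BUNDLE_NAME):
    """Return the path of the bundle stored in package_dir, or fail loudly."""
    path = Path(package_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"CA bundle not found: {path}")
    return str(path)


def make_session(bundle_path):
    """Create a session whose requests all verify against bundle_path."""
    # Imported here so the path helper works without requests installed.
    import requests

    session = requests.Session()
    session.verify = bundle_path  # applies to every request on this session
    return session


# Hypothetical usage from a module inside the package:
# session = make_session(shipped_bundle_path(Path(__file__).parent))
# response = session.get("https://www.boldsystems.org", timeout=30)
```

For this to work after a pip install, the .pem file has to be declared as package data so that it is copied along with the Python modules.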

I'm not quite sure why this is the case, but I had to manually add the certificate info to certifi's cacert.pem file to get it to work.

Follow the steps given in "Unable to get local issuer certificate when using requests in python".

Then it works with a 200:

>>> requests.get('https://www.boldsystems.org')
<Response [200]>
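
Editing certifi's cacert.pem in place is lost whenever certifi is upgraded. A variation, sketched here under assumptions, writes a separate file containing certifi's bundle plus the missing certificates and leaves the installed file untouched. The helper name write_merged_bundle and the file names in the usage comment are hypothetical; certifi.where() is the certifi call that returns the path of its bundle.

```python
from pathlib import Path


def write_merged_bundle(extra_cert_paths, output_path, base_bundle=None):
    """Write base bundle + extra PEM certificates to output_path."""
    if base_bundle is None:
        # Default to certifi's bundle; imported lazily so an explicit
        # base_bundle can be used without certifi installed.
        import certifi

        base_bundle = certifi.where()
    chunks = [Path(base_bundle).read_text(encoding="utf-8").rstrip()]
    for path in extra_cert_paths:
        chunks.append(Path(path).read_text(encoding="utf-8").strip())
    # A blank line between entries keeps the merged file readable.
    Path(output_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return str(output_path)


# Hypothetical usage (file names are assumptions):
# import requests
# merged = write_merged_bundle(["intermediate.pem"], "merged_cacert.pem")
# requests.get("https://www.boldsystems.org", verify=merged)
```

Because the merged file keeps all of certifi's roots, it can be used for other hosts as well, unlike a bundle that holds only one site's chain.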
csm10495
  • This works for me as well. The problem is that I ship the webscraper via pip to other users and would like to have a simpler workaround or an explanation or method to handle this issue with the installer. – Dominik Buchner Apr 17 '20 at 20:01