I am trying to download the BERT tokenizer from Hugging Face.
I am executing:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Error:
<Path>\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1663 resume_download=resume_download,
1664 local_files_only=local_files_only,
-> 1665 use_auth_token=use_auth_token,
1666 )
1667
<Path>\file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
1140 user_agent=user_agent,
1141 use_auth_token=use_auth_token,
-> 1142 local_files_only=local_files_only,
1143 )
1144 elif os.path.exists(url_or_filename):
<Path>\file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
1347 else:
1348 raise ValueError(
-> 1349 "Connection error, and we cannot find the requested files in the cached path."
1350 " Please try again or make sure your Internet connection is on."
1351 )
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
Based on a similar discussion on GitHub in Hugging Face's repo, I gather that the file the above call wants to download is: https://huggingface.co/bert-base-uncased/resolve/main/config.json
While I can open that JSON file without any problem in my browser, I cannot download it via requests. The error I get is:
>>> import requests as r
>>> r.get('https://huggingface.co/bert-base-uncased/resolve/main/config.json')
...
requests.exceptions.SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /bert-base-uncased/resolve/main/config.json (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
Examining the certificate of the page https://huggingface.co/bert-base-uncased/resolve/main/config.json, I see that it is signed by my IT department rather than by the standard root CA I would expect to find. Based on the discussion here, it seems plausible for an SSL proxy to do something like this.
My IT department's certificate is in the trusted authorities list, but requests does not seem to consult that list when verifying certificates.
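To make the mismatch concrete, here is a minimal diagnostic sketch that lists the CA locations Python may consult. It assumes that requests, unless told otherwise, verifies against the certifi bundle rather than the operating system's trust store, and that it honors the REQUESTS_CA_BUNDLE and CURL_CA_BUNDLE environment variables. The helper name describe_ca_sources is made up for illustration.

```python
import os
import ssl


def describe_ca_sources():
    """Collect the CA locations that Python and requests may consult."""
    paths = ssl.get_default_verify_paths()
    info = {
        # Defaults used by the standard-library ssl module (OpenSSL).
        "default_cafile": paths.cafile,
        "default_capath": paths.capath,
        # Environment variables that requests checks for a CA bundle.
        "REQUESTS_CA_BUNDLE": os.environ.get("REQUESTS_CA_BUNDLE"),
        "CURL_CA_BUNDLE": os.environ.get("CURL_CA_BUNDLE"),
        # Read by OpenSSL's default lookup, not by requests itself.
        "SSL_CERT_FILE": os.environ.get("SSL_CERT_FILE"),
    }
    try:
        # certifi ships the bundle that requests falls back to.
        import certifi
        info["certifi_bundle"] = certifi.where()
    except ImportError:
        info["certifi_bundle"] = None
    return info


if __name__ == "__main__":
    for name, value in describe_ca_sources().items():
        print(f"{name}: {value}")
```

Running this inside the same interpreter (or notebook kernel) that raises the SSLError shows whether the environment variable is actually visible to that process.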
Taking a cue from a Stack Overflow discussion on how to let requests trust a self-signed certificate, I have also tried appending the root certificate that appears for huggingface.co to cacert.pem (the file pointed to by curl-config --ca) and setting REQUESTS_CA_BUNDLE to the path of this PEM file:
export REQUESTS_CA_BUNDLE=/mnt/<path>/wsl-anaconda/ssl/cacert.pem
But it did not help at all.
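For reference, the bundle-building step described above can be sketched in Python as follows. This is only an illustration under assumptions: build_ca_bundle is a hypothetical helper, the file names in the usage note are placeholders, and the extra certificate is assumed to be PEM-encoded (a certificate exported from a browser as DER would need converting first).

```python
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def build_ca_bundle(base_bundle, extra_cert, out_path):
    """Write base_bundle followed by extra_cert to out_path; return out_path."""
    base = Path(base_bundle).read_text(encoding="utf-8")
    extra = Path(extra_cert).read_text(encoding="utf-8")
    # A PEM bundle is plain text; reject input that has no PEM header.
    if PEM_HEADER not in extra:
        raise ValueError(f"{extra_cert} does not look like a PEM certificate")
    # Keep the certificates on separate lines when concatenating.
    if base and not base.endswith("\n"):
        base += "\n"
    Path(out_path).write_text(base + extra, encoding="utf-8")
    return str(out_path)
```

The resulting file can then be passed to requests explicitly, for example r.get(url, verify='combined.pem'), which avoids depending on whether the exported environment variable reaches the Python process.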
Would you know how I can tell requests that it is OK to trust my IT department's certificate?
P.S.: If it matters, I am working on Windows and am facing this in WSL as well.