
Description:

  • When running experiments with Weights & Biases (wandb), I occasionally get a PermissionError from Python's logging library and an OSError when requests tries to access the TLS CA certificate bundle.

  • The stack trace below is repeated many times with different "Message" values. I can't discern the order of operations. My guess is that the cert can't be accessed and that this crashes the script, but I don't know why it only happens sometimes.

  • If it is relevant, I ran the experiments on an Ubuntu server, authenticated via Kerberos.

What I've tried:

  • I have manually checked the CA cert, and more than half the time I can successfully run experiments. As such I don't think it's the same as this or this.

Stack trace:

Message: 'handle_request: stop_status'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/logging/__init__.py", line 1085, in emit
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/logging/__init__.py", line 1065, in flush
PermissionError: [Errno 13] Permission denied
Call stack:
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/threading.py", line 890, in _bootstrap
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 54, in run
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 95, in _run
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/internal/internal.py", line 280, in _process
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 175, in send
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 183, in send_request
Message: 'send_request: stop_status'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/apis/normalize.py", line 24, in wrapper
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 681, in check_stop_requested
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/lib/retry.py", line 102, in __call__
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 127, in execute
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 38, in execute
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/requests/api.py", line 119, in post
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/requests/api.py", line 61, in request
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/requests/adapters.py", line 416, in send
  File "/home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/requests/adapters.py", line 227, in cert_verify
OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /home/some_user/miniconda3/envs/part_ii_dev-conda/lib/python3.8/site-packages/certifi/cacert.pem
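
One way to narrow this down is to probe the bundle path at the moment of failure. As far as I can tell, requests raises this OSError when `os.path.exists` returns false for the bundle path, and `os.path.exists` also returns false when the underlying `stat` call is denied, so "invalid path" does not necessarily mean the file is gone. The sketch below is a hypothetical helper (`probe_path` is not part of wandb or requests) that distinguishes a missing file from one that is temporarily inaccessible:

```python
import os


def probe_path(path):
    """Report whether `path` is missing, inaccessible, or readable right now.

    Hypothetical diagnostic helper, not part of wandb, requests, or certifi.
    """
    try:
        os.stat(path)
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        # The file may exist, but the current credentials cannot stat it.
        return "permission denied"
    except OSError as exc:
        return f"stat failed: errno {exc.errno}"

    try:
        # Read a single byte to confirm the file can actually be opened.
        with open(path, "rb") as fh:
            fh.read(1)
    except OSError as exc:
        return f"read failed: errno {exc.errno}"
    return "ok"
```

Calling `probe_path(certifi.where())` periodically, or in an exception handler around the training loop, would show whether the bundle flips between "ok" and "permission denied" over the lifetime of a run. If it does, the problem is more likely access to the home directory than the certificate itself.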
peractio
  • It was likely a Kerberos error. It still sometimes happens after I do `kinit -R`, but I'm trying out the custom Kerberos ticket refresh script provided by the people managing the server, and hopefully that will solve it. – peractio May 12 '21 at 10:43
  • Hey peractio, I'm a member of the W&B team. Were you able to get this resolved? Can you share a config so we can try to reproduce this? – Scott Condron Aug 13 '21 at 09:50
  • It was due to Kerberos ticket expiry issues. I resolved it by using a custom script to renew the Kerberos tickets periodically. – peractio Aug 30 '21 at 23:20
  • Glad you got your issue solved :) – Scott Condron Aug 31 '21 at 15:27
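
The custom renewal script mentioned in the comments is not shown, so the following is only a minimal sketch of the general idea, assuming `kinit -R` is on the PATH and the ticket is still within its renewable lifetime. The function names, the interval, and the injectable `run`, `renew`, and `sleep` parameters are all hypothetical and exist only to make the sketch testable:

```python
import subprocess
import time


def renew_ticket(run=subprocess.run):
    """Try to renew the current Kerberos ticket with `kinit -R`.

    Returns True if the command exits with status 0, False otherwise
    (including when `kinit` is not installed).
    """
    try:
        result = run(["kinit", "-R"], capture_output=True, text=True)
    except FileNotFoundError:
        return False
    return result.returncode == 0


def renew_periodically(interval_s, rounds, renew=renew_ticket, sleep=time.sleep):
    """Call `renew` `rounds` times, sleeping `interval_s` seconds in between.

    Returns the list of per-round results so failures can be logged.
    """
    results = []
    for i in range(rounds):
        results.append(renew())
        if i < rounds - 1:
            sleep(interval_s)
    return results
```

For example, `renew_periodically(3600, 24)` would attempt a renewal once an hour for a day. Note that `kinit -R` can only extend a ticket up to its renewable lifetime. Beyond that a fresh `kinit` is needed, which may be why the server administrators supply their own script.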
