I am having a weird issue: I moved all my code & data to a different location with more disk space, then soft-linked my projects & data to those new locations. I assume there must be some file-handle issue, because wandb's logger is throwing errors. So my questions:
- How do I have wandb log only online and not locally, i.e., stop it from writing anything to `./wandb` (or any other hidden place it might be logging to), since that is what's creating issues? Note that my code ran fine once I stopped logging to wandb, so I assume that was the cause. Also note that `dir=None` is the default for wandb's param. (A concrete sketch of what I mean is right after this list.)
- How do I resolve this issue entirely, so that everything works seamlessly with all my projects soft-linked somewhere else?
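To make the first question concrete, here is a minimal sketch of what I'm trying; I'm assuming that both the `WANDB_DIR` environment variable and the `dir=` parameter of `wandb.init` control where the local `./wandb/run-*` directory gets created (the project name is just the one from the run URL in the log below):

```python
import os
import tempfile

import wandb

# Redirect wandb's local run directory to node-local scratch instead of the
# symlinked, NFS-backed project directory. Assumption: both WANDB_DIR and the
# dir= parameter of wandb.init control where ./wandb/run-* is created.
local_dir = tempfile.mkdtemp(prefix="wandb_")
os.environ["WANDB_DIR"] = local_dir

run = wandb.init(
    project="entire-diversity-spectrum",  # project name from the run URL in the log below
    dir=local_dir,  # instead of the default dir=None, which falls back to ./wandb
)
```

Even with this, my understanding is that wandb always needs some local staging directory before it syncs, so "online only" may really mean "stage somewhere that is not symlinked/NFS-backed".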
More details on the error:

```
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
self.flush()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 199, in run
self.dispatch_events(self.event_queue, self.timeout)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 368, in dispatch_events
handler.dispatch(event)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/events.py", line 454, in dispatch
_method_map[event_type](event)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/filesync/dir_watcher.py", line 275, in _on_file_created
logger.info("file/dir created: %s", event.src_path)
Message: 'file/dir created: %s'
Arguments: ('/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/wandb/run-20221023_170722-1tfzh49r/files/output.log',)
--- Logging error ---
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
self.flush()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 50, in run
self._run()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 101, in _run
self._process(record)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal.py", line 263, in _process
self._hm.handle(record)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 130, in handle
handler(record)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 138, in handle_request
logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: stop_status'
Arguments: ()
N/A% (0 of 100000) | | Elapsed Time: 0:00:00 | ETA: --:--:-- | 0.0 s/it
Traceback (most recent call last):
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1814, in <module>
main()
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1747, in main
train(args=args)
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1794, in train
meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 167, in meta_train_iterations_ala_l2l
log_zeroth_step(args, meta_learner)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/meta_learning.py", line 92, in log_zeroth_step
log_train_val_stats(args, args.it, step_name, train_loss, train_acc, training=True)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 55, in log_train_val_stats
_log_train_val_stats(args=args,
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 116, in _log_train_val_stats
args.logger.log('\n')
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logger.py", line 89, in log
print(msg, flush=flush)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
self._old_write(data)
OSError: [Errno 116] Stale file handle
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: Synced vit_mi Adam_rfs_cifarfs Adam_cosine_scheduler_rfs_cifarfs 0.001: args.jobid=101161: https://wandb.ai/brando/entire-diversity-spectrum/runs/1tfzh49r
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20221023_170722-1tfzh49r/logs
--- Logging error ---
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 27, in _read_message
resp = self._sock_client.read_server_response(timeout=1)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 283, in read_server_response
data = self._read_packet_bytes(timeout=timeout)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 269, in _read_packet_bytes
raise SockClientClosedError()
wandb.sdk.lib.sock_client.SockClientClosedError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
msg = self._read_message()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 29, in _read_message
raise MessageRouterClosedError
wandb.sdk.interface.router.MessageRouterClosedError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
self.flush()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 77, in message_loop
logger.warning("message_loop has been closed")
Message: 'message_loop has been closed'
Arguments: ()
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpmvf78q6owandb'>
_warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpt5etqpw_wandb-artifacts'>
_warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmp55lzwviywandb-media'>
_warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmprmk7lnx4wandb-media'>
_warnings.warn(warn_message, ResourceWarning)
Error:
====> about to start train loop
Starting training!
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
--- Logging error ---
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1086, in emit
stream.write(msg + self.terminator)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
self._old_write(data)
OSError: [Errno 116] Stale file handle
Call stack:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/worker.py", line 128, in _target
callback()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 467, in send_envelope_wrapper
self._send_envelope(envelope)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 384, in _send_envelope
self._send_request(
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 230, in _send_request
response = self._pool.request(
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/connectionpool.py", line 780, in urlopen
log.warning(
Message: "Retrying (%r) after connection broken by '%r': %s"
Arguments: (Retry(total=2, connect=None, read=None, redirect=None, status=None), SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')), '/api/5288891/envelope/')
```
Bounty

My suggestions for what might solve this:
- Figuring out a way to stop wandb from logging locally, or to minimize how much it writes locally.
- Figuring out exactly what is being logged locally and minimizing the space it takes.
- Making the logging work even when all the folders are symlinked (imho this should work out of the box).
- Finding a simple, systematic way to locate where the stale file handles are coming from (one sketch is right after this list).
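For that last suggestion, here is a minimal sketch of one systematic way to look for the stale handles, assuming a Linux node where `/proc/self/fd` exposes the process's open descriptors:

```python
import os

def dump_open_fds() -> None:
    """Print every open file descriptor of this process and its target.

    On Linux, /proc/self/fd/<n> is a symlink to the underlying file; a target
    ending in ' (deleted)' or pointing through the old pre-move path is a
    candidate for the stale handle.
    """
    fd_dir = "/proc/self/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd may have been closed between listdir and readlink
        print(f"fd {fd} -> {target}")

dump_open_fds()
```

Calling this right before and right after `wandb.init` should show whether any descriptor still points through the old (pre-move) path.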
I am surprised that moving everything to `/shared/rsaas/miranda9/` and running experiments from there did not solve the issue.
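Related to that, one thing I could try is resolving the soft links up front, so that every path handed to wandb (and to my own logger) already points at the real location; a minimal sketch, assuming `Path.resolve()` gives back the true, non-symlinked path:

```python
from pathlib import Path

import wandb

# Resolve the soft-link chain up front so every handle is opened through the
# real location rather than through the link (the path below is just the one
# from the log above; substitute wherever the run directory should live).
log_root = Path("/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning")
real_root = log_root.resolve()  # follows symlinks to the true path

run = wandb.init(dir=str(real_root))
```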
Cross-posted: