
I am having a weird issue: I moved all my code & data to a different location with more disk space, then soft-linked my projects & data to those new locations. I assume there is some file handle problem, because wandb's logger is throwing errors. So my questions:

  1. How do I have wandb only log online and not locally? (i.e. stop it from writing anything to ./wandb, or to any other hidden place it might be logging to, since that is what's creating issues). Note that my code ran fine once I stopped logging to wandb, so I assume that was the problem, and that dir=None is the default for wandb's parameter.
  2. How do I resolve this issue entirely so that everything works seamlessly with all my projects soft-linked somewhere else? (A minimal sketch of what I'd hope would work is shown right after this list.)
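
For reference, here is a minimal sketch of what I'd hope would work: pointing wandb's local files at a real (non-symlinked) directory with enough space via the dir argument / WANDB_DIR variable. The path and project name below are just placeholders:

    import os

    import wandb

    # Placeholder path with plenty of disk space; swap in your own real
    # (non-symlinked) location. realpath() resolves any symlinks just in case.
    real_log_dir = os.path.realpath("/shared/rsaas/miranda9/wandb_logs")
    os.makedirs(real_log_dir, exist_ok=True)

    # Both WANDB_DIR and the dir= argument control where wandb writes its
    # local run files instead of the default ./wandb under the working dir.
    os.environ["WANDB_DIR"] = real_log_dir
    run = wandb.init(project="my_project", dir=real_log_dir)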

More details on the error:

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 199, in run
    self.dispatch_events(self.event_queue, self.timeout)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 368, in dispatch_events
    handler.dispatch(event)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/events.py", line 454, in dispatch
    _method_map[event_type](event)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/filesync/dir_watcher.py", line 275, in _on_file_created
    logger.info("file/dir created: %s", event.src_path)
Message: 'file/dir created: %s'
Arguments: ('/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/wandb/run-20221023_170722-1tfzh49r/files/output.log',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 50, in run
    self._run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 101, in _run
    self._process(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal.py", line 263, in _process
    self._hm.handle(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 130, in handle
    handler(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 138, in handle_request
    logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: stop_status'
Arguments: ()
N/A% (0 of 100000) |      | Elapsed Time: 0:00:00 | ETA:  --:--:-- |   0.0 s/it

Traceback (most recent call last):
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1814, in <module>
    main()
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1747, in main
    train(args=args)
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1794, in train
    meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 167, in meta_train_iterations_ala_l2l
    log_zeroth_step(args, meta_learner)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/meta_learning.py", line 92, in log_zeroth_step
    log_train_val_stats(args, args.it, step_name, train_loss, train_acc, training=True)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 55, in log_train_val_stats
    _log_train_val_stats(args=args,
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 116, in _log_train_val_stats
    args.logger.log('\n')
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logger.py", line 89, in log
    print(msg, flush=flush)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
    self._old_write(data)
OSError: [Errno 116] Stale file handle
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: Synced vit_mi Adam_rfs_cifarfs Adam_cosine_scheduler_rfs_cifarfs 0.001: args.jobid=101161: https://wandb.ai/brando/entire-diversity-spectrum/runs/1tfzh49r
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20221023_170722-1tfzh49r/logs
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 27, in _read_message
    resp = self._sock_client.read_server_response(timeout=1)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 283, in read_server_response
    data = self._read_packet_bytes(timeout=timeout)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 269, in _read_packet_bytes
    raise SockClientClosedError()
wandb.sdk.lib.sock_client.SockClientClosedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
    msg = self._read_message()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 29, in _read_message
    raise MessageRouterClosedError
wandb.sdk.interface.router.MessageRouterClosedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 77, in message_loop
    logger.warning("message_loop has been closed")
Message: 'message_loop has been closed'
Arguments: ()
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpmvf78q6owandb'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpt5etqpw_wandb-artifacts'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmp55lzwviywandb-media'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmprmk7lnx4wandb-media'>
  _warnings.warn(warn_message, ResourceWarning)

Error:

====> about to start train loop
Starting training!
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1086, in emit
    stream.write(msg + self.terminator)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
    self._old_write(data)
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/worker.py", line 128, in _target
    callback()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 467, in send_envelope_wrapper
    self._send_envelope(envelope)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 384, in _send_envelope
    self._send_request(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 230, in _send_request
    response = self._pool.request(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/connectionpool.py", line 780, in urlopen
    log.warning(
Message: "Retrying (%r) after connection broken by '%r': %s"
Arguments: (Retry(total=2, connect=None, read=None, redirect=None, status=None), SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')), '/api/5288891/envelope/')

Bounty

My suggestions on what might solve this are:

  1. Figuring out a way to stop wandb from logging locally, or to minimize how much it logs locally.
  2. Figuring out what exactly is being logged and minimizing the space it takes.
  3. Having the logging work even when all the folders are symlinked (imho this should work out of the box).
  4. Figuring out a systematic and simple way to find where the stale file handles are coming from (one rough idea is sketched after this list).
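
On point 4, a rough, Linux-only sketch of one way I could imagine checking this (not something I've verified fixes anything): list the file descriptors the process holds and see which ones still point into the old, pre-move location:

    import os

    # Linux-only: every open file descriptor of this process appears as a
    # symlink under /proc/self/fd, so printing the targets makes handles
    # that still point into the old (pre-move) location easy to spot.
    fd_dir = "/proc/self/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd used to list the directory may already be closed
        print(fd, "->", target)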

I am surprised moving everything to /shared/rsaas/miranda9/ and running experiments from there did not solve the issue.


cross:

Charlie Parker

2 Answers


It seems like the solution is to not log to weird places through symlinks but to log to real paths, and to clean up the local wandb folders often to avoid disk quota errors on your HPC. Not my favorite solution, but it gets it done :).
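
A rough sketch of the kind of periodic cleanup I mean (the one-week cutoff is arbitrary, and it assumes the runs under ./wandb have already been synced to the W&B servers):

    import shutil
    import time
    from pathlib import Path

    # Prune local wandb run folders older than a week to keep disk usage in
    # check; only do this for runs that have already been synced.
    cutoff = time.time() - 7 * 24 * 3600
    for run_dir in Path("./wandb").glob("run-*"):
        if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
            shutil.rmtree(run_dir)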

Wandb should fix this; the whole point of wandb is that it works out of the box, so I don't have to do MLOps and can focus on research.

It's likely best to follow the discussion here: https://github.com/wandb/wandb/issues/4409

Charlie Parker

how do I have wandb only log online and not locally? (e.g. stop trying to log anything to ./wandb[or any secret place it might be logging to] since it's creating issues)

You could try wandb offline, which the wandb documentation describes as a way to turn off syncing:

The command wandb offline sets an environment variable, WANDB_MODE=offline. This stops any data from syncing from your machine to the remote wandb server. If you have multiple projects, they will all stop syncing logged data to W&B servers.
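
A minimal sketch of doing the same thing from inside a script, before wandb.init (the project name is just a placeholder):

    import os

    # Force offline mode so runs stay local and nothing is sent to the W&B
    # servers; they can be uploaded later with `wandb sync`.
    os.environ["WANDB_MODE"] = "offline"

    import wandb

    run = wandb.init(project="my_project")  # placeholder project name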

You could also check out this discussion on "wandb sync not logging in while running wandb local", where some people managed to figure out it had something to do with the --network host flag.

DialFrost