I am experimenting with the remote executor runtime, using the example provided at this link: https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/examples/remote_executor_example.py

If I use the CPU build of TensorFlow, everything works fine. With the GPU build, however, the following error occurs and execution aborts:

2020-03-29 16:27:22.904103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 16:27:22.904807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 978 MB memory) -> physical GPU (device: 0, name: GRID V100DX-32C, pci bus id: 0000:02:00.0, compute capability: 7.0)
2020-03-29 16:27:22.995000: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: No unary variant device copy function found for direction: 1 and Variant type_index: tensorflow::data::(anonymous namespace)::DatasetVariantWrapper
[[{{node partitionedcall_args_0/_2}}]]

How do I solve this? Has anyone faced a similar issue?

  • Please share a [mcve] in the post itself. – AMC Mar 30 '20 at 00:19
  • Does this answer your question? [MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen](https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ) – AMC Mar 30 '20 at 00:20
  • @AMC I was able to solve the NUMA node issue. However, the "No unary variant device copy function found for direction" error still occurs. – pjletstrackit Mar 30 '20 at 04:07
  • This is a known internal bug; we are working on resolving it and expect it to be fixed in the next pip-package release. – Keith Rush May 11 '20 at 23:41

1 Answer

As of this commit, this issue should be fixed in TFF at master. Options for mitigating on your side include:

  1. Building TFF from master using Bazel, as documented here.
  2. Waiting for the next pip package release, scheduled to be next week.
  3. Manually editing the site-packages on your remote worker to explicitly pin dataset instantiation on the CPU, as in the linked change.
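
To illustrate what option 3 means in general terms, here is a minimal sketch of pinning dataset construction to the CPU with `tf.device`. This is not the code from the linked change; the helper name `make_dataset_on_cpu` and the sample values are hypothetical, and it assumes TensorFlow 2.x running eagerly.

```python
import tensorflow as tf


def make_dataset_on_cpu(values):
    """Builds a tf.data.Dataset with its ops explicitly placed on the CPU.

    Hypothetical helper for illustration only; the actual TFF fix may differ.
    """
    # Dataset variants have no GPU copy function, so request CPU placement
    # for the ops that create the dataset.
    with tf.device("/device:CPU:0"):
        return tf.data.Dataset.from_tensor_slices(values).batch(2)


dataset = make_dataset_on_cpu([1, 2, 3, 4])
# Materialize the batches as plain Python lists for inspection.
batches = [batch.numpy().tolist() for batch in dataset]
print(batches)
```

The idea is that the dataset variant is created on the CPU, so the runtime should not need to copy it to or from the GPU, which is the copy the error message reports as unsupported.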
Keith Rush