Error while using GPU-based remote execution with TensorFlow Federated

Question

I am trying to experiment with the remote executor runtime, using the example provided at this link: https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/examples/remote_executor_example.py

If I use CPU-based TensorFlow, everything works fine. However, with GPU-based TensorFlow the following error occurs and aborts execution:

2020-03-29 16:27:22.904103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 16:27:22.904807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 978 MB memory) -> physical GPU (device: 0, name: GRID V100DX-32C, pci bus id: 0000:02:00.0, compute capability: 7.0)
2020-03-29 16:27:22.995000: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: No unary variant device copy function found for direction: 1 and Variant type_index: tensorflow::data::(anonymous namespace)::DatasetVariantWrapper
[[{{node partitionedcall_args_0/_2}}]]

How do I solve this? Has anyone faced a similar issue?

Does this answer your question? [MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen](https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ) — AMC, Mar 30 '20 at 00:20
@AMC I was able to solve the NUMA node issue. However, the "No unary variant device copy function found for direction" error still occurs. — pjletstrackit, Mar 30 '20 at 04:07
This is a known internal bug; we are working on resolving it and expect it to be fixed in the next pip-package release. — Keith Rush, May 11 '20 at 23:41

score 1 · Answer 1 · answered May 12 '20 at 23:34

As of this commit, this issue should be fixed in TFF at master. Options for mitigating on your side include:

Building TFF from master using Bazel, as documented here.
Waiting for the next pip package release, scheduled for next week.
Manually editing the site-packages on your remote worker to explicitly pin dataset instantiation on the CPU, as in the linked change.
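
To illustrate the idea behind the third option, here is a minimal sketch of pinning dataset instantiation to the CPU in plain TensorFlow. It is not the actual TFF change (the linked commit is not reproduced here), and the helper names `make_dataset_on_cpu` and `sum_dataset` are hypothetical. It assumes TensorFlow 2.x with eager execution. The point is that the dataset variant is created and consumed under an explicit CPU device scope, so the runtime never has to copy it to the GPU.

```python
import tensorflow as tf


def make_dataset_on_cpu(num_elements):
    # Hypothetical helper: build the dataset inside an explicit CPU device
    # scope so the dataset variant tensor is placed on the CPU.
    with tf.device('/device:CPU:0'):
        return tf.data.Dataset.range(num_elements)


def sum_dataset(dataset):
    # Hypothetical helper: consume the dataset on the CPU as well, so no
    # host-to-device copy of the dataset variant is requested.
    with tf.device('/device:CPU:0'):
        return dataset.reduce(
            tf.constant(0, dtype=tf.int64), lambda acc, x: acc + x)


total = int(sum_dataset(make_dataset_on_cpu(5)).numpy())
print(total)
```

The code inside TFF that constructs datasets on the remote worker would need an equivalent device scope; where exactly that lives depends on the installed TFF version.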
