
I'm currently trying to run some code on multiple TPU cores on Google Colab, but I get an error when the synchronization call (xm.rendezvous) is placed at the end of the target function, yet not when it is at the top. Here's an example:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# "Map function": acquires a corresponding Cloud TPU core, creates a tensor on it,
# and prints its core
def simple_map_fn(index, flags):
  # xm.rendezvous('init')  # placing the rendezvous here instead of at the bottom works fine

  # Acquires the (unique) Cloud TPU core corresponding to this process's index
  device = xm.xla_device()
  ordinal = xm.get_ordinal()
  local_ordinal = xm.get_local_ordinal()

  print(f"index {index}, process device {device}, local ordinal {local_ordinal}, ordinal {ordinal}")


  # Barrier to prevent master from exiting before workers connect.
  xm.rendezvous('leave')

# Spawns eight of the map functions, one for each of the eight cores on
# the Cloud TPU
flags = {}

xmp.spawn(simple_map_fn, args=(flags,), nprocs=8, start_method='fork')

When I run the code above in Google Colab, I get the following error:

Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'leave': Socket closed (14)

Any idea why the rendezvous fails when it is placed at the bottom of the target function?

btomtom5

1 Answer


After some careful investigation, I found that the issue does not occur when the Google Colab instance is running as a "high-RAM" instance. I conclude that the most likely cause of this error is that the process ran out of memory: when a worker process is killed by the OOM killer, its socket to the mesh service closes, and the remaining workers fail the rendezvous with "Socket closed".
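To check whether a runtime is a standard (~12 GB) or high-RAM (~25 GB) instance, a quick sketch using only the standard library (the exact sizes are assumptions; they vary by Colab tier):

  import os

  # Total physical memory via POSIX sysconf (works on Linux, including Colab).
  page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per page
  num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
  total_gb = page_size * num_pages / 1e9
  print(f"total RAM: {total_gb:.1f} GB")

If this prints roughly 12 GB or less, spawning eight TPU worker processes may exhaust memory before the final rendezvous is reached.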

btomtom5