
I'm currently trying to run some code on multiple TPU cores on Google Colab, but I get an error when the synchronization call (xm.rendezvous) is placed at the end of the target function, yet not when it is at the top. Here's an example:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# "Map function": acquires a corresponding Cloud TPU core, creates a tensor on it,
# and prints its core
def simple_map_fn(index, flags):
  # xm.rendezvous('init')  # placing the rendezvous here instead of at the bottom works fine

  # Acquires the (unique) Cloud TPU core corresponding to this process's index
  device = xm.xla_device()
  ordinal = xm.get_ordinal()
  local_ordinal = xm.get_local_ordinal()

  print(f"index {index}, process device {device}, local ordinal {local_ordinal}, ordinal {ordinal}")


  # Barrier to prevent master from exiting before workers connect.
  xm.rendezvous('leave')

# Spawns eight of the map functions, one for each of the eight cores on
# the Cloud TPU
flags = {}

xmp.spawn(simple_map_fn, args=(flags,), nprocs=8, start_method='fork')

When I run the code above in Google Colab, I get the following error:

Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'leave': Socket closed (14)

Any idea why the rendezvous fails when it is placed at the bottom of the target function?

btomtom5

1 Answer


After some careful investigation, I found that the issue does not occur when the Google Colab instance is running as a "high-RAM" instance. I conclude that the most likely cause of this error is that the process ran out of memory: when a worker process is killed by the OOM killer, its socket to the mesh service closes, and the remaining workers fail the rendezvous with "Socket closed".
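To check whether a runtime is a standard (~12 GB) or high-RAM (~25 GB) instance, a quick sketch using only the standard library (the exact sizes are assumptions; they vary by Colab tier):

  import os

  # Total physical memory via POSIX sysconf (works on Linux, including Colab).
  page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per page
  num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
  total_gb = page_size * num_pages / 1e9
  print(f"total RAM: {total_gb:.1f} GB")

If this prints roughly 12 GB or less, spawning eight TPU worker processes may exhaust memory before the final rendezvous is reached.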

btomtom5