
I followed the tutorial "Pre-training FairSeq RoBERTa on Cloud TPU using PyTorch" to set up a preemptible (v2-8) TPU environment and train my RoBERTa model. The PyTorch environment is based on torch-xla-1.6, as instructed by the document. However, it does not output any training log the way it normally does on GPU, and it has thrown the RPC failure warning below (the network endpoint is removed here) twice in 2-3 days, with a 12-hour gap between occurrences.

My training has 161,529 steps per epoch. According to the document, a v2-8 should take about 80 hours for the 5 epochs I configured. However, my job seems to be hanging.

Any advice, please?

 W    4566 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1599580717.037250202","description":"Error received from peer ipv4:<my_network_endpoint>:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
– user3786340

1 Answer


It sounds like your TPU may have been preempted in this case. Please try using a non-preemptible TPU.
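For reference, a command along these lines creates a non-preemptible TPU; this is a sketch based on the tutorial's setup, and the TPU name and zone below are placeholders rather than values from the question:

    # Omitting --preemptible yields a non-preemptible TPU that cannot be
    # reclaimed mid-training (name and zone are placeholders).
    gcloud compute tpus create my-roberta-tpu \
        --zone=us-central1-f \
        --accelerator-type=v2-8 \
        --version=pytorch-1.6

The preemptible node described in the question would have been created with the same command plus the --preemptible flag.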

– jysohn
  • Please don't write answers in a way that makes them look like questions. There are systems in place to check for Not An Answer (NAA) posts. This answer was falsely picked up by one of them. Please consider editing it to look more like an answer. Rephrase the "Could you try using a non-preemptible TPU?" line. – Sabito stands with Ukraine Nov 09 '20 at 00:28
  • This error occurs even with non-preemptible TPUs. – Ameet Deshpande May 13 '21 at 02:34
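If it is unclear whether preemption actually occurred, the node's state can be checked directly; a minimal sketch, again with a placeholder TPU name and zone:

    # A preempted node reports state: PREEMPTED; a healthy one reports READY.
    gcloud compute tpus describe my-roberta-tpu --zone=us-central1-f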