
I followed the tutorial "Pre-training FairSeq RoBERTa on Cloud TPU using PyTorch" to set up a preemptible (v2-8) TPU environment and train my RoBERTa model. The PyTorch environment is based on torch-xla-1.6, as instructed by the document. However, it does not output any training log the way it normally does on GPU, and it has thrown the RPC failure warning below (the network endpoint is removed here) twice in 2-3 days, with a 12-hour gap between occurrences.

My training has 161,529 steps per epoch. According to the document, a v2-8 should take about 80 hours for the 5 epochs I configured. However, my job seems to be hanging.

Any advice, please?

 W    4566 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1599580717.037250202","description":"Error received from peer ipv4:<my_network_endpoint>:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
– user3786340

1 Answer


It sounds like your TPU may have been preempted in this case. Please try using a non-preemptible TPU.
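For reference, a command along these lines creates a non-preemptible TPU; this is a sketch based on the tutorial's setup, and the TPU name and zone below are placeholders rather than values from the question:

    # Omitting --preemptible yields a non-preemptible TPU that cannot be
    # reclaimed mid-training (name and zone are placeholders).
    gcloud compute tpus create my-roberta-tpu \
        --zone=us-central1-f \
        --accelerator-type=v2-8 \
        --version=pytorch-1.6

The preemptible node described in the question would have been created with the same command plus the --preemptible flag.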

– jysohn
  • Please don't write answers in a way that makes them look like questions. There are systems in place to check for Not An Answer (NAA) posts. This answer was falsely picked up by one of them. Please consider editing it to look more like an answer. Rephrase the "Could you try using a non-preemptible TPU?" line. – Sabito stands with Ukraine Nov 09 '20 at 00:28
  • This error occurs even with non-preemptible TPUs. – Ameet Deshpande May 13 '21 at 02:34
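If it is unclear whether preemption actually occurred, the node's state can be checked directly; a minimal sketch, again with a placeholder TPU name and zone:

    # A preempted node reports state: PREEMPTED; a healthy one reports READY.
    gcloud compute tpus describe my-roberta-tpu --zone=us-central1-f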