
I have been allocated multiple Google Cloud TPUs in the us-central1-f zone, all of type v2-8.

How can I utilize all my TPUs to train a single model?

The us-central1-f zone doesn't offer TPU Pods, so pods don't seem to be the solution. Even if they were available, the number of v2-8 units I have doesn't match any of the pod slice sizes (16, 64, 128, 256), so I couldn't use them all in a single pod anyway.

– Kevin
  • Any specific reason you cannot move to `us-central1-a` which has TPU Pods? – Alex Ilchenko Jun 17 '19 at 00:32
  • The TPUs I've been given are specifically for `us-central1-f`. As in, if I move them, I'll have to pay for their usage, as opposed to it being free. – Kevin Jun 17 '19 at 01:03

2 Answers

Though I can't find documentation that explicitly answers this question, I have read multiple articles and questions and come to the conclusion that, if you are using v2-8 or v3-8 TPUs, it is not possible to combine several of them to train a single model. You would have to use larger slices such as v2-32 or v3-32 to get access to more cores, and the TFRC program does not provide those for free.

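A minimal sketch of why this is the case (assuming recent TensorFlow 2.x and a made-up TPU node name): a `TPUClusterResolver` points at exactly one TPU node, so the `TPUStrategy` built from it only ever distributes work across that single node's eight cores.

```python
import tensorflow as tf

# "my-tpu-node" is a hypothetical name of one v2-8 node; the resolver is
# bound to that single node and cannot be handed a list of separate nodes.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-node")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.TPUStrategy(resolver)
print("Replicas in sync:", strategy.num_replicas_in_sync)  # 8 for a v2-8

with strategy.scope():
    # Any Keras model built in this scope is replicated across the 8 cores
    # of that one node only.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```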
– Ameet Deshpande

I believe you cannot easily do this. To train a single model on multiple TPUs, you would need access to a zone with TPU Pods. Otherwise you can do the obvious thing: train the same model on different TPUs with different hyperparameters as a way to do a grid search, or train multiple weak learners and combine them manually afterwards.

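For the grid-search route, a rough sketch (TensorFlow 2.x assumed; the node names, learning rates, and model are placeholders) would bind each trial to its own v2-8 node and train an independent copy of the model:

```python
import tensorflow as tf

# Hypothetical v2-8 node names and the hyperparameter values to try.
TPU_NODES = ["tpu-node-0", "tpu-node-1", "tpu-node-2"]
LEARNING_RATES = [1e-2, 1e-3, 1e-4]

def train_one_trial(tpu_name, learning_rate):
    """Run one grid-search trial on its own v2-8 node."""
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )
    # model.fit(train_dataset, ...)  # dataset and epochs left out of the sketch
    return model

# In practice each trial would run in its own process or VM so they execute
# in parallel; the loop here only illustrates the one-trial-per-node mapping.
for tpu_name, lr in zip(TPU_NODES, LEARNING_RATES):
    train_one_trial(tpu_name, lr)
```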
– Alex Ilchenko