
I'm training a model with tensorflow==2.7.0, distributed across workers on Google Cloud AI Platform.

I'm using ParameterServerStrategy with multiple workers.
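
For context, the setup is roughly along these lines (a simplified sketch; the TFConfigClusterResolver and the toy model are placeholders, not my actual code):

import tensorflow as tf

# AI Platform sets TF_CONFIG, so the resolver picks up the cluster spec from the environment.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
  # Toy model just to keep the snippet self-contained.
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.compile(optimizer="adam", loss="mse")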

One thing I'm confused about, and couldn't find an answer to, is how to properly set the number of steps each worker runs in one epoch.

Consider this code snippet:

def dataset_fn(input_context):
  ...

data_input = tf.keras.utils.experimental.DatasetCreator(dataset_fn=dataset_fn)

model.fit(
  data_input,
  epochs=...,
  steps_per_epoch=...)
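
The elided dataset_fn is roughly like this (a simplified sketch; the dummy tensors and the GLOBAL_BATCH_SIZE constant are placeholders, not my real pipeline):

import tensorflow as tf

GLOBAL_BATCH_SIZE = 100  # matches the worked example below

def dataset_fn(input_context):
  # Placeholder tensors standing in for the real input data.
  features = tf.random.uniform((1_000_000, 10))
  labels = tf.random.uniform((1_000_000, 1))
  dataset = tf.data.Dataset.from_tensor_slices((features, labels))
  # Each worker's input pipeline reads a distinct shard of the data.
  dataset = dataset.shard(
      input_context.num_input_pipelines,
      input_context.input_pipeline_id)
  batch_size = input_context.get_per_replica_batch_size(GLOBAL_BATCH_SIZE)
  return dataset.shuffle(10_000).batch(batch_size).repeat()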

Is steps_per_epoch:

  1. the number of steps each worker runs,

or

  2. the number of times the master dispatches execution steps to the workers?

Let's say the dataset size is 1,000,000, batch_size=100, and there are 10 workers. If in one epoch I want to process each instance in the dataset exactly once, then

should I set steps_per_epoch = 1,000,000 / 100 = 10,000, or should I set steps_per_epoch = 1,000,000 / 100 / 10 = 1,000?

govordovsky