I'm training a model with tensorflow==2.7.0, distributed on Google Cloud AI Platform, using ParameterServerStrategy with multiple workers.
One thing I'm confused about, and couldn't find an answer to, is how to properly set the number of steps each worker runs in one epoch.
Consider this code snippet:

def dataset_fn(input_context):
    ...

data_input = tf.keras.utils.experimental.DatasetCreator(dataset_fn=dataset_fn)
model.fit(
    data_input,
    epochs=...,
    steps_per_epoch=...)
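For context, here is a minimal sketch of what my dataset_fn does under ParameterServerStrategy (the data, sizes, and shuffle buffer here are placeholders, not my real input pipeline):

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 100

def dataset_fn(input_context):
    # 10,000 placeholder examples with 8 features each.
    features = tf.random.uniform((10_000, 8))
    labels = tf.random.uniform((10_000,), maxval=2, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Each worker's input pipeline reads a distinct shard of the data.
    dataset = dataset.shard(input_context.num_input_pipelines,
                            input_context.input_pipeline_id)
    batch_size = input_context.get_per_replica_batch_size(GLOBAL_BATCH_SIZE)
    return dataset.shuffle(1_000).batch(batch_size).repeat()
```

With sharding like this, each of N workers only ever sees 1/N of the dataset, which is exactly why I'm unsure what steps_per_epoch should count.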
Is steps_per_epoch:

- the number of steps each worker runs, or
- the number of times the coordinator dispatches execution steps to the workers?
Let's say the dataset size is 1,000,000, batch_size=100, and there are 10 workers. If I want each epoch to process every instance in the dataset exactly once, should I set steps_per_epoch = 1,000,000 / 100 = 10,000, or steps_per_epoch = 1,000,000 / 100 / 10 = 1,000?
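To make the two readings concrete, here is the arithmetic for each (plain Python, using the numbers above):

```python
dataset_size = 1_000_000
batch_size = 100
num_workers = 10

# Reading 1: steps_per_epoch counts every batch in the dataset,
# no matter which worker executes it.
steps_total = dataset_size // batch_size  # 10,000

# Reading 2: steps_per_epoch counts only the batches assigned
# to one worker (each worker sees 1/10 of the data).
steps_per_worker = dataset_size // (batch_size * num_workers)  # 1,000

print(steps_total, steps_per_worker)
```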