There are a lot of methods in tf.data.Dataset, like batch(), shard(), shuffle(), prefetch(), map(), etc. Usually, when we implement an input_fn, we call them as needed.

I wonder whether calling these methods in a different order has any effect on the program. For instance, are the following two call sequences equivalent?

dataset = dataset.shuffle(buffer_size).batch(batch_size)
dataset = dataset.batch(batch_size).shuffle(buffer_size)
nexgus
  • probably duplicated with https://stackoverflow.com/questions/50437234/tensorflow-dataset-shuffle-then-batch-or-batch-then-shuffle – zihaozhihao Sep 26 '19 at 05:35
  • @zihaozhihao yes, it is duplicated, i'm so sorry. besides, https://stackoverflow.com/questions/56944856/tensorflow-dataset-questions-about-shuffle-batch-and-repeat?noredirect=1&lq=1 is a good question, too. – nexgus Sep 26 '19 at 06:42
  • Yes, it is! Thanks for sharing :) – zihaozhihao Sep 26 '19 at 06:44

1 Answer


I wonder whether calling these methods in a different order has any effect on the program.

Yes, there is a difference. Almost always, shuffle() should be called before batch(), because we want to shuffle records, not batches.

The transformations of a tf.data.Dataset are applied in the same sequence that they are called.

batch() combines consecutive elements of its input into single, batched elements in the output.

import tensorflow as tf
import numpy as np

dataset = tf.data.Dataset.from_tensor_slices(np.arange(19))
for batch in dataset.batch(5):
  print(batch)

Output:

tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int64)
tf.Tensor([10 11 12 13 14], shape=(5,), dtype=int64)
tf.Tensor([15 16 17 18], shape=(4,), dtype=int64)

We shuffle the data before feeding it to a network. shuffle() fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new ones from the input. For perfect shuffling, the buffer size should be equal to the full size of the dataset.

for batch in dataset.shuffle(5).batch(5):
  print(batch)

Output:

tf.Tensor([2 0 1 4 8], shape=(5,), dtype=int64)
tf.Tensor([ 9  3  7  6 11], shape=(5,), dtype=int64)
tf.Tensor([12 14 15  5 13], shape=(5,), dtype=int64)
tf.Tensor([17 18 16 10], shape=(4,), dtype=int64)

You can see that the shuffle is not uniform (elements stay close to their original positions, since the buffer holds only 5 of the 19 elements), but it is often good enough.
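For comparison, here is a sketch of a perfect shuffle over the same 19-element dataset from above: setting buffer_size to the full dataset size lets shuffle() sample uniformly from all remaining elements.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(19))

# A buffer covering all 19 elements gives a uniform shuffle;
# the exact order will differ on every run.
for batch in dataset.shuffle(19).batch(5):
  print(batch)
```

The batch shapes are the same as before (three of 5 and one of 4), but the records are now drawn from anywhere in the dataset. The trade-off is memory: the buffer must hold the whole dataset, which is why smaller buffers are common for large datasets.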

However, if you apply the methods in the opposite order, you get an unexpected result: it shuffles batches, not records.

for batch in dataset.batch(5).shuffle(5):
  print(batch)

Output:

tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int64)
tf.Tensor([15 16 17 18], shape=(4,), dtype=int64)
tf.Tensor([10 11 12 13 14], shape=(5,), dtype=int64)