I wonder if there is any effect on the program when we call these methods
in a different sequence?
Yes, there is a difference. Almost always, shuffle()
should be called before batch()
because we want to shuffle records, not batches.
The transformations of a tf.data.Dataset
are applied in the same sequence that they are called.
batch() combines consecutive elements of its input into a single, batched element in the output.
import tensorflow as tf
import numpy as np
dataset = tf.data.Dataset.from_tensor_slices(np.arange(19))
for batch in dataset.batch(5):
    print(batch)
Output:
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int64)
tf.Tensor([10 11 12 13 14], shape=(5,), dtype=int64)
tf.Tensor([15 16 17 18], shape=(4,), dtype=int64)
We usually shuffle the data before feeding it to a network. shuffle() fills a buffer with buffer_size
elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, the buffer size should be greater than or equal to the full size of the dataset.
for batch in dataset.shuffle(5).batch(5):
    print(batch)
Output:
tf.Tensor([2 0 1 4 8], shape=(5,), dtype=int64)
tf.Tensor([ 9 3 7 6 11], shape=(5,), dtype=int64)
tf.Tensor([12 14 15 5 13], shape=(5,), dtype=int64)
tf.Tensor([17 18 16 10], shape=(4,), dtype=int64)
You can see that the result is not uniformly shuffled, because the buffer holds only 5 of the 19 elements, but it is good enough.
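To see why a small buffer keeps elements close to their original positions, here is a rough pure-Python model of the buffer mechanism described above. It is only a sketch of the idea, not TensorFlow's actual implementation, and the helper name buffered_shuffle is hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of a fixed-size shuffle buffer (hypothetical helper)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input is exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

result = list(buffered_shuffle(range(19), buffer_size=5, seed=0))
print(result)
```

In this model the element emitted at position k can never have an original index larger than k + buffer_size - 1, which is why the first batch above only contains small numbers. With buffer_size equal to the dataset size, the loop only fills the buffer and the final shuffle produces a full random permutation.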
However, if you apply the methods in the opposite order, you will get an unexpected result: batch() followed by shuffle() shuffles whole batches, not individual records.
for batch in dataset.batch(5).shuffle(5):
    print(batch)
Output:
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int64)
tf.Tensor([15 16 17 18], shape=(4,), dtype=int64)
tf.Tensor([10 11 12 13 14], shape=(5,), dtype=int64)
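If you want to check this yourself rather than eyeball the output, something like the following should work. It is a small sketch assuming TensorFlow 2.x in eager mode; the helper is_consecutive and the variable names are made up for illustration.

```python
import numpy as np
import tensorflow as tf

data = np.arange(19)
dataset = tf.data.Dataset.from_tensor_slices(data)

def is_consecutive(batch):
    """True if the batch holds a run of consecutive integers (hypothetical helper)."""
    return bool(np.all(np.diff(batch.numpy()) == 1))

# batch() then shuffle(): the batches change order, their contents do not.
batch_then_shuffle = list(dataset.batch(5).shuffle(5))
all_consecutive = all(is_consecutive(b) for b in batch_then_shuffle)
print(all_consecutive)

# shuffle() then batch(), with the buffer covering the whole dataset,
# mixes individual records before they are grouped into batches.
shuffled = [b.numpy() for b in dataset.shuffle(buffer_size=len(data)).batch(5)]
batch_sizes = [len(b) for b in shuffled]
seen = sorted(int(v) for b in shuffled for v in b)
print(batch_sizes, seen == list(range(19)))
```

The first check confirms that every batch is still a block of neighbouring records after batch().shuffle(). The second confirms that shuffle().batch() keeps every record exactly once and only changes which records end up together.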