
When using the following code to train my network:

classifier = tf.estimator.Estimator(
    model_fn=my_neural_network_model, 
    model_dir=some_path_to_save_checkpoints,
    params={
        some_parameters
    }
)
classifier.train(input_fn=data_train_estimator, steps=step_num)

where data_train_estimator is defined as:

def data_train_estimator():
    dataset = tf.data.TextLineDataset(train_csv_file).map(_parse_csv_train)  
    dataset = dataset.batch(100)
    dataset = dataset.shuffle(1000)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator() 
    feature, label = iterator.get_next()
    return feature, label

How does dataset.shuffle(1000) actually work?

More specifically,

Let's say I have 20000 images, batch size = 100, shuffle buffer size = 1000, and I train the model for 5000 steps.

1. For every 1000 steps, am I using 10 batches (of size 100), each independently taken from the same 1000 images in the shuffle buffer?

2.1 Does the shuffle buffer work like a moving window?

2.2 Or, does it randomly pick 1000 out of the 20,000 images (with or without replacement)?

3. In the whole 5000 steps, how many different states has the shuffle buffer been in?

user10253771

1 Answer


With buffer_size=1000 (the buffer_size argument of dataset.shuffle) you keep a buffer of 1,000 points in memory. When you need a data point during training, you draw it randomly from points 1-1000. That leaves only 999 points in the buffer, so point 1001 is added to fill the vacant slot. The next point can then be drawn from the updated buffer.
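This mechanism can be sketched in plain Python. The shuffle_buffer generator below is a hypothetical stand-alone helper, not TensorFlow API, and is only a simplified model of what tf.data does internally:

```python
import random

def shuffle_buffer(source, buffer_size, seed=None):
    """Simplified model of tf.data's shuffle: fill a buffer of
    `buffer_size` elements, emit a random one, then refill the
    vacated slot with the next element from the source."""
    rng = random.Random(seed)
    it = iter(source)
    buf = []
    for item in it:               # initial fill of the buffer
        buf.append(item)
        if len(buf) == buffer_size:
            break
    while buf:
        idx = rng.randrange(len(buf))
        yield buf[idx]            # draw a random element
        try:
            buf[idx] = next(it)   # replace it with the next source element
        except StopIteration:
            buf.pop(idx)          # source exhausted: buffer shrinks

out = list(shuffle_buffer(range(20_000), buffer_size=1000, seed=0))
print(sorted(out) == list(range(20_000)))  # True: a permutation, no replacement
```

One pass over the source yields each element exactly once, just in a locally shuffled order.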

To answer you in point form:

For every 1000 steps, am I using 10 batches (of size 100), each independently taken from the same 1000 images in the shuffle buffer?

No, the buffer itself stays constant in size; each drawn image is replaced with an image not yet used in that epoch.

Does the shuffle buffer work like a moving window? Or, does it randomly pick 1000 out of the 20,000 images (with or without replacement)?

It draws without replacement and doesn't really work like a moving window, since drawn images are replaced dynamically.
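Both properties can be checked with a small simulation (buffered_shuffle is a hypothetical helper modelling the buffer, not TensorFlow API): element i of the source cannot be emitted before output position i - buffer_size + 1, because it hasn't entered the buffer yet, but it can linger in the buffer far longer than a fixed window would allow.

```python
import random

def buffered_shuffle(source, buffer_size, seed=0):
    # Simplified buffer model: draw one element at random,
    # refill the vacated slot from the source until it runs out.
    rng = random.Random(seed)
    it = iter(source)
    buf = [next(it) for _ in range(buffer_size)]
    while buf:
        i = rng.randrange(len(buf))
        yield buf[i]
        try:
            buf[i] = next(it)
        except StopIteration:
            buf.pop(i)

B = 1000
out = list(buffered_shuffle(range(10_000), B, seed=1))
# Without replacement: the output is a permutation of the input.
print(sorted(out) == list(range(10_000)))  # True
# Element i never appears before position i - B + 1 ...
print(all(pos >= elem - B + 1 for pos, elem in enumerate(out)))  # True
# ... but some elements are typically delayed well beyond a window of size B.
print(max(pos - elem for pos, elem in enumerate(out)))
```

So the output is approximately ordered by source position, with deviations bounded in one direction only; that is what distinguishes it from a true moving window.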

In the whole 5000 steps, how many different states has the shuffle buffer been in?

Close to n_images * n_steps. So 25,000,000 in this case. There might be a few states that have been seen before by chance, but it is unlikely.

You might also find this question useful.

user2653663
  • Thanks for your explanation! However I just want to make two things clear: 1. when you say "grow", I guess you mean the content of the buffer is dynamically changing, not that the size of the buffer is "growing". 2. How has the buffer been in 25,000,000 states? In my opinion it should be around 100*2*(5000/100) = 10000 states, as for every batch 100 elements are taken out of the buffer and then 100 new elements are put into it; in 5000 steps there should be only 50 batches. I guess you mean 5000 epochs, thus there should be 20,000x5,000x2 states, am I correct? – user10253771 Sep 11 '18 at 10:06
  • 1
    As far as I understand when you construct a batch of 100, you draw them one at a time and add to the buffer. I'm not 100% confident on this, but that is what I have gathered from other answers. I got a bit confused by your terminology. It seems like you are using steps as number of points used for training, with replacement. So 1 epoch of 20,000 images with a batch size of 100 would be 20,000 steps? In that case there should be 20,000 unique buffer states. If the entire batch is drawn in one go, that should be 200. I'm not sure where you get a factor of 2. – user2653663 Sep 11 '18 at 10:47
  • 1
    Yes, I used a wrong term. "step" in this case should be the number of batches, as the weights are updated batch by batch. The factor of 2 came from the fact that the buffer will remove an element, which creates a [999 elements and an empty space] state; and when a new element comes in, it's in a [999 old elements and a new element] state. Therefore every time we use a data point, the buffer will go through 2 states. – user10253771 Sep 12 '18 at 00:23