
I had a question about the use of batch, repeat and shuffle with tf.data.Dataset.

I understand that .batch dictates how many training examples will undergo stochastic gradient descent in a single step, but the uses of .repeat and .shuffle are still not clear to me.

First Question

Even after reviewing here and here, my understanding is that .repeat is used to reiterate over the dataset once it is exhausted (when a tf.errors.OutOfRangeError would otherwise be thrown). Therefore, in my code, does that mean I no longer have to implement:

try:
    while True:
        _ = sess.run(self.optimizer)

except tf.errors.OutOfRangeError:
    pass

because .repeat will automatically repeat the dataset once it is exhausted? When does it stop? Or will it never stop, so that I just have to exit the while True loop once a certain number of batches (say 1000) have passed?

Second Question

Secondly, the use of .shuffle makes no sense to me. Does .shuffle().batch() mean that if I have, say, 100,000 samples, .shuffle puts 1,000 of them randomly in a buffer, and .batch() then batches, say, 100 of them? From my understanding, the next batch would then use 999 of those samples and place 1 new one in the buffer. So if my samples have no order to them, should .shuffle be avoided altogether? And if .batch is used, would it still batch 100 from those 999+1 in the buffer?

Third Question

And lastly, if I am using a separate tf.data.Dataset object for testing, what order of .shuffle().batch() should I consider? Right now I use:

sess.run(self.test_init)
try:
    while True:
        accuracy_batch = sess.run(self.accuracy)

except tf.errors.OutOfRangeError:
    pass

With:

test_data = self.test_dataset.shuffle(self.batch_size).batch(self.batch_size)

I have over 110,000 training examples at my disposal, so self.batch_size sets the number of samples I want to use to test my accuracy. So, if I wanted to test on the whole test dataset, I wouldn't use .batch? But since I have it iterating over the whole dataset with while True, it makes no difference? With the use of .shuffle I noticed my accuracies changed, but without it they were very similar. This makes me think .shuffle is randomizing the batches and may be reusing training examples?

Jamie Dimon
  • This question has no relation to the `batch-file` tag you used. You can read the description of a tag just by hovering the mouse over it... I suggest you edit the tags and remove the `batch-file` one... – Aacini Jul 09 '19 at 03:28
  • I've answered each of these, but it's worth bearing in mind for next time that this probably ought to be 3 separate Stack Overflow questions. Best to keep each individual one as specific as you can. – Stewart_R Jul 09 '19 at 11:30

1 Answer


First Question

That's correct - if you .repeat() the dataset indefinitely, the iterator is never exhausted, so you no longer need to catch the OutOfRangeError.

repeat() takes an optional argument for the number of times it should repeat. This means repeat(10) will iterate over the entire dataset 10 times. If you choose to omit the argument then it will repeat indefinitely.
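
A minimal sketch of the difference (using the same TF 1.x session style as the question; tf.data.Dataset.range just stands in for a real dataset):

import tensorflow as tf

# A toy dataset of 5 elements, repeated 2 times -> 10 elements in total.
dataset = tf.data.Dataset.range(5).repeat(2)
next_element = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    try:
        while True:
            print(sess.run(next_element))   # prints 0..4, then 0..4 again
    except tf.errors.OutOfRangeError:
        pass  # raised after the 2 passes; with repeat() (no argument) it never is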

Second Question

shuffle() (if used) should be called before batch() - we want to shuffle records, not batches.

The buffer is first filled by adding your records in order; then, once it is full, a random record is selected and emitted, and a new record is read from the original source.

If you have something like

ds.shuffle(1000).batch(100)

then in order to return a single batch, this last step is repeated 100 times (maintaining the buffer at 1000). Batching is a separate operation.
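
As a sketch (again in the question's TF 1.x style, with numbers mirroring the question):

import tensorflow as tf

# 100,000 "records"; a 1,000-record shuffle buffer; batches of 100.
dataset = (tf.data.Dataset.range(100000)
           .shuffle(buffer_size=1000)  # emit one random record from the buffer,
                                       # then refill it from the source
           .batch(100))                # collect the next 100 emitted records
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    first_batch = sess.run(next_batch)
    print(first_batch.shape)  # (100,)
    print(first_batch.max())  # below ~1100: every record came from near the
                              # start, since the buffer only held 1,000 records

This is also why the buffer size matters: a buffer much smaller than the dataset only shuffles locally.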

Third Question

Generally we don't shuffle a test set at all - only the training set (we evaluate using the entire test set anyway, right? So why shuffle?).

So, if I wanted to just test on the whole test dataset I wouldn't use .batch

Hmm - not so (at least not always). You would certainly need to use batch if your whole test dataset didn't fit into memory - a common occurrence. You would want to test on the whole dataset, but to run the numbers in manageable bites!
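
A sketch of that pattern, with hypothetical in-memory stand-ins for the test data (the per-batch accuracy here is just a placeholder for the question's sess.run(self.accuracy)):

import numpy as np
import tensorflow as tf

# Hypothetical test set; in real use it might not fit in memory at once.
test_images = np.random.rand(1000, 28, 28).astype(np.float32)
test_labels = np.random.randint(10, size=1000)

# No shuffle, no repeat: one ordered pass over the whole test set, in batches.
test_dataset = (tf.data.Dataset.from_tensor_slices((test_images, test_labels))
                .batch(100))
next_batch = test_dataset.make_one_shot_iterator().get_next()

batch_accuracies = []
with tf.Session() as sess:
    try:
        while True:
            images, labels = sess.run(next_batch)
            # Placeholder for a real model's accuracy on this batch.
            batch_accuracies.append(float((labels == 0).mean()))
    except tf.errors.OutOfRangeError:
        pass  # every test example has now been seen exactly once

print("test accuracy:", np.mean(batch_accuracies))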

Stewart_R
  • Thanks for your comment. The only issue I'm still having is with your answer to my second question. I don't understand why .shuffle is used and in what circumstances it should be used. What is the point of a .batch(100) obtained from an ordered set of examples using .shuffle(1000)? When I .batch(100), am I not obtaining 100 random examples? And if I randomize my training set beforehand, drawing 100 samples in order is the same as drawing 100 random examples anyway. Does what I'm asking make sense? – Jamie Dimon Jul 10 '19 at 17:42
  • `batch(100)` gets the *next* 100 records in order. By itself it does not shuffle or randomise the order in any way. – Stewart_R Jul 10 '19 at 17:48
  • So .shuffle(1000) will draw and shuffle 1000 items from the training set, and .batch(100) will simply draw 100 ordered items from those 1000 shuffled items? So when I return and run the next iteration, will it draw a brand new 1000 and batch 100? Or will it continue and draw the next batch from the original 1000? Will it then only use 1000 examples? Or will it continue that as well until the training set is exhausted? Sorry for the extra questions! – Jamie Dimon Jul 12 '19 at 04:01
  • @JamieDimon Follow-up questions should normally be NEW questions on Stack Overflow. Nonetheless: shuffle(1000) does not "draw 1000". It shuffles the *whole* set by loading the first 1000 into memory, then picking a random one and loading the next, then repeating until it has passed over the whole set. By default it shuffles on each iteration, but we can pass an optional `reshuffle_each_iteration=False` if we want it to shuffle only once and then reuse that order. – Stewart_R Jul 12 '19 at 06:16
  • Should also just clarify that `batch(100)` also works over the whole dataset. My comments above were intended to highlight the lack of shuffling by illustrating how the *first* batch is generated. For clarity, `batch(100)` will turn, for example, a dataset of 500 records with shape `(2, 3)` into a dataset of 5 records with shape `(100, 2, 3)`. Does that make sense? There is a sketch after this thread illustrating both points. – Stewart_R Jul 12 '19 at 06:19
  • Many thanks for answering my questions. Better to ask follow-up questions that directly pertain to my original questions, so that others who have similar problems won't be scattered around looking for the answers. – Jamie Dimon Jul 12 '19 at 19:53
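
Pulling the points from this comment thread into one sketch (TF 1.x style again; the batch shapes and reshuffle_each_iteration behaviour are as described in the comments above):

import numpy as np
import tensorflow as tf

# batch(100) only groups records: 500 records of shape (2, 3)
# become 5 batches of shape (100, 2, 3), in their original order.
records = np.zeros((500, 2, 3), dtype=np.float32)
batched = tf.data.Dataset.from_tensor_slices(records).batch(100)
next_batch = batched.make_one_shot_iterator().get_next()

# shuffle() reshuffles on every pass by default; with
# reshuffle_each_iteration=False it shuffles once and reuses that order.
repeated = (tf.data.Dataset.range(5)
            .shuffle(5, reshuffle_each_iteration=False)
            .repeat(2))
next_element = repeated.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(next_batch).shape)                   # (100, 2, 3)
    print([sess.run(next_element) for _ in range(10)])  # same order both passes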