
Question

In TensorFlow, I frequently run into OOM errors during the first epoch of training. However, because the network is so large, the first epoch takes around an hour, far too long for testing new hyper-parameters quickly.

Ideally, I'd like to be able to sort the iterator so that I can just run get_next() once on the largest batch.

How can I do this? Or perhaps there is a better way to fail early?

The iterator yields tuples in the format (source, tgt_in, tgt_out, key_weights, source_len, target_len), where I'm looking to sort by target length. The data is padded and batched before being returned.

The dataset is a list of sentences, bucketed so that sentences of similar length end up in the same batch. I would like to find the largest batch in the iterator and run only that one.
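
For context, a rough sketch of the kind of pipeline described (the identifiers here are illustrative; the actual pipeline code isn't shown in the question):

import tensorflow as tf

# Illustrative only: pad and batch a dataset of
# (source, tgt_in, tgt_out, key_weights, source_len, target_len) tuples,
# so that sentences in a batch are padded to a common length.
batched = dataset.padded_batch(
    batch_size,
    padded_shapes=([None], [None], [None], [None], [], []))
iterator = batched.make_initializable_iterator()
next_batch = iterator.get_next()  # next_batch[-1] is target_len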

Some Code

The code below would work if the initializer didn't shuffle the iterator every time, destroying the information gained about the position of the largest batch. I'm not quite sure how to modify it: as soon as one reads the length of a batch using get_next(), that batch has already been "popped" and can no longer be used as input to the model.

import itertools

import numpy as np
import tensorflow as tf


def verify_hparams():
    train_sess.run(train_model.iterator.initializer)

    # Build the get_next() tensors once; calling get_next() inside a loop
    # adds new ops to the graph on every iteration.
    next_batch = train_model.iterator.get_next()

    # First pass: find the index of the batch with the longest targets.
    max_index = -1
    max_len = 0
    for batch in itertools.count():
        try:
            batch_len = np.amax(train_sess.run(next_batch[-1]))
            if batch_len > max_len:
                max_len = batch_len
                max_index = batch

        except tf.errors.OutOfRangeError:
            num_batches = batch  # the run at index `batch` failed, so
            break                # exactly `batch` batches succeeded

    # Second pass: skip ahead to the largest batch and train on it alone.
    # Re-initializing is required because the iterator is now exhausted,
    # but it also reshuffles the data, which is what breaks this approach:
    # max_index no longer points at the same batch.
    train_sess.run(train_model.iterator.initializer)
    for batch in range(num_batches):
        try:
            if batch == max_index:  # == rather than `is` for int comparison
                _, _ = loaded_train_model.train(train_sess)
            else:
                train_sess.run(next_batch)

        except tf.errors.OutOfRangeError:
            break

    return num_batches
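
For what it's worth, one possible workaround (a sketch, assuming the shuffling comes from tf.data's Dataset.shuffle somewhere in the input pipeline, which isn't shown here) would be to pin the shuffle order so that both passes see the same sequence of batches:

# Hypothetical: wherever the input pipeline builds `dataset`.
# A fixed seed plus reshuffle_each_iteration=False makes every run of
# iterator.initializer replay the same order, so the max_index found in
# the first pass still points at the same batch in the second pass.
dataset = dataset.shuffle(buffer_size=10000,
                          seed=42,
                          reshuffle_each_iteration=False)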
Evan Weissburg

1 Answer


What you need is a "peek" operation. Most languages have iterators that let you check whether there is more data (something like iterator.hasNext()), but the functionality you are asking for is essentially iterator.sizeOfNext(). To my knowledge, TensorFlow's iterators don't have such functionality.
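
To illustrate, a minimal sketch of what such a wrapper could look like in a TF 1.x session setting (the PeekableIterator class and its methods are hypothetical, not a TensorFlow API):

import numpy as np


class PeekableIterator:
    """Hypothetical sketch: wraps a session and the tensors returned by
    iterator.get_next(), caching one fetched batch so it can be inspected
    before it is consumed."""

    def __init__(self, sess, next_element):
        self._sess = sess
        self._next_element = next_element  # tensors from iterator.get_next()
        self._buffer = None

    def peek(self):
        # Fetch and cache the upcoming batch without "popping" it.
        if self._buffer is None:
            self._buffer = self._sess.run(self._next_element)
        return self._buffer

    def size_of_next(self):
        # For the question's tuple layout, [-1] is target_len.
        return np.amax(self.peek()[-1])

    def get_next(self):
        # Return the cached batch if one was peeked, else fetch a fresh one.
        if self._buffer is not None:
            batch, self._buffer = self._buffer, None
            return batch
        return self._sess.run(self._next_element)

Note that this only helps if the peeked values can be fed back to the model, e.g. through placeholders and feed_dict; when the iterator's output tensors are wired directly into the graph, a batch fetched for peeking has already been consumed, which is exactly the problem described in the question.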

Furthermore, such functionality is unlikely to be added, because I can imagine generators that couldn't provide it, so adding this feature would break backwards compatibility.

bremen_matt
  • What would you suggest as a solution? Spending an hour to get an OOM exception is way too slow for me. – Evan Weissburg Feb 02 '18 at 12:06
  • You are probably going to have to scan through all of the datasets to find the smallest one. – bremen_matt Feb 02 '18 at 12:08
  • But there is another facet that doesn't make sense to me here... If you want smaller datasets to train on, then you should make a generator that only spits out batches of size that you consider to be appropriate. Otherwise, you could always just consider a subset of the values returned by get_next() – bremen_matt Feb 02 '18 at 12:09
  • The issue here is that each batch has equal size, but different length. Each batch is a collection of sentences, padded up to the same sentence length with filler. Larger sentences are in a batch together. – Evan Weissburg Feb 02 '18 at 12:11
  • I think there is something that you should address in your question, that is, whether or not you are willing to "discard" datasets that are too large. If so, then you can scan through all of the datasets, only keeping the smallest one. Then you can train on that dataset. Would that be an option? – bremen_matt Feb 02 '18 at 12:16
  • I only have one dataset organized into batches. I'll clarify in the question a bit more. – Evan Weissburg Feb 02 '18 at 12:21
  • Sorry. Poor wording on my part. I meant to say "batches" in the above comments, not "dataset" – bremen_matt Feb 02 '18 at 12:22
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164402/discussion-between-evan-weissburg-and-bremen-matt). – Evan Weissburg Feb 02 '18 at 12:23