Batching huge data in TensorFlow

Question

I am trying to perform binary classification using the code/tutorial from https://github.com/eisenjulian/nlp_estimator_tutorial/blob/master/nlp_estimators.py

print("Loading data...")
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
print(len(y_train), "train sequences")
print(len(y_test), "test sequences")

print("Pad sequences (samples x time)")
x_train = sequence.pad_sequences(x_train_variable, 
                             maxlen=sentence_size, 
                             padding='post', 
                             value=0)
x_test = sequence.pad_sequences(x_test_variable, 
                            maxlen=sentence_size, 
                            padding='post', 
                            value=0)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)

def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()



def cnn_model_fn(features, labels, mode, params):
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])

    training = mode == tf.estimator.ModeKeys.TRAIN
    dropout_emb = tf.layers.dropout(inputs=input_layer,
                                    rate=0.2,
                                    training=training)

    conv = tf.layers.conv1d(
        inputs=dropout_emb,
        filters=32,
        kernel_size=3,
        padding="same",
        activation=tf.nn.relu)

    # Global Max Pooling
    pool = tf.reduce_max(input_tensor=conv, axis=1)

    hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)

    dropout_hidden = tf.layers.dropout(inputs=hidden,
                                       rate=0.2,
                                       training=training)

    logits = tf.layers.dense(inputs=dropout_hidden, units=1)

    # This will be None when predicting
    if labels is not None:
        labels = tf.reshape(labels, [-1, 1])

    optimizer = tf.train.AdamOptimizer()

    def _train_op_fn(loss):
        return optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())

    return head.create_estimator_spec(
        features=features,
        labels=labels,
        mode=mode,
        logits=logits,
        train_op_fn=_train_op_fn)

cnn_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn,
                                    model_dir=os.path.join(model_dir, 'cnn'),
                                    params=params)
train_and_evaluate(cnn_classifier)

The example here loads data from IMDB movie reviews. I have my own text dataset, which is approximately 2 GB. In this example, the line (x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size) loads the whole dataset into memory. If I try to do the same with my data, I run out of memory. How can I restructure this logic to read data in batches from disk?

score 1 · Accepted Answer · answered Aug 13 '18 at 04:43

You want to change the dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train)) line. There are lots of ways of creating a dataset; from_tensor_slices is the easiest, but it won't work on its own if you can't load the entire dataset into memory.
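To illustrate one of those other ways: a plain Python generator can stream examples from disk so that only one batch is in memory at a time. This is only a sketch under assumptions. The tab-separated "label<TAB>text" line layout and the names iter_examples and iter_batches are hypothetical, and an in-memory io.StringIO stands in for a real file opened with open(). A generator like iter_examples could, for example, be passed to tf.data.Dataset.from_generator, leaving the batching to dataset.batch.

```python
import io
import itertools


def iter_examples(lines):
    """Lazily yield (text, label) pairs from an iterable of text lines.

    Assumed (hypothetical) layout: one example per line, "label<TAB>text".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        yield text, int(label)


def iter_batches(examples, batch_size):
    """Group any iterable of examples into lists of at most batch_size items."""
    it = iter(examples)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a large file on disk; a real file object is iterated
# line by line in the same way, without loading the whole file.
demo_file = io.StringIO("1\tgreat movie\n0\tdull plot\n1\tloved it\n")
batches = list(iter_batches(iter_examples(demo_file), batch_size=2))
```

The trade-off is that a generator reads the file sequentially, so any shuffling is limited to the size of the shuffle buffer applied afterwards.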

The best way depends on how you have the data stored, or how you want to store and manipulate it. The simplest in my opinion, with very little downside (unless running on multiple GPUs), is to have the original dataset just yield indices into the data, and to write a normal numpy function for loading the ith example.

dataset = tf.data.Dataset.from_tensor_slices(tf.range(epoch_size))

def tf_map_fn(i):
    def np_map_fn(i):
        return load_ith_example(i)

    inp1, inp2 = tf.py_func(np_map_fn, (i,), Tout=(tf.float32, tf.float32), stateful=False)
    # other preprocessing/data augmentation goes here.

    # unbatched sizes
    inp1.set_shape(shape1)
    inp2.set_shape(shape2)
    return inp1, inp2

dataset = dataset.repeat().shuffle(epoch_size).map(tf_map_fn, 8)

dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)  # start loading data as GPU trains on previous batch

inp1, inp2 = dataset.make_one_shot_iterator().get_next()

Here I assume your outputs are float32 tensors (Tout=...). The set_shape calls aren't strictly necessary, but if you know the shapes, they enable better error checking.
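The snippet above leaves load_ith_example up to you. One possible way to write it, assuming the data can be stored as a text file with one example per line, is to scan the file once to record the byte offset of every line and then seek straight to line i on demand. Everything in this sketch is an assumption for illustration: the "label<TAB>space-separated token ids" layout, the helper build_line_index, the extra path and offsets arguments, and the sentence_size default. It returns plain Python lists and ints; before handing them to tf.py_func you would convert them to numpy arrays and set Tout to the matching dtypes (integer ids would need an integer type rather than tf.float32).

```python
import os
import tempfile


def build_line_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets


def load_ith_example(path, offsets, i, sentence_size=8):
    """Read only line i from disk and return (padded_token_ids, label).

    Assumed (hypothetical) layout: "label<TAB>id id id ...".
    """
    with open(path, "rb") as f:
        f.seek(offsets[i])
        line = f.readline().decode("utf-8")
    label_str, _, ids_str = line.partition("\t")
    # Truncate to sentence_size, then pad with zeros at the end.
    ids = [int(tok) for tok in ids_str.split()][:sentence_size]
    ids += [0] * (sentence_size - len(ids))
    return ids, int(label_str)


# Small demonstration on a temporary file standing in for the real dataset.
fd, demo_path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("1\t4 8 15\n0\t16 23 42 7 9 11 13 17 19\n1\t5\n")

demo_offsets = build_line_index(demo_path)
first_ids, first_label = load_ith_example(demo_path, demo_offsets, 0)
second_ids, second_label = load_ith_example(demo_path, demo_offsets, 1)
os.remove(demo_path)
```

Only the list of offsets stays in memory (one integer per example), so epoch_size in the snippet above would simply be len(offsets).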

So long as your preprocessing doesn't take longer than your network takes to run, this should be just as fast as any other method on a single-GPU machine.

The other obvious way is to convert your data to TFRecords, but that'll take up more space on disk and is more of a pain to manage, if you ask me.

Is there an example which I can implement? BTW, how would this map function work in this case if we consider the IMDB dataset? Here is the implementation of the load function in Keras: https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py — Rohit, Aug 14 '18 at 19:29
I posted a similar, more extended answer [here](https://stackoverflow.com/questions/45828616/streaming-large-training-and-test-files-into-tensorflows-dnnclassifier/45829855#45829855). I'm not familiar with imdb, but the example in this answer only requires you to implement `load_ith_example`. You may have to change how you store the data on disk to do so, or consider writing the examples as tfrecords as explained in the other answer just linked. — DomJack, Aug 14 '18 at 23:06
