
Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros

I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:

#variables
batch_size = 50
dimension = 784
stages = 10

#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)

#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    #convert label to one-hot encoding
    one_hot = tf.one_hot(label, stages)

    #read image file
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)

    return image, one_hot

#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 50

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()

images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()

for _ in range(10):
    dataset = dataset.shuffle(buffer_size = 100)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()

    images = tf.reshape(images, [batch_size,dimension]).eval()
    labels = tf.reshape(labels, [batch_size,stages]).eval()

    train_step.run(feed_dict={x: images, y_:labels})

Somehow, using a higher batch_size will break Python. What I'm trying to do is train my neural network with new batches on each iteration. That's why I'm also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.

What I wanted to do instead (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I will get a numpy array. I will then shuffle the array with numpy.random.shuffle(images) and then pick the first few elements to train on.

e.g.

for _ in range(1000):
    images = tf.reshape(images, [batch_size,dimension]).eval()
    labels = tf.reshape(labels, [batch_size,stages]).eval()

    #shuffle
    np.random.shuffle(images)
    np.random.shuffle(labels)

    train_step.run(feed_dict={x: images[0:train_size], y_:labels[0:train_size]})

But then here comes the problem that I can't batch my whole dataset. It looks like the data is too big for Python to work with. How should I solve this differently?

Since I'm not using the MNIST database, there isn't a function like mnist.train.next_batch(100) that would come in handy for me.


2 Answers

4

Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"

Now you are modifying that pipeline for every batch! What happens is that in the first iteration, the batch shape is, say, [32, 3600]. In the next iteration, the elements of this shape are batched again, to [32, 32, 3600], and so on.
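
A quick way to see this effect (a minimal sketch with a dummy in-memory dataset and made-up sizes, just for illustration):

import tensorflow as tf

#dummy dataset: 100 "flattened images" with 3600 features each (illustrative only)
ds = tf.data.Dataset.from_tensor_slices(tf.zeros([100, 3600]))

ds = ds.batch(32)
print(ds.output_shapes)   # (?, 3600)    -> batches of examples

ds = ds.batch(32)         #applying batch() again wraps the batches once more
print(ds.output_shapes)   # (?, ?, 3600) -> batches of batches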

There's a great tutorial on the TF website where you can find out more about how Datasets work, but here are a few suggestions for how you can resolve your problem.

  • Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset, so your batches will have a good mixture of examples. Also increase the buffer_size argument; it works in a different way than you probably assume. It's usually a good idea to shuffle as early as possible, as it can be a slow operation if you have a large dataset -- the shuffled part of the dataset will have to be read into memory. Here it does not really matter whether you shuffle the filenames and labels or the decoded images and labels -- but the latter will have more work to do, since the dataset is larger by that time.

  • Move batching and the iterator generator to be the last steps, just before starting your training loop.

  • Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details from this Q&A: Tensorflow: create minibatch from numpy array > 2 GB. A sketch combining these points follows below.
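
Putting the three points together, here is a minimal sketch of the reordered pipeline. It reuses the names from the question (filenamesList, filenames, labels, _parse_function, batch_size, dimension, train_step); the repeat() call is an addition of mine so the one-shot iterator does not run out of data during the training loop.

#build the input pipeline ONCE, before the training loop
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenamesList))  #shuffle right after "Step 2"; buffer covers the whole dataset
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()                                 #loop over the data indefinitely
dataset = dataset.batch(batch_size)                        #batching is the last step

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [-1, dimension])               #flatten each image to [batch, 784]

#define the network on `images` and `labels` instead of placeholders,
#so train_step already reads from the iterator and no feed_dict is needed
for _ in range(1000):
    train_step.run()   #inside your (Interactive)Session, as in the MNIST tutorial

Each call to train_step.run() then pulls a fresh, shuffled batch from the pipeline, with no .eval() or feed_dict involved.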

mikkola
  • thank you for your response. I deleted shuffle and batch from my for loop. I just don't get how https://stackoverflow.com/questions/49053569/tensorflow-create-minibatch-from-numpy-array-2-gb solves my problem. I want to do the same in my inner for loop as in the MNIST tutorial. – Manh Khôi Duong Mar 09 '18 at 21:18
  • e.g. picking out random pictures of my dataset and feeding them into the placeholders, then repeating that step. Evaluating (.eval()) will turn my images and labels into NumPy arrays that I can feed to my placeholders. – Manh Khôi Duong Mar 09 '18 at 21:24
  • @ManhKhôiDuong the entire point of using the `Dataset` API is to *not* use placeholders and `feed_dict` to feed data to your model. – mikkola Mar 09 '18 at 21:36
  • I can't find a tensorflow class named "Model" in the link you provided me. – Manh Khôi Duong Mar 10 '18 at 12:43
  • @ManhKhôiDuong yes, you will have to use your own model definition. The linked answer only shows the correct principle of how to use the `Dataset`. – mikkola Mar 10 '18 at 12:47
0

I've been running into a lot of problems with creating TensorFlow datasets, so I decided to use OpenCV to import the images instead.

import cv2
import numpy as np

imgDataset = []
for i in range(len(files)):
    imgDataset.append(cv2.imread(files[i]))  #read each image file as a numpy array (BGR)
imgDataset = np.asarray(imgDataset)

The shape of imgDataset is (num_img, height, width, col_channels), so getting the i-th image is simply imgDataset[i].

Shuffling the dataset and taking just a batch of it can be done like this:

from sklearn.utils import shuffle

X, y = shuffle(X, y)       #shuffles images and labels together, keeping them aligned
X_feed = X[:batch_size]    #take the first batch_size examples
y_feed = y[:batch_size]

Then you feed X_feed and y_feed into your model.
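
For that last step, a minimal sketch of the feeding loop, reusing the placeholders x, y_ and the train_step op from the question's MNIST-style model, and assuming y already holds one-hot labels and that each flattened image matches the placeholder's second dimension:

for _ in range(1000):
    X, y = shuffle(X, y)                             #reshuffle images and labels together each step
    X_feed = X[:batch_size].reshape(batch_size, -1)  #flatten each image; assumed to match x's shape [None, 784]
    y_feed = y[:batch_size]                          #assumed one-hot labels of shape [None, 10]
    train_step.run(feed_dict={x: X_feed, y_: y_feed})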