
I have a directory of images, and a separate file matching image filenames to labels. So the directory of images has files like 'train/001.jpg' and the labeling file looks like:

train/001.jpg 1
train/002.jpg 2
...

I can easily load images from the image directory in TensorFlow by creating a file queue from the filenames:

filequeue = tf.train.string_input_producer(filenames)
reader = tf.WholeFileReader()
key, img = reader.read(filequeue)  # read() returns a (filename key, file contents) pair

But I'm at a loss for how to couple these files with the labels from the labeling file. It seems I need access to the filenames inside the queue at each step. Is there a way to get them? Furthermore, once I have the filename, I need to be able to look up the label keyed by the filename. It seems like a standard Python dictionary wouldn't work because these computations need to happen at each step in the graph.

bschreck

5 Answers


Given that your data is small enough to supply the list of filenames as a Python list, I'd suggest just doing the preprocessing in Python: create two lists (in the same order) of the filenames and the labels, enqueue those into either a `RandomShuffleQueue` or a plain queue, and dequeue from that. If you want the "loops infinitely" behavior of `string_input_producer`, you could re-run the enqueue op at the start of every epoch.

A very toy example:

import tensorflow as tf

f = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"]
l = ["l1", "l2", "l3", "l4", "l5", "l6", "l7", "l8"]

fv = tf.constant(f)
lv = tf.constant(l)

# capacity 10, min_after_dequeue 0, holding (filename, label) string pairs
rsq = tf.RandomShuffleQueue(10, 0, [tf.string, tf.string], shapes=[[], []])
do_enqueues = rsq.enqueue_many([fv, lv])

gotf, gotl = rsq.dequeue()

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    tf.train.start_queue_runners(sess=sess)
    sess.run(do_enqueues)
    for i in range(2):
        one_f, one_l = sess.run([gotf, gotl])
        print("F:", one_f, "L:", one_l)

The key is that you're effectively enqueueing pairs of filenames/labels when you do the enqueue, and those pairs are returned by the dequeue.
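
To actually load the image once a filename has been dequeued, the filename tensor can be fed straight into `tf.read_file()` (as the comments below point out), so no second queue or reader is needed. A minimal sketch of that continuation, my own addition, assuming `gotf` holds real JPEG paths rather than the toy strings above:

raw = tf.read_file(gotf)                     # gotf is the dequeued filename tensor
img = tf.image.decode_jpeg(raw, channels=3)  # assumes the files are JPEGs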

dga
  • Okay, great, that's exactly what I needed! I hadn't thought to just match the two in Python and shuffle them beforehand; I was just trying to use the code from the CIFAR tutorial, which loads the file and then shuffles it afterwards. – bschreck Dec 03 '15 at 01:36
  • Actually, I just tried it and I think my filenames list is too large. Using this code it just hangs, but it works when I reduce the number of elements in my list. There are 87,000 files, by the way. – bschreck Dec 03 '15 at 03:48
  • Interesting, it shouldn't hang, really. Did you increase the `RandomShuffleQueue` capacity to be large enough to handle the number of things you're putting into it? I'll caveat that I've never tried a random shuffle queue that large. :) If you want to save memory, you could rewrite the file as CSV, use a `TextLineReader` piped to a CSV decoder, and then throw those into a queue, with a `QueueRunner` to keep it running. A lot of work for only ~1MB of filenames, though. – dga Dec 03 '15 at 05:28
  • Oh, I didn't realize I needed the capacity to be as big as the number of files; I thought it would just wait until enough were dequeued before adding more. That seems to work, but I still have the problem of how to read the actual file once the RandomShuffleQueue spits out filenames. It seems I need another queue object to enqueue the result into, which only dequeues filenames to feed into a `WholeFileReader`, right? – bschreck Dec 03 '15 at 14:03
  • @bschreck If this is the accepted answer then you should [accept](http://stackoverflow.com/help/someone-answers) it. :) – Guy Coder Dec 18 '15 at 14:18
  • I have the same problem as @bschreck -- this is totally possible, but it doesn't solve the problem of subsequently reading the image. The fact that the file readers (of any kind) require a queue as input rather than accepting the output of a general node is ludicrous; it causes the number of queues required to proliferate. – eriophora Mar 29 '16 at 20:19
  • Oh god, that's exactly what `tf.read_file()` is for. How have I never seen this!? – eriophora Mar 29 '16 at 20:24
  • Thank you so much, that helped me get the hang of the peculiarities of data reading in TensorFlow. – Gooshan Apr 17 '16 at 14:48

Here's what I was able to do.

I first shuffled the filenames and matched the labels to them in Python:

import numpy as np

np.random.shuffle(filenames)
labels = [label_dict[f] for f in filenames]  # label_dict maps filename -> label

Then I created a `string_input_producer` for the filenames with shuffling off, and a FIFO queue for the labels:

lv = tf.constant(labels)
label_fifo = tf.FIFOQueue(len(filenames), tf.int32, shapes=[[]])
file_fifo = tf.train.string_input_producer(filenames, shuffle=False, capacity=len(filenames))
label_enqueue = label_fifo.enqueue_many([lv])

Then to read the image I could use a `WholeFileReader`, and to get the label I could dequeue the FIFO:

reader = tf.WholeFileReader()
key, value = reader.read(file_fifo)
image = tf.image.decode_jpeg(value, channels=3)
image.set_shape([128, 128, 3])
result.uint8image = image  # result is a plain record object, as in the CIFAR-10 tutorial
result.label = label_fifo.dequeue()

And generated the batches as follows:

min_fraction_of_examples_in_queue = 0.4
min_queue_examples = int(num_examples_per_epoch *
                         min_fraction_of_examples_in_queue)
num_preprocess_threads = 16
images, label_batch = tf.train.shuffle_batch(
  [result.uint8image, result.label],
  batch_size=FLAGS.batch_size,
  num_threads=num_preprocess_threads,
  capacity=min_queue_examples + 3 * FLAGS.batch_size,
  min_after_dequeue=min_queue_examples)
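
One caveat (also raised in a comment below): `label_enqueue` has to be run explicitly, or the label FIFO stays empty and its `dequeue()` blocks forever. A minimal sketch of the session setup, assuming the ops above and a `num_steps` defined elsewhere:

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    sess.run(label_enqueue)  # fill the label FIFO before training starts
    for step in range(num_steps):
        image_batch, label_vals = sess.run([images, label_batch])
        # ... run the training step on this batch ...
    coord.request_stop()
    coord.join(threads)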
bschreck
  • This framework for reading in labels fits well into my code, which also uses `tf.WholeFileReader` to read in images from filenames; however, users have to remember to run `sess.run(label_enqueue)` before starting training, otherwise the program hangs there and waits for the enqueue operation to happen. – Zhongyu Kuang Mar 21 '17 at 01:02
  • I was trying to use the same ideas as your code, but I wasn't able to keep the labels in sync with the images. http://stackoverflow.com/questions/43567552/tf-slice-input-producer-not-keeping-tensors-in-sync – rasen58 Apr 23 '17 at 16:50

You could use `tf.py_func()` to implement the mapping from file path to label.

import os

import tensorflow as tf
from tensorflow.python.platform import gfile

# data_pattern, num_epochs, num_readers, width and height are defined elsewhere
files = gfile.Glob(data_pattern)
filename_queue = tf.train.string_input_producer(
    files, num_epochs=num_epochs, shuffle=True)  # list of files to read

def extract_label(s):
    # path-to-label logic for the cat&dog dataset
    return 0 if os.path.basename(str(s)).startswith('cat') else 1

def read(filename_queue):
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    image = tf.image.decode_jpeg(value, channels=3)
    image = tf.cast(image, tf.float32)
    image = tf.image.resize_image_with_crop_or_pad(image, width, height)
    label = tf.cast(tf.py_func(extract_label, [key], tf.int64), tf.int32)
    label = tf.reshape(label, [])
    return image, label

training_data = [read(filename_queue) for _ in range(num_readers)]

...

tf.train.shuffle_batch_join(training_data, ...)
Yuntai Kyong

I used this:

 filename = filename.strip().decode('ascii')
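
Presumably that decode happens inside a Python-side label lookup (e.g. a `tf.py_func`, as in the answer above): the filename tensor arrives as a bytes object, so it has to be decoded before it can be used as a dictionary key. A small sketch of that context (`label_dict` and `lookup_label` are hypothetical names of mine, not from this answer):

import numpy as np
import tensorflow as tf

def lookup_label(filename):
    filename = filename.strip().decode('ascii')  # bytes -> str, as above
    return np.int64(label_dict[filename])        # label_dict: filename -> int label

label = tf.py_func(lookup_label, [key], tf.int64)  # key comes from reader.read()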
Linda MacPhee-Cobb

Another suggestion is to save your data in TFRecord format. That way you can store all the images and all the labels in a single file. For a large number of files this gives several advantages (a minimal sketch of writing and reading such a file follows the list below):

  • data and labels are stored in the same place
  • the data lives in a single file (no need to keep track of various directories)
  • when there are many files (images), opening and closing each one is time-consuming, and seeking each file's location on an SSD/HDD also takes time
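
A minimal sketch of both halves, assuming the same TF 1.x queue-era API used elsewhere on this page and JPEG inputs ('train.tfrecords' is just an illustrative output name; `filenames` and `labels` are the Python lists from the other answers):

import tensorflow as tf

# Writing: pack every (image file, label) pair into one TFRecord file.
writer = tf.python_io.TFRecordWriter('train.tfrecords')
for fname, label in zip(filenames, labels):
    with open(fname, 'rb') as f:
        img_bytes = f.read()  # raw JPEG bytes, decoded later in the graph
    example = tf.train.Example(features=tf.train.Features(feature={
        'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }))
    writer.write(example.SerializeToString())
writer.close()

# Reading: one reader pulls serialized examples from a queue of TFRecord files.
filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.image.decode_jpeg(features['image_raw'], channels=3)
label = tf.cast(features['label'], tf.int32)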
Salvador Dali