
My goal is as follows:

1). Use tf.train.string_input_producer and tf.TextLineReader to read lines from files.

2). Convert the resulting tensors containing the files' lines into ordinary strings using eval to do preprocessing before batching (TensorFlow's limited string operations are insufficient for my purposes)

3). Convert these preprocessed strings back to tensors (presumably using tf.constant ?)

4). Use tf.train.batch on the resulting tensors.

The following code is a simplified version of what I'm working on.

The "After batch." print statement gets executed, but then the REPL hangs on the final eval (batch.eval(session = sess)).

From what I've read, I have a feeling this is because

threads = tf.train.start_queue_runners(coord = coord, sess = sess)

needs to be run after calling tf.train.batch. But if I do this, then the REPL will of course hang on the first eval

evalue = value.eval(session = sess)

needed to do the preprocessing.

What is the best way to convert back and forth between tensors and their values in between queues? (I'm really hoping I can do this without preprocessing my data files beforehand.)

import tensorflow as tf
import os

def process(string):
    return string.upper()

def main():

    sess = tf.Session()

    filenames = tf.constant(["test_data/" + f for f in os.listdir("./test_data")])

    filename_queue = tf.train.string_input_producer(filenames)
    file_reader = tf.TextLineReader()
    key, value = file_reader.read(filename_queue)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord = coord, sess = sess)

    # Pull one line out of the queue as an ordinary Python string
    evalue = value.eval(session = sess)
    proc_value = process(evalue)
    # Turn the preprocessed string back into a (fixed) tensor
    tensor_value = tf.constant(proc_value)

    batch = tf.train.batch([tensor_value], batch_size = 2, capacity = 2)

    print "After batch."
    print batch.eval(session = sess)  # hangs here
tarski
  • Not the answer you are looking for, but I would advise against using queue runners to read your files. You are dropping in and out of TensorFlow, which interferes with the data flow; reading text files is no faster in TensorFlow than outside it, and shuffling and batching can be done after the post-processing, or in plain Python as well. Just read your files in pure Python and feed the resulting lines either directly into model placeholders or into a batching input pipeline – Mad Wombat Apr 28 '17 at 20:15
  • Thanks, @MadWombat. – tarski Apr 28 '17 at 20:32
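
Following the suggestion in the comment above, a minimal pure-Python sketch of the same pipeline (the helper names read_lines and batches are hypothetical; no TensorFlow queues are involved, and the resulting batches would be fed to placeholders via feed_dict):

```python
import os

def process(line):
    # same preprocessing as in the question: uppercase each line
    return line.upper()

def read_lines(directory):
    # read every line from every file in the directory, in plain Python
    for fname in sorted(os.listdir(directory)):
        with open(os.path.join(directory, fname)) as f:
            for line in f:
                yield process(line.rstrip("\n"))

def batches(lines, batch_size):
    # group the preprocessed lines into fixed-size batches
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch, if any
```

Each yielded batch is an ordinary list of strings, so there is no tensor-to-value conversion problem at all until the data reaches the model.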

1 Answer


We discussed a slightly different approach, which I think achieves what you need here:

Converting TensorFlow tutorial to work with my own data

Not sure what file formats you are reading, but the above example reads CSVs row-by-row and packs them into randomized batches.

If you are reading from a CSV then, in a nutshell, instead of returning value from file_reader.read(filename_queue) immediately, you could do some pre-processing first and return THAT instead, something like this:

rDefaults = [['a'] for _ in range(ROW_LENGTH)]  # default every cell to the string 'a'
_, value = file_reader.read(filename_queue)
whole_row = tf.decode_csv(value, record_defaults=rDefaults)
cell1 = tf.slice(whole_row, [0], [1])  # one specific cell that contains a string
cell2 = tf.slice(whole_row, [1], [1])  # another cell that contains a string
# do some processing on cell1 and cell2
return cell1, cell2
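
For comparison, the same row handling can be sketched outside TensorFlow in plain Python (the helpers split_row and pick_cells are hypothetical; the csv module stands in for tf.decode_csv, and list indexing for the tf.slice calls):

```python
import csv
import io

def split_row(line, row_length):
    # plain-Python analogue of tf.decode_csv with string record_defaults:
    # parse one CSV line into a list of row_length string cells
    row = next(csv.reader(io.StringIO(line)))
    # pad with the default 'a' if the line has fewer cells than expected
    row += ['a'] * (row_length - len(row))
    return row

def pick_cells(row):
    # analogue of the two tf.slice calls: take the first and second cell,
    # then "do some processing" (uppercasing here as a placeholder)
    return row[0].upper(), row[1].upper()
```

This makes the shape of the answer's graph explicit: decode the whole row once, then carve out and transform individual string cells before they ever reach the batching queue.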
VS_FF