
This is a follow-on from my last question, Converting from Pandas dataframe to TensorFlow tensor object.

I'm now on the next step and need some more help. I'm trying to replace this line of code

batch = mnist.train.next_batch(100)

with one that works for my own data. I've found this answer on Stack Overflow: Where does next_batch in the TensorFlow tutorial batch_xs, batch_ys = mnist.train.next_batch(100) come from? But I don't understand:

1) Why .next_batch() doesn't work on my tensor. Am I creating it incorrectly?

2) How to implement the pseudocode that was given in the answer to that question on .next_batch().

I currently have two tensor objects, one with the parameters I wish to use to train the model (dataVar_tensor) and one with the correct results (depth_tensor). I obviously need to preserve their row correspondence so that each correct result stays matched with its parameters.
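From the pseudocode in that answer, my understanding is that the replacement would look something like this minimal NumPy sketch (the function is my own attempt at the idea, not code from the tutorial):

import numpy as np

def next_batch(features, labels, batch_size):
    # sample a random set of row indices, then take the same rows
    # from both arrays so each feature row stays paired with its label
    idx = np.random.choice(len(features), batch_size, replace=False)
    return features[idx], labels[idx]

But this kind of indexing works on NumPy arrays, not on my dataVar_tensor and depth_tensor objects, which is where I'm stuck.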

Please can you take some time to help me understand what's going on and to replace this line of code?

Many thanks

  • Just saw your update on the previous post. Good to see you got it working. It seems like you are trying things backward -- first loading data from CSV into a DataFrame and then trying to read from the DataFrame in batches? My impression is that the 'typical' TF way is to read directly out of the CSV files; that way you get TF's helpful queuing/randomization/batching functionality built-in. – VS_FF Feb 17 '17 at 18:00
  • See this discussion we had with someone earlier about the mechanics of reading lines out of multiple CSV files. Hopefully it is clear enough: http://stackoverflow.com/questions/42175609/using-multiple-input-pipeline-in-tensorflow/42177088?noredirect=1#comment71532417_42177088 – VS_FF Feb 17 '17 at 18:01
  • By the way, this also avoids your issue of converting DataFrames into tensors: everything gets sliced and loaded into tensors directly from the CSV, and only when needed rather than up-front, which saves resources. – VS_FF Feb 17 '17 at 18:02
  • @VS_FF I have one text file that contains the variables I want to train on, the anticipated result, and a bunch of other stuff. Are you saying that I can do all the data splitting and preparation directly in TensorFlow? I'll be honest, I didn't fully understand your example in that other thread – jlt199 Feb 21 '17 at 15:30
  • Yes, it does all of the following: reads the text line-by-line, splits each line into a set of observations, a label for the observation, and some other stuff for monitoring. TF then packs each line-read operation into a batch of the given size and randomizes the sampling process, so the file is not read sequentially but sampled at random. The only thing is that it expects a CSV file; I assume yours is also comma or space delimited? – VS_FF Feb 21 '17 at 16:29
  • @VS_FF Yes, mine is a csv file. I'll have to go back and have another look at the example you provided in the other thread, because I am in a right mess – jlt199 Feb 21 '17 at 16:46
  • I posted the full code in the Answer area; it should look pretty clean. In the other example I cut out the unnecessary lines, which made the formatting very difficult to follow. The code here is complete and runs on my machine. – VS_FF Feb 21 '17 at 17:00

1 Answer


I stripped out the irrelevant stuff so as to preserve the formatting and indentation. Hopefully it should be clear now. The following code reads a CSV file in batches of N lines (N is specified in a constant at the top). Each line contains a date (the first cell), then a list of floats (480 cells), and a one-hot vector (3 cells). The code simply prints the batches of dates, floats, and one-hot vectors as it reads them. The place where it prints them is where you'd normally run your model and feed these in place of the placeholder variables (a sketch of that appears right after the code).

Just keep in mind that the code reads each line as strings and then converts the relevant cells into floats, simply because the first cell is easier to read as a string. If all your data is numeric, set the record defaults to a float/int rather than 'a' and get rid of the code that converts strings to floats; it isn't needed in that case.

I added some comments to clarify what it's doing. Let me know if something is unclear.

import tensorflow as tf

fileName = 'YOUR_FILE.csv'

try_epochs = 1
batch_size = 3

TD = 1 # this is my date-label for each row, for internal purposes
TS = 480 # this is the list of features, 480 in this case
TL = 3 # this is the one-hot vector of length 3 representing the label

# set defaults to something (TF requires defaults for the number of cells you are going to read)
rDefaults = [['a'] for _ in range(TD + TS + TL)]

# function that reads the input file, line-by-line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=False) # my file has no header row
    _, csv_row = reader.read(filename_queue) # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
    dateLbl = tf.slice(data, [0], [TD]) # the first cell is my 'date-label' for internal purposes
    features = tf.string_to_number(tf.slice(data, [TD], [TS]), tf.float32) # the next 480 cells are the features
    label = tf.string_to_number(tf.slice(data, [TD+TS], [TL]), tf.float32) # the remaining 3 cells are the one-hot label
    return dateLbl, features, label

# function that packs each read line into batches of specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)  # this refers to multiple files, not line items within files
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000 # minimum number of elements kept buffered, for good shuffling
    capacity = min_after_dequeue + 3 * batch_size # maximum number of elements held in memory
    # this packs the above lines into a batch of size you specify:
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [dateLbl, features, label], 
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch

# these are the date label, features, and label:
dateLbl, features, labels = input_pipeline(fileName, batch_size, try_epochs)

with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer()) # needed for the num_epochs counter in string_input_producer

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            # load date-label, features, and label:
            dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])      

            print(dateLbl_batch)
            print(feature_batch)
            print(label_batch)
            print('----------')

    except tf.errors.OutOfRangeError:
        print("Done looping through the file")

    finally:
        coord.request_stop()

    coord.join(threads)
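
For completeness, in place of the print statements you would normally run your model on each batch. A minimal sketch (x, y_, and train_step are placeholders/ops from your own model, not defined in the code above):

# inside the while loop, instead of the prints;
# x and y_ are your model's input/label placeholders and train_step is
# your optimizer op -- none of these are defined in the code above
sess.run(train_step, feed_dict={x: feature_batch, y_: label_batch})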
  • I think I get the gist of what is going on in this code, thank you. I've been able to get the code working on my data. However, I can't see how to edit the code to let me filter on some of the values. For example, in the current case, I'm only interested in rows where ActualIE = 1. Can this be done? Thanks again for taking the time to help me – jlt199 Feb 21 '17 at 21:30
  • Maybe experiment a bit with the return values of the first defined function? You can see that the code is definitely there to evaluate conditions such as 'ActualIE == 1', whatever that might be. The part I'm not sure about is, for example, whether tf.train.shuffle_batch will understand down the line that a given line needs to be skipped if that function returned null, or some similar logic? – VS_FF Feb 21 '17 at 21:39
  • I've tried adding a while loop to the 'read_from_csv' function, but I can't access the number from the tensor object to set as a condition on the loop. Is there an easy way to do this? – jlt199 Feb 22 '17 at 15:47
  • I think it would probably be easier and cleaner to do that filtering outside of TF by using the DataFrame; that way you leave TF to do what it's actually meant to do. For example, create the original DF = pandas.read_csv(), then another DF_1 = DF[YourCondition], and then save DF_1.to_csv() (something like the sketch below)? – VS_FF Feb 22 '17 at 21:15
  • Oh yes, that's such a simple solution that I didn't think of it! :) – jlt199 Feb 23 '17 at 17:37
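
For the record, a minimal sketch of the pandas pre-filtering suggested in the comments above (the column index 0 and the file names are placeholders; since the file has no header row, the ActualIE column has to be addressed by position):

import pandas as pd

# read the raw CSV, keep only the rows matching the condition,
# and write a filtered copy for the TF pipeline above to consume
df = pd.read_csv('YOUR_FILE.csv', header=None)
df_filtered = df[df[0] == 1]  # placeholder: whichever column holds ActualIE
df_filtered.to_csv('YOUR_FILE_filtered.csv', header=False, index=False)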