It looks like you are reading your CSV into a DataFrame. You can certainly implement a batching process this way by hand, but there is an effective built-in way of building queues and batches in TF. It's a bit convoluted, but it works well for serving rows either sequentially or by random shuffling, which is quite convenient. Just make sure that your rows are all of equal length; that way you can easily specify which cells represent the Xs and which represent the Ys.
The two functions you need for this are tf.decode_csv and tf.train.shuffle_batch (or tf.train.batch if you don't need random shuffling).
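If it helps to see what tf.decode_csv is doing conceptually, here is a rough pure-Python analogue (just a sketch of the idea, not TF code): it splits one CSV line into cells and substitutes a default wherever a cell is empty.

```python
import csv
from io import StringIO

def decode_csv_like(csv_row, record_defaults):
    """Rough pure-Python analogue of tf.decode_csv: split one CSV line
    into cells, substituting a default wherever a cell is empty."""
    cells = next(csv.reader(StringIO(csv_row)))
    return [cell if cell != '' else default[0]
            for cell, default in zip(cells, record_defaults)]

defaults = [['a']] * 6  # one default per expected cell, as TF requires
print(decode_csv_like('S1,2,1,0,,0', defaults))  # missing 5th cell -> 'a'
```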
We discussed this at length in this post, which includes a full working code example:
TF CSV Batching Example
It looks like your data is all numeric and the Ys are in one-hot format, so the MNIST example should be a good basis for implementing your estimation function.
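For reference, "one-hot" just means each label is a vector of zeros with a single 1.0 at the class index -- a quick sketch:

```python
def one_hot(label, num_classes):
    """Turn a class index into a one-hot float vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

print(one_hot(1, 3))  # -> [0.0, 1.0, 0.0]
```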
*** UPDATE:
This is roughly the order of operations:
1. define the two functions as shown in the linked example -- one to read the CSV file row by row and the other to pack those rows into batches of N (either randomly or sequentially)
2. start the reading loop via while not coord.should_stop():
this loop will run until it exhausts the contents of all the CSV file(s) that you feed to the queues
3. in each iteration of the loop, doing sess.run
on these variables gives you your batches of Xs and Ys, plus whatever extra meta-type content you may want from each line of your CSV file, such as the date label in this example (in your case it may be the student's name, etc.):
dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])
When TF reaches the end of your file(s), it throws an exception, which is why all of the above code is in a try/except block -- by catching that exception you know that you are done.
The above functionality gives you very granular, cell-by-cell access to your CSV files and lets you pack them into batches of N, for the number of epochs you want, etc.
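If the buffering behind tf.train.shuffle_batch seems opaque, here is a rough pure-Python sketch of the idea (not TF code): keep a buffer of at least min_after_dequeue rows and draw each batch at random from it.

```python
import random

def shuffle_batch_like(rows, batch_size, min_after_dequeue, seed=0):
    """Rough pure-Python analogue of tf.train.shuffle_batch: keep a buffer
    of at least min_after_dequeue rows and draw each batch at random from it."""
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        buffer.append(row)
        # once the buffer is full enough, draw a random batch from it
        if len(buffer) >= min_after_dequeue + batch_size:
            yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(batch_size)]
    # drain what is left at the end of the input
    while len(buffer) >= batch_size:
        yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(batch_size)]

batches = list(shuffle_batch_like(range(10), batch_size=3, min_after_dequeue=4))
print(batches)  # three shuffled batches of 3; one leftover row is dropped
```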
*** UPDATE 2:
Here's the full code that should read your CSV file in batches, in the format that you have. It simply prints the contents of each batch. From here, you can easily connect this code with the code that actually does your training etc.
import tensorflow as tf

fileName = 'data/study.csv'
try_epochs = 1
batch_size = 3

S = 1 # number of cells holding your Student label
F = 2 # number of cells holding your features
L = 3 # number of cells in the one-hot vector representing the label

# set defaults to something (TF requires a default for every cell you are going to read)
rDefaults = [['a'] for row in range(S+F+L)]

# function that reads the input file, line-by-line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=True) # skip the header line
    _, csv_row = reader.read(filename_queue) # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
    studentLbl = tf.slice(data, [0], [S]) # the first cell is the student label, kept as a string
    features = tf.string_to_number(tf.slice(data, [S], [F]), tf.float32) # the next F cells are the features
    label = tf.string_to_number(tf.slice(data, [S+F], [L]), tf.float32) # the remaining L cells are the one-hot label
    return studentLbl, features, label

# function that packs each line read into batches of the specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True) # this shuffles the order of multiple files, not the lines within a file
    studentLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000 # min number of examples to keep in the shuffling buffer
    capacity = min_after_dequeue + 3 * batch_size # max number of examples to hold in memory
    # this packs the lines read above into a batch of the size you specify:
    studentLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [studentLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return studentLbl_batch, feature_batch, label_batch

# these are the student labels, features, and labels:
studentLbl, features, labels = input_pipeline(fileName, batch_size, try_epochs)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        while not coord.should_stop():
            # load a batch of student labels, features, and labels:
            studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])
            print(studentLbl_batch)
            print(feature_batch)
            print(label_batch)
            print('----------')
    except tf.errors.OutOfRangeError:
        print("Done looping through the file")
    finally:
        coord.request_stop()
        coord.join(threads)
Assuming that your CSV file looks something like this:
name studytime attendance A B C
S1 2 1 0 1 0
S2 3 2 1 0 0
S3 4 3 0 0 1
S4 3 5 0 0 1
S5 4 4 0 1 0
S6 2 1 1 0 0
The above code should print the following output:
[[b'S5']
[b'S6']
[b'S3']]
[[ 4. 4.]
[ 2. 1.]
[ 4. 3.]]
[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]]
----------
[[b'S2']
[b'S1']
[b'S4']]
[[ 3. 2.]
[ 2. 1.]
[ 3. 5.]]
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
----------
Done looping through the file
So instead of printing the contents of the batches, simply use them as the Xs and Ys for your training via the feed_dict.
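For example, a minimal sketch of that hand-off, assuming your model has placeholders named x and y_ and a train_step op (all three names are hypothetical -- substitute your own):

```python
# Hypothetical batch shapes matching the example above:
# 3 rows per batch, 2 features, 3 one-hot classes
feature_batch = [[4.0, 4.0], [2.0, 1.0], [4.0, 3.0]]
label_batch = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Inside the while-loop above, replace the prints with something like
# (x, y_, and train_step come from your own model -- hypothetical names):
# sess.run(train_step, feed_dict={x: feature_batch, y_: label_batch})
```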