
I have just begun studying TensorFlow, and I have a problem when training on data. My problem is reading a CSV file and then using softmax classification to estimate the grade of a student (A, B, or C) based on their study time and class attendance.


I define the columns and then load the CSV file as:

COLUMNS = ["studytime", "attendance", "A", "B", "C"]
FEATURES = ["studytime", "attendance"]
LABEL = ["A", "B", "C"]
training_set = pd.read_csv("hw1.csv", skipinitialspace=True,
                       skiprows=1, names=COLUMNS)

After that, I define tensors for the features and labels like this:

feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]
labels = [tf.contrib.layers.real_valued_column(k) for k in LABEL]

Then I follow the approach for training softmax with MNIST data at Tensorflow for MNIST.

But I don't know how to define batch_xs and batch_ys to train in this loop:

for _ in range(1000):
    batch_xs = ????
    batch_ys = ????
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

And how can I define a function to estimate the grades of three students given their study time and attendance, for example [11, 7], [3, 4], [1, 0]?

Could you help me figure out this problem?

Thanks in advance,

Chuong Nguyen

2 Answers


It looks like you are reading your CSV into a DataFrame? You can certainly implement a batching process by hand that way, but there is an effective built-in way of building queues and batches in TF. It's a bit convoluted, but it works well for serving rows either sequentially or with random shuffling, which is quite convenient. Just make sure that your rows are all of equal length; that way you can easily specify which cells represent Xs and which represent Ys.
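If you do want the by-hand route, here is a minimal sketch of what it might look like (assuming training_set, FEATURES, and LABEL from your question; next_batch is just an illustrative helper name):

import numpy as np

def next_batch(df, batch_size):
    # sample batch_size random rows from the DataFrame...
    rows = df.sample(batch_size)
    # ...and split each row into features (Xs) and one-hot labels (Ys)
    batch_xs = rows[FEATURES].values.astype(np.float32)
    batch_ys = rows[LABEL].values.astype(np.float32)
    return batch_xs, batch_ys

# batch_xs, batch_ys = next_batch(training_set, 8)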

The two functions you need for this are tf.decode_csv and tf.train.shuffle_batch (or tf.train.batch if you don't need random shuffling).
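Roughly, they fit together like this; this is just a bare skeleton with a placeholder file name and placeholder column defaults, not your actual schema:

import tensorflow as tf

# a queue of input file names; shuffle here refers to files, not rows within a file
filename_queue = tf.train.string_input_producer(['data.csv'], num_epochs=1)

# read one line at a time and split it into typed columns
reader = tf.TextLineReader(skip_header_lines=1)
_, row = reader.read(filename_queue)
columns = tf.decode_csv(row, record_defaults=[[0.0], [0.0], [0.0]])

# pack the single rows into randomly shuffled batches
batch = tf.train.shuffle_batch(columns, batch_size=3,
                               capacity=100, min_after_dequeue=10)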

We discussed this at length in this post, which includes a full working code example: TF CSV Batching Example

It looks like your data is all numeric and Ys are in one-hot format, so the MNIST example should be good for implementing your estimation function.
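And for the estimation part of your question, a minimal sketch in the spirit of the MNIST tutorial (assuming x, W, b, and sess are set up as in that tutorial, with 2 features and 3 classes, and the model has already been trained):

# softmax output: one probability per grade (A, B, C) for each input row
y = tf.nn.softmax(tf.matmul(x, W) + b)

# new students to score, as [studytime, attendance] rows
new_students = [[11., 7.], [3., 4.], [1., 0.]]

# index of the most likely grade for each student (0=A, 1=B, 2=C)
predicted = tf.argmax(y, 1)
print(sess.run(predicted, feed_dict={x: new_students}))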

**UPDATE:** This is roughly the order of operations:

1. Define the two functions as shown in the linked example: one to read the CSV file row by row, and the other to pack each of those rows into batches of N (either randomly or sequentially).
2. Start the reading loop via while not coord.should_stop(): this loop will run until it exhausts the content of all of the CSV file(s) that you feed to the queues.
3. In each iteration of the loop, doing sess.run on these variables gives you your batches of Xs and Ys, plus whatever extra meta-type content you may want from each line of your CSV file, such as the date-label in that example (in your case it may be the student's name or whatever):

dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])   

When TF reaches the end of your file(s), it will throw an exception, which is why all of the above code is in the try/except block; by catching that exception you know that you are done.

The above functionality gives you very granular, cell-by-cell access to your CSV files and allows you to batch them into batches of N, for the number of epochs you want, etc.

**UPDATE 2:**

Here's the full code that should read your CSV file in batches, in the format that you have. It simply prints the content of each batch; from here, you can easily connect this code with the code that actually does the training.

import tensorflow as tf

fileName = 'data/study.csv'

try_epochs = 1
batch_size = 3

S = 1 # number of student-label columns
F = 2 # number of feature columns
L = 3 # number of columns in the one-hot label

# set defaults to something (TF requires defaults for the number of cells you are going to read)
rDefaults = [['a'] for _ in range(S + F + L)]

# function that reads the input file, line-by-line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=True) # skip the header line
    _, csv_row = reader.read(filename_queue) # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
    studentLbl = tf.slice(data, [0], [S]) # the first cell is the student label
    features = tf.string_to_number(tf.slice(data, [S], [F]), tf.float32) # the next F cells are the features
    label = tf.string_to_number(tf.slice(data, [S+F], [L]), tf.float32) # the remaining L cells are the one-hot label
    return studentLbl, features, label

# function that packs each read line into batches of specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)  # this refers to multiple files, not line items within files
    studentLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000 # minimum number of rows to keep buffered for shuffling
    capacity = min_after_dequeue + 3 * batch_size # maximum number of rows held in memory
    # this packs the above lines into a batch of the size you specify:
    studentLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [studentLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return studentLbl_batch, feature_batch, label_batch

# these are the student label, features, and label:
studentLbl, features, labels = input_pipeline(fileName, batch_size, try_epochs)

with tf.Session() as sess:

    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            # load student-label, features, and label as a batch:
            studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])

            print(studentLbl_batch)
            print(feature_batch)
            print(label_batch)
            print('----------')

    except tf.errors.OutOfRangeError:
        print("Done looping through the file")

    finally:
        coord.request_stop()

    coord.join(threads)

Assuming that your CSV file looks something like this:

name,studytime,attendance,A,B,C
S1,2,1,0,1,0
S2,3,2,1,0,0
S3,4,3,0,0,1
S4,3,5,0,0,1
S5,4,4,0,1,0
S6,2,1,1,0,0

The above code should print the following output:

[[b'S5']
 [b'S6']
 [b'S3']]
[[ 4.  4.]
 [ 2.  1.]
 [ 4.  3.]]
[[ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]]
----------
[[b'S2']
 [b'S1']
 [b'S4']]
[[ 3.  2.]
 [ 2.  1.]
 [ 3.  5.]]
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
----------
Done looping through the file

So instead of printing the content of the batches, simply use them as the Xs and Ys for your training in the feed_dict.
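Concretely, inside the while not coord.should_stop(): loop, something like this (assuming train_step, x, and y_ come from your MNIST-style model):

# each iteration pulls a fresh batch from the queues...
studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])
# ...and the batch serves directly as batch_xs / batch_ys for the training step
sess.run(train_step, feed_dict={x: feature_batch, y_: label_batch})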

  • Thanks for your suggestion, Mr. VS_FF. I just read the post you mentioned above; I understand some of the main points, but it actually seems complicated to me. In your answer, you said that I need to use tf.train.batch, but I'm still confused about how it can be applied to define batch_xs and batch_ys. Could you show me more clearly? In MNIST, they used the code mnist.train.next_batch(), but in my problem, I don't know how to revise it for my case. – Chuong Nguyen May 13 '17 at 13:21
  • I updated the original answer to give you an overview of what the pieces of that code do. In the MNIST example they do it differently, but I find this approach nice when it comes specifically to reading CSV files, particularly for randomized batches, and particularly if you have multiple CSV files that you want to shuffle together. – VS_FF May 13 '17 at 14:29
  • Mr. VS_FF: Thanks for your update. I just read the code you mentioned above. To be sure, I ran it directly, changing only the file name and setting TS=2. However, your code didn't work in my case for reading the CSV file. I found that your code works well until it reaches the last line, coord.join(threads). Some of the errors are like: "StringToNumberOp could not correctly convert string". Could you help me figure it out? – Chuong Nguyen May 14 '17 at 02:27
  • One thing that's mentioned in that note is that in my case each row contains both strings and numbers. Since you need to provide default values for each cell in each row for TF to read the row, it's easier to provide the default values as strings and then convert the necessary cells to numbers. Unfortunately it won't work the other way around (i.e. if you provide defaults as floats but have some string cells, TF will throw an error). So if all your data is numeric, you can skip that whole logic and just read the cells as floats or ints – VS_FF May 14 '17 at 07:19
  • But regarding your error, I'd suspect that you are getting it either because you are reaching some sort of EOF, or because you have some characters it won't recognize, or maybe it's an encoding issue. As I said, the process is a bit crude, because it signals EOF as an exception that you have to handle. Maybe it's something related to that? It's hard to say without knowing the content of your CSV – VS_FF May 14 '17 at 07:21
  • Mr. VS_FF: Thanks for your answer. Here is the link to my CSV: https://i.stack.imgur.com/93kQh.png It contains a string, the student name, in each row. – Chuong Nguyen May 14 '17 at 07:45
  • I updated my original answer with the full code that should run through your CSV file in batches and print their content, including the student label. I suspect that the only issue you had with the original code in the linked answer is that reading the CSV file there assumed no header line (as was indicated in the comment), whereas here you have a header line in your CSV file. So I changed the code to say skip_header_lines=True. I replicated the content of your CSV and the code works well. – VS_FF May 14 '17 at 08:21
  • One more thing, just in case: I have noticed in the past that TF has an issue reading CSV files that were modified/saved in Excel. I don't know if that's a problem with Excel or with TF, but I have a simple workaround of simply loading the CSV into a Pandas DataFrame and then re-dumping it into another CSV. From there, TF reads it fine. So just beware if you are editing your CSVs in Excel. – VS_FF May 14 '17 at 08:22
  • @Mr.VS_FF: Thanks for your answer, the code works very well now, it's great. However, I have a problem: the accuracy is very low, 37.5%, using the code `print(sess.run(accuracy, feed_dict={x: feature_batch, y_: label_batch}))`. I don't know why the accuracy is so low. Could you help me figure it out? – Chuong Nguyen May 14 '17 at 13:53
  • Just to make sure, are you running sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys}) as well? Doing sess.run just on accuracy won't execute your train_step automatically. Also, keep in mind that doing sess.run twice, once for train_step and once for accuracy, will read from your queues twice, so it will waste your data. The best thing to do would be to run sess.run on both train_step and accuracy simultaneously and keep watching how your accuracy changes over time; something like tp, acc = sess.run([train_step, accuracy], feed_dict={x: feature_batch, y_: label_batch}) and then print(acc), as in the sketch after this comment thread. – VS_FF May 14 '17 at 15:05
  • If your accuracy isn't improving over time, then you just have to think about whether your data is actually appropriate for the problem. There's no magic. Make sure to use TensorBoard to watch how your accuracy, loss, etc. behave over time across runs with different hyper-parameters (optimizer type, learning rate, dropout, number of epochs, batch size, etc.). It's super useful – VS_FF May 14 '17 at 15:06
  • @Mr.VS_FF: Thanks for your answer. I ran sess.run on both as you said, and tried various learning rates, but the result did not improve. I have posted my code below. Could you help me see where I'm wrong? – Chuong Nguyen May 14 '17 at 23:41
  • How large is your data set? Can you post a link to the full data? I will give it a try... – VS_FF May 15 '17 at 06:31
  • @Mr.VS_FF: Thanks very much. Here is my training data. It is small: [training data](https://drive.google.com/file/d/0B6T98SInKVlTUGRIcENLcUhCWEE/view?usp=sharing) – Chuong Nguyen May 15 '17 at 06:52
  • The whole point of this exercise is to have thousands, or better tens or hundreds of thousands, of observations to train your model on. You aren't going to get far with just a handful of observations. The problem is, even if you get accuracy of ~100% (which you easily can, with the right number of weights/etc.), your network will simply be memorizing the data perfectly, which will make it useless on data it hasn't seen, regardless of the accuracy. – VS_FF May 15 '17 at 08:32
  • Get at least >1000 observations before trying anything – VS_FF May 15 '17 at 08:32
  • @Mr.VS_FF: Thanks very much, I succeeded in training the data as you suggested. I could not have done it without your help. It is a great help for my start in TF. – Chuong Nguyen May 16 '17 at 00:27
  • @Mr.VS_FF: I was thinking more deeply about how softmax works, and I tried the following code: `correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))` `print(sess.run(correct_prediction, feed_dict={x: feature_batch, y_: label_batch}))`. The result is `[ True True False False True True True True]`. I have thought about it a lot, but I cannot explain why the result has 8 entries. Could you help me figure it out? – Chuong Nguyen May 16 '17 at 00:33
  • You have 8 observations in your dataset. tf.argmax on y gives you the index of the one-hot vector that is on in the prediction. tf.argmax on y_ gives you the index of the one-hot vector that is on in the actual label. Then tf.equal compares, for each of your 8 entries, whether the two indices match. So you get 8 values of true/false. Is that what you were asking? – VS_FF May 16 '17 at 07:05
  • Once again, I would really not pay too much attention to this result. It is meaningless with just 8 observations. You need to run this same step on thousands of observations so that your weights and biases keep adjusting with every run. Only after that will your predictions (hopefully) start to acquire any meaning. At this stage, you are getting your true/false results either by chance (because your weights and biases are initialized randomly) or simply because you have enough parameters (number of weights/biases) to memorize which combination gives the best result. – VS_FF May 16 '17 at 07:07
  • No matter how accurate it gets, it won't be useful, because it won't be able to generalize to data it hasn't seen. As I said earlier, get a few thousand observations and only then come back to this problem. – VS_FF May 16 '17 at 07:08
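A minimal sketch of the combined train/accuracy run suggested in the comment thread above (one sess.run per batch, so the queue is read only once; train_step, accuracy, x, and y_ are assumed to be defined as in the answer below):

while not coord.should_stop():
    # pull one batch from the queues
    studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])
    # a single sess.run executes both the training step and the accuracy op on the same batch
    _, acc = sess.run([train_step, accuracy],
                      feed_dict={x: feature_batch, y_: label_batch})
    print(acc)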

Here is my attempt, but the accuracy is not as high as I expected.

import tensorflow as tf

fileName = 'hw1.csv'

try_epochs = 1
batch_size = 8

S = 1 # number of student-label columns
F = 2 # number of feature columns
L = 3 # number of columns in the one-hot label

# set defaults to something (TF requires defaults for the number of cells you are going to read)
rDefaults = [['a'] for _ in range(S + F + L)]

# function that reads the input file, line-by-line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=True) # skip the header line
    _, csv_row = reader.read(filename_queue) # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
    studentLbl = tf.slice(data, [0], [S]) # the first cell is the student label
    features = tf.string_to_number(tf.slice(data, [S], [F]), tf.float32) # the next F cells are the features
    label = tf.string_to_number(tf.slice(data, [S+F], [L]), tf.float32) # the remaining L cells are the one-hot label
    return studentLbl, features, label

# function that packs each read line into batches of specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)  # this refers to multiple files, not line items within files
    studentLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000 # minimum number of rows to keep buffered for shuffling
    capacity = min_after_dequeue + 3 * batch_size # maximum number of rows held in memory
    # this packs the above lines into a batch of the size you specify:
    studentLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [studentLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return studentLbl_batch, feature_batch, label_batch

# these are the student label, features, and label:
studentLbl, features, labels = input_pipeline(fileName, batch_size, try_epochs)

x = tf.placeholder(tf.float32, [None, 2])

W = tf.Variable(tf.zeros([2, 3]))

b = tf.Variable(tf.zeros([3]))

# keep the raw logits separate: softmax_cross_entropy_with_logits expects
# unscaled logits, not the output of tf.nn.softmax
logits = tf.matmul(x, W) + b
y = tf.nn.softmax(logits)

y_ = tf.placeholder(tf.float32, [None, 3])

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))

train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy)


with tf.Session() as sess:

    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            # load student-label, features, and label as a batch:
            studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])

            print(studentLbl_batch)
            print(feature_batch)
            print(label_batch)
            print('----------')
            batch_xs = feature_batch
            batch_ys = label_batch
            sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})  # feeding data

    except tf.errors.OutOfRangeError:
        print("Done looping through the file")

    finally:
        coord.request_stop()

    coord.join(threads)

    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))

    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    print(sess.run(accuracy, feed_dict={x: feature_batch, y_: label_batch}))

    print(sess.run(W))
    print(sess.run(b))

The accuracy:

  0.375

W, b:

  [[ 0.00555556  0.00972222 -0.01527778]
   [ 0.00555556  0.01388889 -0.01944444]]
  [-0.00277778  0.00138889  0.00138889]