
I have been trying to get TensorFlow working on a multi-class Kaggle problem. The data consists of 6 features, all of which I have converted to numeric observations, and the goal is to use these 6 features to predict a trip type, where there are 38 different trip types. The following code is what I have so far, including what I used to format the CSV file. The code runs, but the output starts off OK on run 0 and then degenerates to the same very poor value for the remainder of the runs. Here is an example of the output while it is running:

Run 0,0.268728911877
Run 1,0.0108088823035
Run 2,0.0108088823035
Run 3,0.0108088823035
Run 4,0.0108088823035
Run 5,0.0108088823035
Run 6,0.0108088823035
Run 7,0.0108088823035
Run 8,0.0108088823035
Run 9,0.0108088823035
Run 10,0.0108088823035
Run 11,0.0108088823035
Run 12,0.0108088823035
Run 13,0.0108088823035
Run 14,0.0108088823035

And the code:

import tensorflow as tf
import numpy as np
from numpy import genfromtxt
import pandas as pd
from sklearn.cross_validation import train_test_split
# function buildWalmartData reads a csv file, converts it to numpy arrays,
# splits it into training and testing sets, then saves each set to the target directory
def buildWalmartData():
    df = pd.read_csv('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/full_train_complete.csv')
    df = df.drop('Unnamed: 0', 1) # 1 specifies axis to remove
    df_data = np.array(df.drop('TripType', 1).values) # convert to numpy array
    df_label = np.array(df['TripType'].values) # convert to numpy array
    X_train, X_test, y_train, y_test = train_test_split(df_data, df_label, test_size=0.25, random_state=50)
    with open('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-training.csv', 'w') as f:
        for i, j in enumerate(X_train):
            k = np.append(np.array(y_train[i]), j)  # label first, then the features
            f.write(','.join([str(s) for s in k]) + '\n')
    with open('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-testing.csv', 'w') as f:
        for i, j in enumerate(X_test):
            k = np.append(np.array(y_test[i]), j)
            f.write(','.join([str(s) for s in k]) + '\n')
buildWalmartData()
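
# For reference, the label-first CSV writing above could also be done with pandas;
# a sketch (assumes the same layout: label in column 0, features after it):
def saveLabeledCsv(path, y, X):
    # stack the label column in front of the feature columns and write the
    # result without header or index, matching the format produced above
    pd.DataFrame(np.column_stack([y, X])).to_csv(path, header=False, index=False)
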
# function convertOnehot takes in data and converts to tensorflow oneHot
# The corresponding labels in Wallmat TripType are numbers between 1 and 38, describing
# which trip is taken. We have already converted the labels to a one-hot vector, which is a 
# vector that is 0 in most dimensions, and 1 in a single dimension. In this case, the nth triptype
# will be represented as a vector which is 1 in the nth dimensions. 
def convertOneHot(data):
    y = np.array([int(i[0]) for i in data])
    y_onehot = [0]*len(y)
    for i,j in enumerate(y):
        y_onehot[i]=[0]*(y.max()+1)
        y_onehot[i][j] = 1
    return (y, y_onehot)
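
# A vectorized alternative to the loop above; a sketch, assuming the labels in
# column 0 are non-negative integers:
def convertOneHotVectorized(data):
    y = np.array([int(i[0]) for i in data])
    # indexing the rows of an identity matrix by label yields one row of the
    # one-hot matrix per example
    return y, np.eye(y.max() + 1)[y]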

# import training data
data = genfromtxt('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-training.csv', delimiter=',') 

# import testing data
test_data = genfromtxt('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-testing.csv', delimiter=',')

x_train = np.array([i[1::] for i in data])

# example output for x_train:
#array([[  7.06940000e+04,   5.00000000e+00,   7.91005185e+09,
#          1.00000000e+00,   8.00000000e+00,   2.15000000e+02],
#       [  1.54653000e+05,   4.00000000e+00,   5.20001225e+09,
#          1.00000000e+00,   5.00000000e+00,   4.60700000e+03],
#       [  1.86178000e+05,   3.00000000e+00,   4.32136106e+09,
#         -1.00000000e+00,   5.00000000e+01,   1.90000000e+03],

y_train, y_train_onehot = convertOneHot(data)

x_test = np.array([i[1::] for i in test_data])
y_test, y_test_onehot = convertOneHot(test_data)
# example y_test output
#array([ 5, 32, 24, ..., 31, 28,  5])

# and example y_test_onehot:
#[0,...
# 0,
# 0,
# 0,
# 0,
# 0,
# 0,
# 1,
# 0,
# 0,
# 0,
# 0,
# 0]


# A is the number of features, 6 in the Walmart data
# B is the width of the one-hot vectors: y.max()+1 = 39 here, since labels run 1 to 38
A = data.shape[1]-1
B = len(y_train_onehot[0])
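
# quick sanity check on the derived dimensions; expect A = 6 features and
# B = 39 one-hot columns for labels 1..38
print 'A:', A, 'B:', B
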
tf_in = tf.placeholder('float', [None, A]) # features
tf_weight = tf.Variable(tf.zeros([A,B]))
tf_bias = tf.Variable(tf.zeros([B]))
tf_softmax = tf.nn.softmax(tf.matmul(tf_in, tf_weight) + tf_bias)

# training via backpropagation
tf_softmax_correct = tf.placeholder('float', [None, B])
tf_cross_entropy = -tf.reduce_sum(tf_softmax_correct * tf.log(tf_softmax))

# training using tf.train.GradientDescentOptimizer
tf_train_step = tf.train.GradientDescentOptimizer(0.01).minimize(tf_cross_entropy)

# add accuracy nodes
tf_correct_prediction = tf.equal(tf.argmax(tf_softmax, 1), tf.argmax(tf_softmax_correct, 1))
tf_accuracy = tf.reduce_mean(tf.cast(tf_correct_prediction, 'float'))


# initialize and run
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)


# running the training
for i in range(20):
    sess.run(tf_train_step, feed_dict={tf_in: x_train, tf_softmax_correct: y_train_onehot})
    # print test accuracy after each training step
    result = sess.run(tf_accuracy, feed_dict={tf_in: x_test, tf_softmax_correct: y_test_onehot})
    print "Run {},{}".format(i, result)

Any thoughts on what might be going wrong here, and why the runs would degenerate like this, would be greatly appreciated. Thanks!

– datawrestler
  • This question looks *really* broad and I'd be surprised if anybody is able to help you. – Ross Nov 27 '15 at 03:54
  • See if colah and my answers to http://stackoverflow.com/questions/33641799/why-does-tensorflow-example-fail-when-increasing-batch-size help you out. – dga Nov 27 '15 at 04:45

1 Answer


If you just want something up and running quickly for a Kaggle competition, I would suggest trying out the examples in TFLearn first. There are embedding_ops for one-hot encoding, examples for early stopping and custom decay, and, more importantly, for the kind of multi-class classification/regression you are dealing with. Once you are more familiar with TensorFlow, it would be fairly easy to insert TensorFlow code to build the custom model you want (there are also examples for this). A rough sketch of what that might look like follows.
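
For illustration, here is a minimal sketch of a multi-class model in TFLearn; the layer sizes, optimizer settings, and epoch count are assumptions rather than tuned values, and the exact API may differ between versions:

import tflearn
from tflearn.data_utils import to_categorical

# labels run 1..38, so use 39 one-hot columns to keep the indices aligned
Y_train = to_categorical(y_train, nb_classes=39)

net = tflearn.input_data(shape=[None, 6])                    # 6 numeric features
net = tflearn.fully_connected(net, 39, activation='softmax') # one unit per class
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.01,
                         loss='categorical_crossentropy')

model = tflearn.DNN(net)
model.fit(x_train, Y_train, n_epoch=20, validation_set=0.1, show_metric=True)

model.predict would then give per-class probabilities for new rows.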

– Yuan Tang