There is a small snippet about loading sparse data, but I have no idea how to use it:

"SparseTensors don't play well with queues. If you use SparseTensors you have to decode the string records using tf.parse_example after batching (instead of using tf.parse_single_example before batching)."

I guess I don't really get how the data is loaded.
The data I want to load is in the SVM Light format.
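For reference, each line in that format is a label followed by index:value pairs for the populated features only, e.g. (made-up values):

    1 3:0.5 12:1.0 358:2.5
    0 7:1.0 12:0.3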
My plan is to convert the training set to the TFRecords file format and then load the converted data with TensorFlow. The problem is that I don't know how I am supposed to format my data so that TensorFlow parses it as SparseTensors.
Here is a snippet extracted from one of the examples available on GitHub:
def convert_to(images, labels, name):
    num_examples = labels.shape[0]
    if images.shape[0] != num_examples:
        raise ValueError("Images size %d does not match label size %d." %
                         (images.shape[0], num_examples))
    rows = images.shape[1]
    cols = images.shape[2]
    depth = images.shape[3]
    filename = os.path.join(FLAGS.directory, name + '.tfrecords')
    print('Writing', filename)
    writer = tf.python_io.TFRecordWriter(filename)
    for index in range(num_examples):
        image_raw = images[index].tostring()
        example = tf.train.Example(features=tf.train.Features(feature={
            'height': _int64_feature(rows),
            'width': _int64_feature(cols),
            'depth': _int64_feature(depth),
            'label': _int64_feature(int(labels[index])),
            'image_raw': _bytes_feature(image_raw)}))
        writer.write(example.SerializeToString())
    writer.close()
It encodes the image data as one big blob. The difference with my data is that not every feature is populated. I could persist my data the same way, but I am unsure this is the intended way to use the features. It might not matter, since I will be decoding things on the other end anyway, but is there a better way to do this for sparse data?
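One idea would be to stop using one named feature per populated index and instead store each row as two parallel lists, the indices and the values. Here is a minimal sketch of what I mean; the feature names indices and values are my own, not something from the examples:

import tensorflow as tf

def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def _float_list_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# A single SVM Light row "1 3:0.5 12:1.0 358:2.5" stored as parallel
# index/value lists instead of one named feature per populated index.
example = tf.train.Example(features=tf.train.Features(feature={
    'label': _int64_list_feature([1]),
    'indices': _int64_list_feature([3, 12, 358]),
    'values': _float_list_feature([0.5, 1.0, 2.5]),
}))
serialized = example.SerializeToString()  # ready for TFRecordWriter.write()

That would keep every example self-describing, no matter which features are populated.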
As for reading, here is an example that reads dense tensor data.
I understand that I am supposed to swap tf.parse_single_example for tf.parse_example and do it after batching.
However, how do I tell TensorFlow that my data is sparse? How do I associate the feature indexes I have with the feature values in the tensor? How can I do batching before I have even loaded the data?
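From reading the tf.parse_example documentation, tf.SparseFeature looks like it is meant to answer the second question: it pairs an index feature with a value feature and produces a single SparseTensor. Here is a sketch of my current understanding of the whole batch-then-parse pattern, untested, assuming the indices/values encoding above, that tf.SparseFeature exists in your TensorFlow version, and a made-up total dimensionality NUM_FEATURES:

import tensorflow as tf

NUM_FEATURES = 100000  # placeholder: total number of possible features

filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

# Batch the raw serialized strings first (this answers the batching
# question: we batch the undecoded records, not the parsed tensors)...
serialized_batch = tf.train.batch([serialized], batch_size=32)

# ...then parse the whole batch at once. SparseFeature ties the
# 'indices' feature to the 'values' feature and yields one SparseTensor
# of dense shape [batch_size, NUM_FEATURES].
parsed = tf.parse_example(serialized_batch, features={
    'label': tf.FixedLenFeature([], tf.int64),
    'sparse': tf.SparseFeature(index_key='indices',
                               value_key='values',
                               dtype=tf.float32,
                               size=NUM_FEATURES),
})

Does that look right?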
EDIT 1:
Here is what I tried; I get a "ValueError: Shape () must have rank 1" error:
from tqdm import tqdm

def convert_to_tensor_file(path, out_file_name):
    feature_set = set()
    filename = os.path.join(FLAGS.directory, out_file_name + '.tfrecords')
    writer = tf.python_io.TFRecordWriter(filename)
    with open(path, 'r') as f:
        for line in tqdm(f):
            data = line.strip().split(' ')
            # The first token is the label; the rest are index:value pairs.
            features = {
                "label": _int64_feature(int(data[0]))
            }
            for feature in data[1:]:
                index, value = feature.split(':')
                feature_set.add(index)
                # One named feature per populated index.
                features[index] = _int64_feature(int(value))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(example.SerializeToString())
    writer.close()
    return feature_set

feature_set = convert_to_tensor_file(TRAIN, 'train')
feature_set = convert_to_tensor_file(TRAIN, 'train')
def load_tensor_file(name):
    filename = os.path.join(FLAGS.directory, name + '.tfrecords')
    features = {
        'label': tf.FixedLenFeature([], tf.int64),
    }
    # One VarLenFeature per index seen during conversion.
    for feature in feature_set:
        features[feature] = tf.VarLenFeature(tf.int64)
    with tf.name_scope('input'):
        filename_queue = tf.train.string_input_producer([filename])
        reader = tf.TFRecordReader()
        _, serialized_example = reader.read(filename_queue)
        # The ValueError seems to come from this line.
        features = tf.parse_example(serialized_example, features=features)
load_tensor_file('train')
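After staring at the error some more, my guess is that reader.read() returns a single scalar string (shape ()), while tf.parse_example expects a rank-1 vector of serialized records, which would explain "Shape () must have rank 1". If that is right, batching before parsing should fix it, e.g. replacing the last line of load_tensor_file with something like (untested):

serialized_batch = tf.train.batch([serialized_example], batch_size=128)
parsed = tf.parse_example(serialized_batch, features=features)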
Thank you,