There is a small snippet about loading sparse data, but I have no idea how to use it:

"SparseTensors don't play well with queues. If you use SparseTensors you have to decode the string records using tf.parse_example after batching (instead of using tf.parse_single_example before batching)."

I guess I don't really get how the data is loaded.
The data I want to load is in the SVM Light format.
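For reference, each line in that format is a label followed by index:value pairs for the populated features only, e.g. (made-up values):

    1 3:0.5 12:1.0 358:2.5
    0 7:1.0 12:0.3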
My plan is to convert the training set to the TFRecords file format and then load the converted data with TensorFlow. The problem is that I don't know how I am supposed to format my data so that TensorFlow parses it as SparseTensors.
Here is a snippet extracted from one of the examples available on GitHub:
def convert_to(images, labels, name):
    num_examples = labels.shape[0]
    if images.shape[0] != num_examples:
        raise ValueError("Images size %d does not match label size %d." %
                         (images.shape[0], num_examples))
    rows = images.shape[1]
    cols = images.shape[2]
    depth = images.shape[3]
    filename = os.path.join(FLAGS.directory, name + '.tfrecords')
    print('Writing', filename)
    writer = tf.python_io.TFRecordWriter(filename)
    for index in range(num_examples):
        image_raw = images[index].tostring()
        example = tf.train.Example(features=tf.train.Features(feature={
            'height': _int64_feature(rows),
            'width': _int64_feature(cols),
            'depth': _int64_feature(depth),
            'label': _int64_feature(int(labels[index])),
            'image_raw': _bytes_feature(image_raw)}))
        writer.write(example.SerializeToString())
    writer.close()
It encodes the image data as one big blob. The difference with my data is that not every feature is populated. I could persist my data the same way, but I am unsure this is the intended way to use the features. It might not matter, since I will be decoding things on the other end anyway, but is there a better way to do this for sparse data?
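One idea would be to stop using one named feature per populated index and instead store each row as two parallel lists, the indices and the values. Here is a minimal sketch of what I mean; the feature names indices and values are my own, not something from the examples:

import tensorflow as tf

def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def _float_list_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# A single SVM Light row "1 3:0.5 12:1.0 358:2.5" stored as parallel
# index/value lists instead of one named feature per populated index.
example = tf.train.Example(features=tf.train.Features(feature={
    'label': _int64_list_feature([1]),
    'indices': _int64_list_feature([3, 12, 358]),
    'values': _float_list_feature([0.5, 1.0, 2.5]),
}))
serialized = example.SerializeToString()  # ready for TFRecordWriter.write()

That would keep every example self-describing, no matter which features are populated.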
As for reading, here is an example that reads dense tensor data.
I understand that I am supposed to swap tf.parse_single_example for tf.parse_example and do it after batching.
However, how do I tell TensorFlow that my data is sparse? How do I associate the feature indexes I have with the feature values in the tensor? How can I do batching before I have even loaded the data?
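From reading the tf.parse_example documentation, tf.SparseFeature looks like it is meant to answer the second question: it pairs an index feature with a value feature and produces a single SparseTensor. Here is a sketch of my current understanding of the whole batch-then-parse pattern, untested, assuming the indices/values encoding above, that tf.SparseFeature exists in your TensorFlow version, and a made-up total dimensionality NUM_FEATURES:

import tensorflow as tf

NUM_FEATURES = 100000  # placeholder: total number of possible features

filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

# Batch the raw serialized strings first (this answers the batching
# question: we batch the undecoded records, not the parsed tensors)...
serialized_batch = tf.train.batch([serialized], batch_size=32)

# ...then parse the whole batch at once. SparseFeature ties the
# 'indices' feature to the 'values' feature and yields one SparseTensor
# of dense shape [batch_size, NUM_FEATURES].
parsed = tf.parse_example(serialized_batch, features={
    'label': tf.FixedLenFeature([], tf.int64),
    'sparse': tf.SparseFeature(index_key='indices',
                               value_key='values',
                               dtype=tf.float32,
                               size=NUM_FEATURES),
})

Does that look right?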
EDIT 1:
Here is what I tried; I get a "ValueError: Shape () must have rank 1" error:
from tqdm import tqdm

def convert_to_tensor_file(path, out_file_name):
    feature_set = set()
    filename = os.path.join(FLAGS.directory, out_file_name + '.tfrecords')
    writer = tf.python_io.TFRecordWriter(filename)
    with open(path, 'r') as f:
        for line in tqdm(f):
            data = line.strip().split(' ')
            # The first token is the label; the rest are index:value pairs.
            features = {
                "label": _int64_feature(int(data[0]))
            }
            for feature in data[1:]:
                index, value = feature.split(':')
                feature_set.add(index)
                # One named feature per populated index.
                features[index] = _int64_feature(int(value))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(example.SerializeToString())
    writer.close()
    return feature_set

feature_set = convert_to_tensor_file(TRAIN, 'train')
feature_set = convert_to_tensor_file(TRAIN, 'train')
def load_tensor_file(name):
    filename = os.path.join(FLAGS.directory, name + '.tfrecords')
    features = {
        'label': tf.FixedLenFeature([], tf.int64),
    }
    # One VarLenFeature per index seen during conversion.
    for feature in feature_set:
        features[feature] = tf.VarLenFeature(tf.int64)
    with tf.name_scope('input'):
        filename_queue = tf.train.string_input_producer([filename])
        reader = tf.TFRecordReader()
        _, serialized_example = reader.read(filename_queue)
        # The ValueError seems to come from this line.
        features = tf.parse_example(serialized_example, features=features)
load_tensor_file('train')
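After staring at the error some more, my guess is that reader.read() returns a single scalar string (shape ()), while tf.parse_example expects a rank-1 vector of serialized records, which would explain "Shape () must have rank 1". If that is right, batching before parsing should fix it, e.g. replacing the last line of load_tensor_file with something like (untested):

serialized_batch = tf.train.batch([serialized_example], batch_size=128)
parsed = tf.parse_example(serialized_batch, features=features)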
Thank you,