21

TensorFlow has built in a nice way to store data. It is used, for example, to store the MNIST data in the tutorial:

>>> mnist
<tensorflow.examples.tutorials.mnist.input_data.read_data_sets.<locals>.DataSets object at 0x10f930630>

Suppose we have input and output NumPy arrays:

>>> import numpy as np
>>> x = np.random.normal(0, 1, (100, 10))
>>> y = np.random.randint(0, 2, 100)

How can I transform them into a TF dataset?

I want to be able to use functions like next_batch.

– Donbeo

3 Answers

9

The Dataset object is only part of the MNIST tutorial, not the main TensorFlow library.

You can see where it is defined here:

GitHub Link

The constructor accepts images and labels arguments, so presumably you can pass your own values there.
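
A minimal sketch of that approach. The import path is an assumption: the DataSet class has moved between TensorFlow releases (at the time of this answer it lived in tensorflow/examples/tutorials/mnist/input_data.py; the path below is its TF 1.x contrib location). The constructor was written for image data, so the flags below disable its image-specific behavior:

import numpy as np
import tensorflow as tf
# Assumed TF 1.x location of the tutorial's DataSet class.
from tensorflow.contrib.learn.python.learn.datasets.mnist import DataSet

x = np.random.normal(0, 1, (100, 10)).astype(np.float32)
y = np.random.randint(0, 2, 100)

# reshape=False: x is already 2-D, not a 4-D batch of images.
# dtype=tf.uint8: skips the constructor's divide-by-255 rescaling,
# which only makes sense for uint8 pixel data.
data = DataSet(x, y, reshape=False, dtype=tf.uint8)

batch_x, batch_y = data.next_batch(32)  # the next_batch interface from the question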

– Ian Goodfellow
  • Ok thanks, I suspected this. I think it would be a helpful tool as part of the main library. AFAIK any batch operation on a numpy array requires performing a copy of the data, which may lead to a slower algorithm. – Donbeo Dec 18 '15 at 17:50
  • The philosophy is that TensorFlow should just be a core math library, but other open source libraries can provide additional abstractions used for machine learning. Similar to Theano which has libraries like Pylearn2 built on top. If you want to avoid copy operations you can use the queue-based data access functionality rather than feeding placeholders. – Ian Goodfellow Dec 18 '15 at 17:53
3

Recently, TensorFlow added a feature to its Dataset API to consume NumPy arrays. See here for details.

Here is the snippet that I copied from there:

import numpy as np
import tensorflow as tf

# Load the training data into two NumPy arrays, for example using `np.load()`.
# (Indexing `data["features"]` and using a `with` block both imply an `.npz`
# archive rather than a single-array `.npy` file.)
with np.load("/var/data/training_data.npz") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row of `labels`.
assert features.shape[0] == labels.shape[0]

# Placeholders keep the (potentially large) arrays out of the graph definition.
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
dataset = ...
iterator = dataset.make_initializable_iterator()

sess = tf.Session()
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
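
To actually draw batches from such a dataset (the next_batch-style access the question asks for), you can chain batch() onto it and evaluate the iterator's get_next() op. A minimal sketch, assuming the small arrays from the question; for arrays this small it is fine to embed them directly via from_tensor_slices instead of going through placeholders:

import numpy as np
import tensorflow as tf

x = np.random.normal(0, 1, (100, 10)).astype(np.float32)
y = np.random.randint(0, 2, 100)

# Small arrays can be embedded directly; the placeholder pattern above is
# only needed to keep large arrays out of the graph definition.
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.shuffle(buffer_size=100).batch(32).repeat()

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()  # a (features, labels) pair of tensors

with tf.Session() as sess:
    batch_x, batch_y = sess.run(next_element)  # each run() yields the next batch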
– MajidL
0

As an alternative, you may use the function tf.train.batch() to create batches of your data and at the same time eliminate the use of tf.placeholder. Refer to the documentation for more details.

>>> images = tf.constant(X, dtype=tf.float32) # X is a np.array
>>> labels = tf.constant(y, dtype=tf.int32)   # y is a np.array
>>> batch_images, batch_labels = tf.train.batch([images, labels], batch_size=32, capacity=300, enqueue_many=True)
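
Note that tf.train.batch fills its batches through a background queue, so the batch tensors can only be evaluated after the queue runners are started. A minimal sketch of that boilerplate (the session setup here is my addition, not part of the original answer):

with tf.Session() as sess:
    # Start the threads that fill the batching queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch_x, batch_y = sess.run([batch_images, batch_labels])
    # Shut the queue threads down cleanly.
    coord.request_stop()
    coord.join(threads)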