21

TensorFlow has built in a nice way to store data. It is used, for example, to store the MNIST data in the tutorial:

>>> mnist
<tensorflow.examples.tutorials.mnist.input_data.read_data_sets.<locals>.DataSets object at 0x10f930630>

Suppose we have input and output NumPy arrays:

>>> import numpy as np
>>> x = np.random.normal(0, 1, (100, 10))
>>> y = np.random.randint(0, 2, 100)

How can I transform them into a TF dataset?

I want to be able to use functions like next_batch.

– Donbeo

3 Answers

9

The Dataset object is only part of the MNIST tutorial, not the main TensorFlow library.

You can see where it is defined here:

GitHub Link

The constructor accepts images and labels arguments, so presumably you can pass your own values there.
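
A minimal sketch of that approach. The import path is an assumption: the DataSet class has moved between TensorFlow releases (at the time of this answer it lived in tensorflow/examples/tutorials/mnist/input_data.py; the path below is its TF 1.x contrib location). The constructor was written for image data, so the flags below disable its image-specific behavior:

import numpy as np
import tensorflow as tf
# Assumed TF 1.x location of the tutorial's DataSet class.
from tensorflow.contrib.learn.python.learn.datasets.mnist import DataSet

x = np.random.normal(0, 1, (100, 10)).astype(np.float32)
y = np.random.randint(0, 2, 100)

# reshape=False: x is already 2-D, not a 4-D batch of images.
# dtype=tf.uint8: skips the constructor's divide-by-255 rescaling,
# which only makes sense for uint8 pixel data.
data = DataSet(x, y, reshape=False, dtype=tf.uint8)

batch_x, batch_y = data.next_batch(32)  # the next_batch interface from the question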

– Ian Goodfellow
  • Ok thanks, I suspected this. I think it would be a helpful tool as part of the main library. AFAIK any batch operation on a numpy array requires performing a copy of the data, which may lead to a slower algorithm. – Donbeo Dec 18 '15 at 17:50
  • The philosophy is that TensorFlow should just be a core math library, but other open source libraries can provide additional abstractions used for machine learning. Similar to Theano which has libraries like Pylearn2 built on top. If you want to avoid copy operations you can use the queue-based data access functionality rather than feeding placeholders. – Ian Goodfellow Dec 18 '15 at 17:53
3

Recently, TensorFlow added a feature to its Dataset API to consume NumPy arrays. See here for details.

Here is the snippet that I copied from there:

import numpy as np
import tensorflow as tf

# Load the training data into two NumPy arrays, for example using `np.load()`.
# (Indexing `data["features"]` and using a `with` block both imply an `.npz`
# archive rather than a single-array `.npy` file.)
with np.load("/var/data/training_data.npz") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row of `labels`.
assert features.shape[0] == labels.shape[0]

# Placeholders keep the (potentially large) arrays out of the graph definition.
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
dataset = ...
iterator = dataset.make_initializable_iterator()

sess = tf.Session()
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
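
To actually draw batches from such a dataset (the next_batch-style access the question asks for), you can chain batch() onto it and evaluate the iterator's get_next() op. A minimal sketch, assuming the small arrays from the question; for arrays this small it is fine to embed them directly via from_tensor_slices instead of going through placeholders:

import numpy as np
import tensorflow as tf

x = np.random.normal(0, 1, (100, 10)).astype(np.float32)
y = np.random.randint(0, 2, 100)

# Small arrays can be embedded directly; the placeholder pattern above is
# only needed to keep large arrays out of the graph definition.
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.shuffle(buffer_size=100).batch(32).repeat()

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()  # a (features, labels) pair of tensors

with tf.Session() as sess:
    batch_x, batch_y = sess.run(next_element)  # each run() yields the next batch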
– MajidL
0

As an alternative, you may use the function tf.train.batch() to create batches of your data and at the same time eliminate the use of tf.placeholder. Refer to the documentation for more details.

>>> images = tf.constant(X, dtype=tf.float32) # X is a np.array
>>> labels = tf.constant(y, dtype=tf.int32)   # y is a np.array
>>> batch_images, batch_labels = tf.train.batch([images, labels], batch_size=32, capacity=300, enqueue_many=True)
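
Note that tf.train.batch fills its batches through a background queue, so the batch tensors can only be evaluated after the queue runners are started. A minimal sketch of that boilerplate (the session setup here is my addition, not part of the original answer):

with tf.Session() as sess:
    # Start the threads that fill the batching queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch_x, batch_y = sess.run([batch_images, batch_labels])
    # Shut the queue threads down cleanly.
    coord.request_stop()
    coord.join(threads)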