53

There are plenty of examples of how to create and use TensorFlow datasets, e.g.

dataset = tf.data.Dataset.from_tensor_slices((images, labels))

My question is how to get the data/labels back from the TF dataset in NumPy form. In other words, what would be the reverse operation of the line above? That is, I have a TF dataset and want to get the images and labels back from it.

Valentin

13 Answers

80

In case your tf.data.Dataset is batched, the following code will retrieve all the y labels:

y = np.concatenate([y for x, y in ds], axis=0)

Quick explanation: [y for x, y in ds] is a Python list comprehension. If the dataset is batched, it loops through each batch and collects each batch's y (a 1-D TF tensor) into a list. np.concatenate then takes this list of 1-D tensors (implicitly converting them to NumPy arrays) and joins them along axis 0 to produce a single long vector. In short, it converts a bunch of short 1-D vectors into one long vector.

Note: if your y is more complex, this answer will need some minor modification.
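If you need both x and y, one option is to collect the batches in a single pass and concatenate each list afterwards. This is a minimal sketch, not the answer author's code; the toy arrays and the batch size of 3 are assumptions made only for illustration.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in data (hypothetical): 10 "images" of shape 4x4 with integer labels.
images = np.arange(10 * 4 * 4, dtype=np.float32).reshape(10, 4, 4)
labels = np.arange(10, dtype=np.int64)
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(3)

# Iterate over the dataset once, keeping every batch as a NumPy array.
x_batches, y_batches = [], []
for x, y in ds:
    x_batches.append(x.numpy())
    y_batches.append(y.numpy())

# Join the batches along the batch axis to recover the full arrays.
x_all = np.concatenate(x_batches, axis=0)
y_all = np.concatenate(y_batches, axis=0)
print(x_all.shape, y_all.shape)
```

Iterating once matters when the dataset shuffles on each pass: two separate comprehensions could return x and y in different orders.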

Warlax56
kawingkelvin
  • 4
    Elegant and pythonic! +1 – Tim Mironov Jan 17 '21 at 22:11
  • @TimMironov Thanks. I could also have used _ for the x in that one-liner. Actually, I think there's a downside if you want to extract both x and y. I haven't yet figured out if you can do it in a similar one-liner. – kawingkelvin Jan 20 '21 at 23:26
  • As a beginner python user this answer is extremely opaque to me. Not saying it's a bad answer but I think it could use a bit more context or explanation. – Jacob Waters May 07 '22 at 23:57
  • It's black magic to me but it works great – Jacob Waters May 08 '22 at 00:14
  • 1
    Couple of people want explanation. I have updated this. The key is to know what list comprehension is, and read numpy concatenate documentation. It is by no means black magic compared to other stuff. – kawingkelvin May 08 '22 at 18:57
26

Supposing our tf.data.Dataset is called train_dataset, with eager execution on (the default in TF 2.x), you can retrieve images and labels like this:

for images, labels in train_dataset.take(1):  # only take first element of dataset
    numpy_images = images.numpy()
    numpy_labels = labels.numpy()
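  • the .numpy() method converts tf.Tensors into NumPy arrays
  • to retrieve more elements of the dataset, increase the number passed to take(). To iterate over all elements, pass -1. Note that the loop above overwrites numpy_images and numpy_labels on each iteration, so append them to a list if you need to keep every element.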
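Whether take(1) gives you one example or one batch depends on whether the dataset was batched. The sketch below illustrates the difference; the zero-filled data and the batch size of 4 are assumptions chosen for the example.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 8 examples with 2 features each.
data = np.zeros((8, 2), dtype=np.float32)
targets = np.arange(8)

unbatched = tf.data.Dataset.from_tensor_slices((data, targets))
batched = unbatched.batch(4)

# On an unbatched dataset, take(1) yields a single example.
for x, y in unbatched.take(1):
    single_shape = x.numpy().shape

# On a batched dataset, take(1) yields a whole batch.
for x, y in batched.take(1):
    batch_shape = x.numpy().shape

print(single_shape, batch_shape)
```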
Tommaso Di Noto
  • 6
    It should be noted that this method will return ```count``` batches of images in some cases, instead of individual images. – Mr. Duhart Jul 27 '20 at 20:25
11

If you are OK with keeping the images and labels as tf.Tensors, you can do

images, labels = tuple(zip(*dataset))

Think of the effect of the dataset as zip(images, labels). When we want to get images and labels back, we can simply unzip it.

If you need the numpy array version, convert them using np.array():

images = np.array(images)
labels = np.array(labels)
happymacaron
  • This caused my program to crash on a dataset with ~20,000 images and 12GB of RAM. – Jacob Waters May 08 '22 at 00:09
  • Do you need the data all at once? If not, it may be a good idea to load them in batches. – happymacaron May 09 '22 at 05:39
  • Thanks! Putting `*` and `zip` consecutively seems to resolve the error: `(images,), (labels,) = zip(*training_batches.take(1))` It removes this error for me: `ValueError: not enough values to unpack (expected 2, got 1)` – Shahrokh Bah Aug 08 '22 at 10:32
8

I think we get a good example here:

https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb#scrollTo=BC4pEXtkp4K-

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# mnist_train is a tf.data.Dataset
mnist_train = tfds.load(name="mnist", split=tfds.Split.TRAIN)
assert isinstance(mnist_train, tf.data.Dataset)

mnist_example, = mnist_train.take(1)
image, label = mnist_example["image"], mnist_example["label"]

plt.imshow(image.numpy()[:, :, 0].astype(np.float32), cmap=plt.get_cmap("gray"))
print("Label: %d" % label.numpy())

So each individual component of the dataset can be accessed sort of like a dictionary. Presumably different datasets have different field names (Boston housing won't have 'image' and 'label', but might have 'features' and 'target' or 'price'):

cnn = tfds.load(name="cnn_dailymail", split=tfds.Split.TRAIN)
assert isinstance(cnn, tf.data.Dataset)
cnn_ex, = cnn.take(1)
print(cnn_ex)

This prints a dict with keys such as 'article' and 'highlights', whose values are string tensors.

Dylan
6

You can use the tf.data.Dataset method unbatch() to unbatch the dataset; then you can easily retrieve the data and the labels from it:

ds_labels=[]
for images, labels in ds.unbatch():
    ds_labels.append(labels) # or labels.numpy().argmax() for int labels

Or in one line:

ds_labels = [labels for _, labels in ds.unbatch()]
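As a rough illustration of the argmax remark above, the following sketch recovers integer class ids from one-hot labels after unbatching. The feature array, the class ids, and the batch size are made up for the example.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 10 examples, 3 classes, one-hot encoded labels.
class_ids = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
one_hot = np.eye(3, dtype=np.float32)[class_ids]
features = np.random.rand(10, 5).astype(np.float32)

ds = tf.data.Dataset.from_tensor_slices((features, one_hot)).batch(4)

# unbatch() yields one example at a time, so each label is a length-3 vector.
recovered = np.array([label.numpy().argmax() for _, label in ds.unbatch()])
print(recovered)
```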
Youcef4k
  • 1
    Why is unbatching it necessary? Can't I just iterate the batches? If I do I get a `BatchDataset` object, which doesn't act like a tensor at all. – starbeamrainbowlabs Sep 15 '22 at 16:45
1

Here is my own solution to the problem:

def dataset2numpy(dataset, steps=1):
    "Helper function to get data/labels back from TF dataset"
    iterator = dataset.make_one_shot_iterator()
    next_val = iterator.get_next()
    with tf.Session() as sess:
        for _ in range(steps):
            inputs, labels = sess.run(next_val)
            yield inputs, labels

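Please note that this function yields the inputs/labels of one dataset batch per step; steps controls how many batches are taken from the dataset. It relies on the TF 1.x graph-mode API (tf.Session and make_one_shot_iterator), which is no longer part of the top-level TF 2.x API.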
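For TF 2.x with eager execution, a roughly equivalent generator can be sketched with take() and as_numpy_iterator(). The function name dataset_to_numpy and the toy dataset are hypothetical, and this is one possible translation rather than the original author's code.

```python
import numpy as np
import tensorflow as tf

def dataset_to_numpy(dataset, steps=1):
    """Yield (inputs, labels) as NumPy arrays for the first `steps` elements."""
    # as_numpy_iterator() converts every tensor in an element to a NumPy array.
    for inputs, labels in dataset.take(steps).as_numpy_iterator():
        yield inputs, labels

# Hypothetical toy dataset: 10 examples batched in groups of 4.
toy = tf.data.Dataset.from_tensor_slices(
    (np.ones((10, 3), dtype=np.float32), np.arange(10))
).batch(4)

batches = list(dataset_to_numpy(toy, steps=2))
print(len(batches), batches[0][0].shape, batches[0][1])
```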

Valentin
1

This worked for me

features = np.array([list(x[0].numpy()) for x in list(ds_test)])
labels = np.array([x[1].numpy() for x in list(ds_test)])



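# NOTE: ds_test was created as follows. as_supervised=True makes each element
# a (features, label) tuple, which the x[0] / x[1] indexing above relies on.
iris, iris_info = tfds.load('iris', with_info=True, as_supervised=True)
ds_orig = iris['train']
ds_orig = ds_orig.shuffle(150, reshuffle_each_iteration=False)
ds_train = ds_orig.take(100)
ds_test = ds_orig.skip(100)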
Sourcerer
1

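You can use the map function to split the dataset into an images dataset and a labels dataset (both results are still tf.data.Dataset objects, not NumPy arrays).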

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map

images = dataset.map(lambda images, labels: images)
labels = dataset.map(lambda images, labels: labels)
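Because map returns datasets, one more step is needed to get NumPy arrays. A minimal sketch, assuming an unbatched dataset small enough to fit in memory (the toy arrays here are invented for the example):

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 6 "images" of shape 2x2 with integer labels.
images = np.arange(24, dtype=np.float32).reshape(6, 2, 2)
labels = np.array([0, 1, 0, 1, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# Split the dataset into two single-component datasets.
images_ds = dataset.map(lambda image, label: image)
labels_ds = dataset.map(lambda image, label: label)

# Materialize each dataset as one NumPy array.
images_np = np.stack(list(images_ds.as_numpy_iterator()))
labels_np = np.array(list(labels_ds.as_numpy_iterator()))
print(images_np.shape, labels_np.shape)
```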
Mutlu Simsek
0
import numpy as np
import tensorflow as tf

batched_features = tf.constant([[[1, 3], [2, 3]],
                                [[2, 1], [1, 2]],
                                [[3, 3], [3, 2]]], shape=(3, 2, 2))
batched_labels = tf.constant([[0, 0],
                              [1, 1],
                              [0, 1]], shape=(3, 2, 1))
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
classes = np.concatenate([y for x, y in dataset], axis=0)
unique = np.unique(classes, return_counts=True)
labels_dict = dict(zip(unique[0], unique[1]))
print(classes)
print(labels_dict)
# {0: 3, 1: 3}
XerCis
  • 2
    While this might answer the question, if possible you should [edit] your answer to include a short explanation of *how* this code block answers the question. This helps to provide context, and makes your answer much more useful for future readers. – Hoppeduppeanut Jul 19 '21 at 06:59
0

tf.data.Dataset.get_single_element() is now available and can be used to extract data and labels from a dataset.

This avoids having to create and consume an iterator with .map() or iter() (which can be costly for big datasets).

get_single_element() returns a tensor (or a tuple or dict of tensors) containing all the members of the dataset. The dataset must contain exactly one element, so first batch all of its members into a single element.

This can be used to get features as a tensor-array, or features and labels as a tuple or dictionary (of tensor-arrays) depending upon how the original dataset was created.

Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.
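A small sketch of the idea, assuming a recent TF 2.x release in which get_single_element() is a Dataset method and a dataset whose length is known, so that len(dataset) works. The toy arrays are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 5 examples with 3 features each.
features = np.arange(15, dtype=np.float32).reshape(5, 3)
labels = np.array([1, 0, 1, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Batch everything into a single element, then pull that element out.
whole = dataset.batch(len(dataset))
features_t, labels_t = whole.get_single_element()

# The results are tensors; .numpy() converts them to NumPy arrays.
features_np, labels_np = features_t.numpy(), labels_t.numpy()
print(features_np.shape, labels_np.shape)
```

This loads the whole dataset into memory at once, so it only suits datasets that fit in RAM.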

manisar
0

https://www.tensorflow.org/tutorials/images/classification

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
# train_ds and class_names are defined earlier in the linked tutorial
for images, labels in train_ds.take(1):
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")
Imran
0

Solution that worked for me (not reported, as of now):

Let's say I have a dataset named 'dataset'.

To iterate over the batches in the dataset:

dataset.as_numpy_iterator()

To get a list of all batches in the dataset:

list(dataset.as_numpy_iterator())

To get the first batch in the dataset (as a (data, labels) pair):

list(dataset.as_numpy_iterator())[0]

To get the 'labels' from the first batch in the dataset:

list(dataset.as_numpy_iterator())[0][1]

And so on ..

0

For tensorflow = 2.12.0 and text dataset

Load dataset

import tensorflow_datasets as tfds
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews', with_info=True,
    split=['train', 'test'], data_dir="your_dir\\tensorflow_datasets\\")

Extracting data and label

for example in ds_train.take(5):  # each element is a dict with "text" and "label"
    print(ds_info.features['label'].int2str(example["label"].numpy()))
    print(example["text"].numpy())
ZKS