
I have tokenized data in the form of a list of unequally shaped arrays:

array([array([1179,    6,  208,    2, 1625,   92,    9, 3870,    3, 2136,  435,
          5, 2453, 2180,   44,    1,  226,  166,    3, 4409,   49, 6728,
         ...
         10,   17, 1396,  106, 8002, 7968,  111,   33, 1130,   60,  181,
       7988, 7974, 7970])], dtype=object)

With their respective targets:

Out[74]: array([0, 0, 0, ..., 0, 0, 1], dtype=object)

I'm trying to turn them into a padded tf.data.Dataset, but TensorFlow won't convert arrays of unequal length to a tensor. I get this error:

ValueError: Can't convert non-rectangular Python sequence to Tensor.

The full code is here. Assume that my starting point is after y = ...:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

(train_data, test_data) = tfds.load('imdb_reviews/subwords8k',
                                    split=(tfds.Split.TRAIN, tfds.Split.TEST),
                                    as_supervised=True)

x = np.array(list(train_data.as_numpy_iterator()))[:, 0]
y = np.array(list(train_data.as_numpy_iterator()))[:, 1]


train_tensor = tf.data.Dataset.from_tensor_slices((x.tolist(), y))\
    .padded_batch(batch_size=8, padded_shapes=([None], ()))

What are my options to turn this into a padded batch tensor?

today
Nicolas Gervais

1 Answer


If your data is stored in NumPy arrays or Python lists, you can use the tf.data.Dataset.from_generator method to create the dataset and then pad the batches:

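# Yield one (tokens, label) pair at a time, so sequences may differ in length
train_batches = tf.data.Dataset.from_generator(
    lambda: zip(x, y),
    output_types=(tf.int64, tf.int64)
).padded_batch(
    batch_size=32,
    # Pad each sequence to the longest one in its batch; labels are scalars
    padded_shapes=([None], ())
)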
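For reference, here is a minimal self-contained sketch of the same idea on toy data. The arrays `x` and `y` and the helper `gen` are made up for illustration, and it assumes TensorFlow 2.4 or later, where `from_generator` accepts `output_signature` (the newer alternative to `output_types`):

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data standing in for the tokenized reviews and their labels.
x = [np.array([1, 2, 3], dtype=np.int64),
     np.array([4, 5], dtype=np.int64),
     np.array([6], dtype=np.int64)]
y = np.array([0, 1, 0], dtype=np.int64)

def gen():
    # Yield one (tokens, label) pair at a time; tokens may have any length.
    for tokens, label in zip(x, y):
        yield tokens, label

train_batches = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int64),  # variable-length sequence
        tf.TensorSpec(shape=(), dtype=tf.int64),       # scalar label
    ),
).padded_batch(batch_size=2, padded_shapes=([None], []))

# Inspect the first batch: the shorter sequence is right-padded with zeros.
first_x, first_y = next(iter(train_batches))
print(first_x.numpy())
print(first_y.numpy())
```

Zero is the default padding value; if 0 is a real token id in your vocabulary, you may want to pass `padding_values` to `padded_batch` as well.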

However, if you load the data with tensorflow_datasets.load, there is no need to use as_numpy_iterator to separate the data from the labels and then put them back together in a dataset. That is redundant and inefficient: the objects returned by tensorflow_datasets.load are already instances of tf.data.Dataset, so you can call padded_batch on them directly:

train_batches = train_data.padded_batch(batch_size=32, padded_shapes=([None], []))
test_batches = test_data.padded_batch(batch_size=32, padded_shapes=([None], []))

Note that in TensorFlow 2.2 and above, you no longer need to provide the padded_shapes argument if you just want every axis padded to the longest element in the batch (this is the default behavior).
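As a rough illustration of that default, the sketch below omits `padded_shapes` entirely. It uses made-up toy data and assumes TensorFlow 2.4 or later for `output_signature`. As far as I know, the default only works when the rank of every component is known, which is why the signature declares the sequence shape as `(None,)`; a dataset built with `output_types` alone has unknown rank and would still need `padded_shapes`.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy sequences of lengths 3, 2 and 1, with scalar labels.
x = [np.array([1, 2, 3], dtype=np.int64),
     np.array([4, 5], dtype=np.int64),
     np.array([6], dtype=np.int64)]
y = np.array([0, 1, 0], dtype=np.int64)

dataset = tf.data.Dataset.from_generator(
    lambda: zip(x, y),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int64),  # rank is known, length is not
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)

# No padded_shapes: each batch is padded to its own longest sequence.
batches = dataset.padded_batch(batch_size=2)

# Collect the shape of the token tensor in every batch.
batch_shapes = [tuple(tokens.shape) for tokens, _ in batches]
print(batch_shapes)
```

Because padding is applied per batch, different batches can end up with different sequence lengths.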

today
  • The reasoning for that is when I do a real task, I won't have a convenient TFDS object. It's more likely that I will have a list of list/arrays, and the targets separately. This is why I said "Assume that my starting point is after y = ...". Thanks for informing me of this update. – Nicolas Gervais Apr 21 '20 at 20:04
  • @NicolasGervais Oh, sorry! I did not pay enough attention to that. Please take a look at my updated answer, which now covers the case where your data is stored in NumPy arrays/Python lists. – today Apr 21 '20 at 20:52
  • @NicolasGervais Didn't the solution for Numpy array work for you? – today Apr 23 '20 at 01:53
  • The project is on hold, I'll try this out once it's resumed, and I'll get back to you then. – Nicolas Gervais Apr 23 '20 at 01:55
  • Why don't you participate in the Keras tag anymore? – Nicolas Gervais Mar 21 '21 at 04:32
  • @NicolasGervais Well, one reason is the lack of time on my side, and the other is the lack of good, interesting questions being asked here. Naturally, as ML and TF/Keras have become popular, more and more people are using them, and therefore the percentage of newbie, duplicate, poor or please-debug-it-for-me questions has increased. Both of these reasons have reduced my motivation for answering questions in the Keras tag, and I have stopped monitoring it. That said, from time to time people reach out to me via email with their questions, and I try to help them as much as I can. – today Mar 21 '21 at 11:21
  • @Nicolas Gervais can you help me in this issue: https://stackoverflow.com/questions/74251623/how-to-solve-valueerror-cant-convert-non-rectangular-python-sequence-to-tensor – A_B_Y Oct 30 '22 at 09:39