If you want to keep the shape of your future tensors and avoid padding, I suggest popping the keys with variable list lengths and wrapping them with tf.ragged.constant()
in a new dictionary.
In your case:
import numpy as np
import tensorflow as tf

t_dic = {"uuid": np.array(["abc", "def", "ghi", "pqr"]),
         "a": [np.array([1, 2, 3]),
               np.array([6, 2, 3]),
               np.array([6, 8, 1]),
               np.array([6, 2, 3, 10])],
         "b": [np.array(["a", "f", "f"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs", "ss"])]}
key_a = t_dic.pop("a") # popping "a" from t_dic
key_b = t_dic.pop("b") # popping "b" from t_dic
ragged_features = {"a": tf.ragged.constant(key_a), "b": tf.ragged.constant(key_b)} # creating a new dictionary with tf.ragged values of "a" and "b"
preprocessed_data = t_dic | ragged_features # merging the two dictionaries (the dict union operator | needs Python 3.9+)
x = tf.data.Dataset.from_tensor_slices(preprocessed_data) # transforming in the desired output
What I also found useful is to create a MapDataset from your x via .map():
x2 = x.map(lambda x: {
"uuid": x["uuid"],
"a": x["a"],
"b": x["b"]
})
The output x2 can be iterated over, batched and mapped, e.g.:
import pprint

for elem in x2.take(3).as_numpy_iterator():
    pprint.pprint(elem)
x2.element_spec # useful to check whether the shapes are what you want; here 'None' means variable length
with an output of:
{'a': array([1, 2, 3]),
'b': array([b'a', b'f', b'f'], dtype=object),
'uuid': b'abc'}
{'a': array([6, 2, 3]),
'b': array([b'aa', b'ff', b'fs'], dtype=object),
'uuid': b'def'}
{'a': array([6, 8, 1]),
'b': array([b'aa', b'ff', b'fs'], dtype=object),
'uuid': b'ghi'}
{'uuid': TensorSpec(shape=(), dtype=tf.string, name=None),
'a': TensorSpec(shape=(None,), dtype=tf.int32, name=None),
'b': TensorSpec(shape=(None,), dtype=tf.string, name=None)}
Lastly, if you batch the tf.data.Dataset, keep in mind that you might need to use dense_to_ragged_batch()
, like so:
x2_batched = x2.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))
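To sanity-check the batched result, here is a minimal self-contained sketch (assuming TensorFlow 2.x; it rebuilds the same dataset as above, so the variable names are just for illustration):

```python
import numpy as np
import tensorflow as tf

# Rebuild the dataset from the steps above.
preprocessed_data = {
    "uuid": np.array(["abc", "def", "ghi", "pqr"]),
    "a": tf.ragged.constant([[1, 2, 3], [6, 2, 3], [6, 8, 1], [6, 2, 3, 10]]),
    "b": tf.ragged.constant([["a", "f", "f"], ["aa", "ff", "fs"],
                             ["aa", "ff", "fs"], ["aa", "ff", "fs", "ss"]]),
}
x2 = tf.data.Dataset.from_tensor_slices(preprocessed_data)

# dense_to_ragged_batch keeps the variable-length rows as RaggedTensors
# instead of padding them to a common length.
x2_batched = x2.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))

first_batch = next(iter(x2_batched))
print(type(first_batch["a"]))      # a RaggedTensor, not a padded dense tensor
print(first_batch["a"].to_list())  # [[1, 2, 3], [6, 2, 3]]
```

Note that the second batch pairs the length-3 row with the length-4 row without any padding, which is exactly what the ragged batching buys you.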
Links: batching ragged tensors