If you want to keep the shape of your future tensors and avoid padding, I suggest popping the keys with variable list lengths and wrapping them with tf.ragged.constant()
in a new dictionary.
In your case:
import numpy as np
import tensorflow as tf

t_dic = {"uuid": np.array(["abc", "def", "ghi", "pqr"]),
         "a": [np.array([1, 2, 3]),
               np.array([6, 2, 3]),
               np.array([6, 8, 1]),
               np.array([6, 2, 3, 10])],
         "b": [np.array(["a", "f", "f"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs", "ss"])]}
key_a = t_dic.pop("a") # popping "a" from t_dic
key_b = t_dic.pop("b") # popping "b" from t_dic
ragged_features = {"a": tf.ragged.constant(key_a), "b": tf.ragged.constant(key_b)} # creating a new dictionary with tf.ragged values of "a" and "b"
preprocessed_data = t_dic | ragged_features # merging the two dictionaries (the dict union operator | needs Python 3.9+)
x = tf.data.Dataset.from_tensor_slices(preprocessed_data) # transforming in the desired output
What I also found useful is to create a MapDataset from your x via .map():
x2 = x.map(lambda x: {
"uuid": x["uuid"],
"a": x["a"],
"b": x["b"]
})
The output x2 can be iterated over, batched and mapped, e.g.:
import pprint

for elem in x2.take(3).as_numpy_iterator():
    pprint.pprint(elem)
x2.element_spec # useful to check whether the shapes are what you want; here 'None' means variable length
with an output of:
{'a': array([1, 2, 3]),
'b': array([b'a', b'f', b'f'], dtype=object),
'uuid': b'abc'}
{'a': array([6, 2, 3]),
'b': array([b'aa', b'ff', b'fs'], dtype=object),
'uuid': b'def'}
{'a': array([6, 8, 1]),
'b': array([b'aa', b'ff', b'fs'], dtype=object),
'uuid': b'ghi'}
{'uuid': TensorSpec(shape=(), dtype=tf.string, name=None),
'a': TensorSpec(shape=(None,), dtype=tf.int32, name=None),
'b': TensorSpec(shape=(None,), dtype=tf.string, name=None)}
Lastly, if you batch the tf.data.Dataset, keep in mind that you might need to use dense_to_ragged_batch()
, like so:
x2_batched = x2.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))
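To sanity-check the batched result, here is a minimal self-contained sketch (assuming TensorFlow 2.x; it rebuilds the same dataset as above, so the variable names are just for illustration):

```python
import numpy as np
import tensorflow as tf

# Rebuild the dataset from the steps above.
preprocessed_data = {
    "uuid": np.array(["abc", "def", "ghi", "pqr"]),
    "a": tf.ragged.constant([[1, 2, 3], [6, 2, 3], [6, 8, 1], [6, 2, 3, 10]]),
    "b": tf.ragged.constant([["a", "f", "f"], ["aa", "ff", "fs"],
                             ["aa", "ff", "fs"], ["aa", "ff", "fs", "ss"]]),
}
x2 = tf.data.Dataset.from_tensor_slices(preprocessed_data)

# dense_to_ragged_batch keeps the variable-length rows as RaggedTensors
# instead of padding them to a common length.
x2_batched = x2.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))

first_batch = next(iter(x2_batched))
print(type(first_batch["a"]))      # a RaggedTensor, not a padded dense tensor
print(first_batch["a"].to_list())  # [[1, 2, 3], [6, 2, 3]]
```

Note that the second batch pairs the length-3 row with the length-4 row without any padding, which is exactly what the ragged batching buys you.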
Links: batching ragged tensors