I have data with a non-rectangular shape, like this:
import numpy as np
import tensorflow as tf

samples_train = {'data': [np.array([[1, 1]]), np.array([[1, 1], [2, 2]]), np.array([[1, 1], [2, 2], [3, 3]])],
                 'labels': [1, 2, 3]}
It's a dict that contains a list of arrays with shape=[variable, 2].
Since I have a custom training loop, I want to access the data via the keys 'data' and 'labels' (I store additional keys as well), hence the dict format.
I especially do not want to pad them to one common sequence length. (So far I did pad them, and the from_tensor_slices approach shown below works fine with padded, same-length sequences; see the sketch after this paragraph.) But now I need them unpadded.
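For reference, the padded variant that has worked so far looks roughly like this (a minimal sketch using the samples_train dict from above; I'm assuming tf.keras.preprocessing.sequence.pad_sequences here, but the exact padding code is not the point):

# Pad each [variable, 2] array to the longest length -> rectangular (3, 3, 2)
padded = tf.keras.preprocessing.sequence.pad_sequences(
    samples_train['data'], padding='post', dtype='float32')

# With rectangular data, from_tensor_slices accepts the dict directly
ds = tf.data.Dataset.from_tensor_slices({'data': padded,
                                         'labels': samples_train['labels']})

for batch in ds:
    print(batch['data'], batch['labels'])  # dict keys are preserved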
If I try:
ds = tf.data.Dataset.from_tensor_slices(samples_train)
I get this error, which makes sense to some extent:
ValueError: Can't convert non-rectangular Python sequence to Tensor.
So the answer to this question suggested something like:
ds = tf.data.Dataset.from_generator(
    lambda: iter(zip(samples_train['data'], samples_train['labels'])),
    output_types=(tf.float32, tf.float32)
)
which works fine, as I can verify with:
for batch in ds:
    print(batch)
--> output:
(<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[1., 1.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=1.0>)
(<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1., 1.],
       [2., 2.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=2.0>)
(<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[1., 1.],
       [2., 2.],
       [3., 3.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=3.0>)
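(Side note: as far as I know, output_types is deprecated on newer TF versions in favor of output_signature; a sketch of the presumably equivalent call, assuming TF >= 2.4, which still yields plain tuples:)

ds = tf.data.Dataset.from_generator(
    lambda: iter(zip(samples_train['data'], samples_train['labels'])),
    output_signature=(
        # (None, 2): variable-length first dimension per example
        tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
        # scalar label
        tf.TensorSpec(shape=(), dtype=tf.float32),
    )
)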
But either way, I lose my dict keys.
However, I want to be able to access them like this:
for batch in ds:
    print(batch['data'])
    print(batch['labels'])
How can I preserve those dict keys within the dataset?