
I have data with non rectangular shape like this:

import numpy as np
import tensorflow as tf

samples_train = {'data': [np.array([[1, 1]]),
                          np.array([[1, 1], [2, 2]]),
                          np.array([[1, 1], [2, 2], [3, 3]])],
                 'labels': [1, 2, 3]}

It's a dict containing a list of arrays of shape [variable, 2].

Since I have a custom training loop, I want to access the data via the keys 'data' and 'labels' (I store additional keys as well), hence the dict format.

I especially do not want to pad them to one common sequence length. (So far I have padded them, and the from_tensor_slices approach below works fine with equal-length padded sequences, but now I need them unpadded.)

If I try:

ds = tf.data.Dataset.from_tensor_slices(samples_train)

I get this error, which makes sense:

ValueError: Can't convert non-rectangular Python sequence to Tensor.

So the answer to this question suggested something like:

ds = tf.data.Dataset.from_generator(
    lambda: iter(zip(samples_train['data'], samples_train['labels'])), 
    output_types=(tf.float32, tf.float32)
)

which works fine, as checked with:

for batch in ds:
    print(batch)

--> output:

(<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[1., 1.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=1.0>)
(<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1., 1.],
       [2., 2.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=2.0>)
(<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[1., 1.],
       [2., 2.],
       [3., 3.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=3.0>)

But this way, I lose my dict keys.

However, I want to be able to access them like this:

for batch in ds:
    print(batch['data'])
    print(batch['labels'])

How can I preserve those dict keys within the dataset?

edited by marc_s, asked by Crysers

1 Answer


You can write a generator function yielding a dictionary, like so:

def my_generator(my_dict):
    # zip the per-key lists so each iteration yields one sample per key
    for data in zip(*[my_dict[key] for key in my_dict]):
        yield {key: d for key, d in zip(my_dict.keys(), data)}

Then set the correct output_types in the from_generator call.

This results in:

>>> ds = tf.data.Dataset.from_generator(
    lambda: my_generator(samples_train),
    output_types={"data": tf.float32, "labels": tf.float32})  
>>> for batch in ds:
      print(batch['data'])
      print(batch['labels'])
tf.Tensor([[1. 1.]], shape=(1, 2), dtype=float32)
tf.Tensor(1.0, shape=(), dtype=float32)
tf.Tensor(
[[1. 1.]
 [2. 2.]], shape=(2, 2), dtype=float32)
tf.Tensor(2.0, shape=(), dtype=float32)
tf.Tensor(
[[1. 1.]
 [2. 2.]
 [3. 3.]], shape=(3, 2), dtype=float32)
tf.Tensor(3.0, shape=(), dtype=float32)
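As a side note, in TF 2.4 and later `output_types` is deprecated in favour of `output_signature`, which also lets you declare the variable-length first dimension explicitly. A minimal sketch, assuming the `samples_train` dict from the question:

```python
import numpy as np
import tensorflow as tf

samples_train = {'data': [np.array([[1, 1]]),
                          np.array([[1, 1], [2, 2]]),
                          np.array([[1, 1], [2, 2], [3, 3]])],
                 'labels': [1, 2, 3]}

def my_generator(my_dict):
    # zip the per-key lists so each iteration yields one sample per key
    for data in zip(*[my_dict[key] for key in my_dict]):
        yield dict(zip(my_dict.keys(), data))

# shape=(None, 2) marks the ragged first dimension as variable
ds = tf.data.Dataset.from_generator(
    lambda: my_generator(samples_train),
    output_signature={
        'data': tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
        'labels': tf.TensorSpec(shape=(), dtype=tf.float32),
    })
```

Iterating `ds` then yields dicts with the original keys and per-sample shapes (1, 2), (2, 2), (3, 2).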
Lescurel
  • 10,749
  • 16
  • 39
  • 1
    This is great, thanks! I just got it to work using `ds.map(parse_fn)` with `parse_fn(*args): return {'data': args[0], 'labels': args[1]}`. But yours is way more elegant! Thanks. – Crysers Jan 19 '21 at 17:16
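The map-based workaround mentioned in the comment can be sketched like this (assuming the tuple-yielding from_generator dataset from the question; the `parse_fn` name comes from the comment):

```python
import numpy as np
import tensorflow as tf

samples_train = {'data': [np.array([[1, 1]]),
                          np.array([[1, 1], [2, 2]]),
                          np.array([[1, 1], [2, 2], [3, 3]])],
                 'labels': [1, 2, 3]}

# Tuple-based dataset, as in the question
ds = tf.data.Dataset.from_generator(
    lambda: zip(samples_train['data'], samples_train['labels']),
    output_types=(tf.float32, tf.float32))

# Re-attach the dict keys after the fact with map()
def parse_fn(*args):
    return {'data': args[0], 'labels': args[1]}

ds = ds.map(parse_fn)
```

Both approaches produce the same element structure; the generator-of-dicts version simply avoids the extra `map` step.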