I have a nested numpy array that I want to feed to an RNN model in TensorFlow/Keras. Predictions will be made at the person level, so that's the first dimension of the array. Each person has 1 or more events (2nd dimension), and each event has 1 or more codes (3rd dimension). In other words, dimensions 2 and 3 have varying lengths.
For the first version of the training code, I loaded all the data into memory and sliced/padded the mini-batches as needed during training. The slicing/padding is done by a Keras Sequence class that processes numpy arrays.
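For reference, the in-memory version works roughly like this (a simplified sketch, not the real training code; the class name, batch logic and use of pad_sequences are only illustrative):

import numpy as np
from tensorflow.keras.utils import Sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences

class PaddedBatchSequence(Sequence):
    """Slices the big object array and pads each mini-batch on the fly."""
    def __init__(self, codes, batch_size):
        self.codes = codes            # object array: persons -> events -> codes
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.codes) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.codes[idx * self.batch_size:(idx + 1) * self.batch_size]
        max_events = max(len(person) for person in batch)
        max_codes = max(len(event) for person in batch for event in person)
        padded = np.zeros((len(batch), max_events, max_codes), dtype=np.int64)
        for i, person in enumerate(batch):
            # pad each event to max_codes, then place the person's events
            # into the fixed-size batch tensor (remaining rows stay zero)
            padded[i, :len(person), :] = pad_sequences(person, maxlen=max_codes,
                                                       padding='post')
        return padded  # targets omitted for brevity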
Now I have far more data to train the model on, so I can no longer load it all into memory. The plan is to save it into several TFRecord files and then load/pad the examples in small batches as needed during training.
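The writer side I have in mind is roughly the following (again just a sketch; the 'codes' feature key and file name are placeholders): one tf.train.SequenceExample per person, with one int64 feature per event.

import tensorflow as tf

def person_to_sequence_example(person_events):
    # one int64 Feature per event, holding that event's codes
    feature_list = tf.train.FeatureList(feature=[
        tf.train.Feature(int64_list=tf.train.Int64List(value=event))
        for event in person_events])
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={'codes': feature_list}))

# e.g. one shard, where codes is the nested object array from the example below
with tf.io.TFRecordWriter('codes_000.tfrecord') as writer:
    for person in codes:
        writer.write(person_to_sequence_example(person).SerializeToString())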
I am using TensorFlow 1.14.0 with Python 3.6.
Since the inner dimensions are of varying length, I have tried to use tf.data.Dataset.from_generator.
Question: how can I fix the minimal example below (if possible)?
import numpy as np
import tensorflow as tf

tf.enable_eager_execution()  # so the dataset can be iterated directly below

# 4 people; each person has 1+ events, each event has 1+ codes (ragged)
codes = np.array([np.array([np.array([527, 38, 734]),
                            np.array([ 4, 935])]),
                  np.array([np.array([810])]),
                  np.array([np.array([315, 802])]),
                  np.array([np.array([317, 29, 861]),
                            np.array([906]),
                            np.array([439, 655, 893, 130])])])

codes_dataset = tf.data.Dataset.from_generator(lambda: codes, (tf.int64, tf.int64))
print(codes_dataset)
# <DatasetV1Adapter shapes: (<unknown>, <unknown>), types: (tf.int64, tf.int64)>

for value in codes_dataset:
    print(value)
codes_dataset is created but the for loop errors out:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-129-d4ed489ff27f> in <module>()
----> 1 for value in codes_dataset:
2 print(value)
/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
584
585 def __next__(self): # For Python 3 compatibility
--> 586 return self.next()
587
588 def _next_internal(self):
/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in next(self)
621 """
622 try:
--> 623 return self._next_internal()
624 except errors.OutOfRangeError:
625 raise StopIteration
/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
613 self._iterator_resource,
614 output_types=self._flat_output_types,
--> 615 output_shapes=self._flat_output_shapes)
616
617 return self._structure._from_compatible_tensor_list(ret) # pylint: disable=protected-access
/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py in iterator_get_next_sync(iterator, output_types, output_shapes, name)
2118 else:
2119 message = e.message
-> 2120 _six.raise_from(_core._status_to_exception(e.code, message), None)
2121 # Add nodes to the TensorFlow graph.
2122 if not isinstance(output_types, (list, tuple)):
/opt/tools/python/anaconda3/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was (tf.int64, tf.int64), but the yielded element was [array([527, 38, 734]) array([ 4, 935])].
Traceback (most recent call last):
File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 520, in generator_py_func
flattened_values = nest.flatten_up_to(output_types, values)
File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/util/nest.py", line 398, in flatten_up_to
assert_shallow_structure(shallow_tree, input_tree)
File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/util/nest.py", line 301, in assert_shallow_structure
"Input has type: %s." % type(input_tree))
TypeError: If shallow structure is a sequence, input must also be a sequence. Input has type: <class 'numpy.ndarray'>.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 209, in __call__
ret = func(*args)
File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 525, in generator_py_func
"element was %s." % (output_types, values))
TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was (tf.int64, tf.int64), but the yielded element was [array([527, 38, 734]) array([ 4, 935])].
[[{{node PyFunc}}]] [Op:IteratorGetNextSync]