I have a nested numpy array that I want to feed to an RNN model in TensorFlow/Keras. Predictions will be made at the person level, so persons form the first dimension of the array. Each person has 1 or more events (2nd dimension), and each event has 1 or more codes (3rd dimension). In other words, dimensions 2 and 3 have varying lengths.

For the first version of the training code, I loaded all the data into memory and sliced/padded the mini-batches as needed during training. The slicing/padding is done in a Keras Sequence class that processes numpy arrays.
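
For reference, that class looks roughly like this (a simplified sketch rather than the exact code; the class name PaddedBatches, the batch size, the zero padding, and the omission of labels are all illustrative):

import numpy as np
from tensorflow.keras.utils import Sequence

class PaddedBatches(Sequence):
    # Sketch: slice the in-memory object array into mini-batches and
    # zero-pad events/codes to the longest lengths within each batch.
    def __init__(self, codes, batch_size=32):
        self.codes = codes
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.codes) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.codes[idx * self.batch_size:(idx + 1) * self.batch_size]
        max_events = max(len(person) for person in batch)
        max_codes = max(len(event) for person in batch for event in person)
        padded = np.zeros((len(batch), max_events, max_codes), dtype=np.int64)
        for i, person in enumerate(batch):
            for j, event in enumerate(person):
                padded[i, j, :len(event)] = event
        return padded  # labels omitted for brevity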

Now I have far more data to train the model, so I cannot load it all into memory. The plan is to save it into several TFRecord files and then load/pad them in small batches as needed during training.
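
The writing side would look something like this (a minimal sketch, assuming one tf.train.SequenceExample per person with one FeatureList entry per event; the feature name "codes" and the output file name are placeholders):

import tensorflow as tf

def person_to_sequence_example(person):
    # One Int64List per event, holding that event's variable-length codes.
    events = [tf.train.Feature(int64_list=tf.train.Int64List(value=event.tolist()))
              for event in person]
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(
            feature_list={"codes": tf.train.FeatureList(feature=events)}))

with tf.io.TFRecordWriter("codes_shard_0.tfrecord") as writer:
    for person in codes:  # `codes` is the nested array from the example below
        writer.write(person_to_sequence_example(person).SerializeToString())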

I am using TensorFlow 1.14.0 with Python 3.6.

Since the inner dimensions have varying lengths, I have tried to use tf.data.Dataset.from_generator.

Question: how to fix the minimal example below (if possible)?

import numpy as np
import tensorflow as tf

tf.enable_eager_execution()  # TF 1.14: needed to iterate the dataset directly

codes = np.array([np.array([np.array([527,  38, 734]),
                            np.array([  4, 935])]),
                  np.array([np.array([810])]),
                  np.array([np.array([315, 802])]),
                  np.array([np.array([317,  29, 861]),
                            np.array([906]),
                            np.array([439, 655, 893, 130])])])

codes_dataset = tf.data.Dataset.from_generator(lambda: codes, (tf.int64, tf.int64))

print(codes_dataset)
# <DatasetV1Adapter shapes: (<unknown>, <unknown>), types: (tf.int64, tf.int64)>

for value in codes_dataset:
    print(value)

codes_dataset is created but the for loop errors out:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-129-d4ed489ff27f> in <module>()
----> 1 for value in codes_dataset:
      2     print(value)

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
    584 
    585   def __next__(self):  # For Python 3 compatibility
--> 586     return self.next()
    587 
    588   def _next_internal(self):

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in next(self)
    621     """
    622     try:
--> 623       return self._next_internal()
    624     except errors.OutOfRangeError:
    625       raise StopIteration

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
    613             self._iterator_resource,
    614             output_types=self._flat_output_types,
--> 615             output_shapes=self._flat_output_shapes)
    616 
    617       return self._structure._from_compatible_tensor_list(ret)  # pylint: disable=protected-access

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py in iterator_get_next_sync(iterator, output_types, output_shapes, name)
   2118       else:
   2119         message = e.message
-> 2120       _six.raise_from(_core._status_to_exception(e.code, message), None)
   2121   # Add nodes to the TensorFlow graph.
   2122   if not isinstance(output_types, (list, tuple)):

/opt/tools/python/anaconda3/lib/python3.6/site-packages/six.py in raise_from(value, from_value)

InvalidArgumentError: TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was (tf.int64, tf.int64), but the yielded element was [array([527,  38, 734]) array([  4, 935])].
Traceback (most recent call last):

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 520, in generator_py_func
    flattened_values = nest.flatten_up_to(output_types, values)

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/util/nest.py", line 398, in flatten_up_to
    assert_shallow_structure(shallow_tree, input_tree)

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/util/nest.py", line 301, in assert_shallow_structure
    "Input has type: %s." % type(input_tree))

TypeError: If shallow structure is a sequence, input must also be a sequence. Input has type: <class 'numpy.ndarray'>.


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 209, in __call__
    ret = func(*args)

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 525, in generator_py_func
    "element was %s." % (output_types, values))

TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was (tf.int64, tf.int64), but the yielded element was [array([527,  38, 734]) array([  4, 935])].


     [[{{node PyFunc}}]] [Op:IteratorGetNextSync]

jaobalao
  • Why do you think using a generator will make variable size arrays more acceptable? Is there something in the tensorflow docs about that? – hpaulj Jul 24 '19 at 00:34
  • I decided to try from_generator because of Derek Murray's responses/comments: https://stackoverflow.com/questions/47580716/how-to-input-a-list-of-lists-with-different-sizes-in-tf-data-dataset and https://stackoverflow.com/questions/46511328/tensorflow-dataset-from-generator-fails-with-pyfunc-exception/46557087#46557087 – jaobalao Jul 24 '19 at 02:51
  • Wouldn't it be easier to detect a pattern if all samples had the same shape or number of features? – hpaulj Jul 24 '19 at 07:03
  • I can't load all the data into memory, so it would not be easy to separate the samples by shape. The fact that the size is variable in 2 dimensions doesn't help either. There are a lot of different shapes, and some will have 1 or very few samples. – jaobalao Jul 24 '19 at 14:38

0 Answers