
I wish to create a pipeline to feed non-standard files (for example, with extension *.xxx) to a neural network. Currently I have structured my code as follows:

  1) I define a list of paths to the training files

  2) I define an instance of the tf.data.Dataset object containing these paths

  3) I map over the Dataset a Python function that takes each path and returns the associated NumPy array (loaded from a folder on the PC); this array has shape [256, 256, 192].

  4) I define an initializable iterator and then use it during network training.

My doubt concerns the size of the batches I provide to the network. I would like batches of size 64 to be supplied to the network. How could I do this? For example, if I use train_data.batch(b_size) with b_size = 1, the iterator yields one element of shape [256, 256, 192] at each iteration; what if I wanted to feed the neural net just 64 slices of this array?

This is an extract of my code:

    with tf.name_scope('data'):
        train_filenames = tf.constant(list_of_files_train)

        train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
        train_data = train_data.map(lambda filename: tf.py_func(
            self._parse_xxx_data, [filename], [tf.float32]))

        train_data = train_data.shuffle(buffer_size=len(list_of_files_train))
        train_data = train_data.batch(b_size)

        iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes)

        input_data = iterator.get_next()
        train_init = iterator.make_initializer(train_data)

  [...]

  with tf.Session() as sess:
      sess.run(train_init)
      _ = sess.run([self.train_op])

Thanks in advance

----------

I posted a solution to my problem as an answer below. I would still be happy to receive any comments or suggestions on possible improvements. Thank you ;)

gab

2 Answers


It's been a long time, but I'll post a possible solution for batching a dataset with a custom shape in TensorFlow, in case someone needs it.

The tf.data module offers the method unbatch() to unwrap the content of each dataset element. One can first unbatch the dataset and then batch it again in the desired way. It is often also a good idea to shuffle the unbatched dataset before batching it again (so that each batch contains random slices from random elements):

with tf.name_scope('data'):
    train_filenames = tf.constant(list_of_files_train)

    train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
    train_data = train_data.map(lambda filename: tf.py_func(
        self._parse_xxx_data, [filename], [tf.float32]))

    # un-batch first, then batch the data
    train_data = train_data.apply(tf.data.experimental.unbatch())

    train_data = train_data.shuffle(buffer_size=BSIZE)
    train_data = train_data.batch(b_size)

    # [...]
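
One detail worth noting: unbatch() splits each dataset element along its first axis, whereas the comments on the other answer below describe each volume as 192 slices of size 256x256 stored along the last axis. Below is a minimal sketch of how the unbatch step could be adapted under that assumption; the tf.reshape is there because tf.py_func discards the static shape, and the transpose order, buffer size, and batch size of 64 are illustrative assumptions rather than part of the original solution:

    # restore the static shape lost by tf.py_func, then move the slice axis first
    train_data = train_data.map(lambda volume: tf.reshape(volume, [256, 256, 192]))
    train_data = train_data.map(lambda volume: tf.transpose(volume, [2, 0, 1]))

    # each [192, 256, 256] volume now unbatches into 192 slices of shape [256, 256]
    train_data = train_data.apply(tf.data.experimental.unbatch())

    # shuffle individual slices coming from different volumes, then batch 64 at a time
    train_data = train_data.shuffle(buffer_size=4 * 192)
    train_data = train_data.batch(64)  # each element has shape [64, 256, 256]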
gab

If I understand your question correctly, you can try slicing the array to the shape you want inside your self._parse_xxx_data function.
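
For illustration, a hypothetical sketch of that suggestion, assuming the parse function could return only a random subset of 64 slices per volume; the _load_volume helper is a placeholder, not the asker's actual loading code:

    import numpy as np

    def _parse_xxx_data(self, filename):
        # hypothetical loader returning the full [256, 256, 192] volume
        volume = self._load_volume(filename)
        # keep only 64 randomly chosen slices along the last axis
        idx = np.random.choice(volume.shape[-1], size=64, replace=False)
        return volume[:, :, idx].astype(np.float32)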

Fan Luo
  • Thank you for the reply. Unfortunately, due to time constraints, it would be better to batch the array after calling this function. In fact, due to many constraints, I have to produce an array of [256, 256, 192] within the self._parse_xxx_data() function. Specifically, these are 192 images of size 256x256, whose production is time-consuming: so I do not want to keep only 64 images, because it would be a waste of "production" (the discarded ones can also be useful for training the neural net). – gab Jun 26 '18 at 13:32
  • Maybe you should first put one image per file before training, because when training a neural network you have to make sure the images in each batch are randomly sampled from the whole dataset. – Fan Luo Jun 26 '18 at 14:07
  • Unfortunately, this is also a bad choice in my case, because I need to pre-process the volumes as a whole (e.g. random 3D rotation, standardization based on whole-volume statistics, etc.) and then slice these 3D volumes at run-time, because I want the neural net to work on 2D slices. Saving such a huge number of 2D slices would definitely be a sub-optimal and impractical choice for me. – gab Jun 26 '18 at 14:13