
I have two TFRecord files, A and B, of different sizes and containing different data elements.

I need to take all possible pairs of records from A and B. During training or testing, an epoch should end only when all combinations have been exhausted, after which the process should resume for the next epoch.

In doing this, I would also like to specify a batch size.

I have gone through the documentation of tf.data.Dataset and have found nothing that does this.

Of course, this could be accomplished with a Python generator. But unfortunately that is not ideal because, according to the documentation, Python generators are bound by the GIL (global interpreter lock).

Thus, suppose A contains {image1, image2, image3}, while B contains {im1, im2, im3, im4, im5, im6}, and I have specified a batch size of 2. Then I would like the output to be something like the following:

(image1, im1) and (image2, im4)

(image3, im2) and (image1, im2)

(image2, im1) and (image2, im3)

..............

12 more combinations (18 in total)

and then the next epoch starts.

How can this be achieved in TensorFlow?

uj14
2 Answers


There are some SO posts about how to compute the Cartesian product of two arrays using NumPy or TensorFlow.
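
For example, for small in-memory arrays a meshgrid-based approach works (a minimal sketch, assuming TF 2.x eager mode; the variable names are made up):

import tensorflow as tf

a = tf.constant([1, 2, 3])
b = tf.constant([5, 6, 7, 8])

# meshgrid expands a and b to a common (len(a), len(b)) grid;
# flattening and stacking the grids gives every (a_i, b_j) pair.
aa, bb = tf.meshgrid(a, b, indexing='ij')
pairs = tf.stack([tf.reshape(aa, [-1]), tf.reshape(bb, [-1])], axis=1)
print(pairs.numpy())  # 12 rows, one per combination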

If your arrays are too big for an in-memory computation, your best bet is probably to use two tf.data.Dataset objects (one for each array) and make a double loop:

for a in dataset_A:
    for b in dataset_B.batch(2):
        batch = [[a, b[0]], [a, b[1]]]  # Or something similar (there may be a TF function for this)

When wrapped in a `@tf.function`, looping over a Dataset is known to be fast.
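
If you prefer to keep the pairing entirely inside the tf.data pipeline (so that batching and epoch boundaries behave as usual), one possible variant is to express the double loop with flat_map. This is only a sketch, assuming dataset_A and dataset_B are already built (e.g. from the two TFRecord files):

# Pair every element of dataset_A with every element of dataset_B.
cartesian = dataset_A.flat_map(
    lambda a: dataset_B.map(lambda b: (a, b)))

# One epoch of `cartesian` has len(A) * len(B) elements, so the epoch
# ends only after all combinations have been seen.
cartesian = cartesian.shuffle(buffer_size=1000).batch(2)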

AlexisBRENON

You can use the tf.data.Dataset.from_generator function, where the generator function implements your logic, e.g. the cross product of two other datasets. To randomly draw pairs of samples from the zipped datasets db1 and db2, I shuffled each dataset independently.

import tensorflow as tf
tf.enable_eager_execution()  # TF 1.x; in TF 2.x eager mode is on by default

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]

# Shuffle each dataset independently and repeat them indefinitely.
db1 = tf.data.Dataset.from_tensor_slices(A).shuffle(len(A)).repeat()
db2 = tf.data.Dataset.from_tensor_slices(B).shuffle(len(B)).repeat()

def cross_db_generator():
    # Zip the two (shuffled, repeated) datasets and yield one pair at a time.
    for db1_example, db2_example in zip(db1, db2):
        yield db1_example, db2_example


cross_db = tf.data.Dataset.from_generator(cross_db_generator, output_types=(tf.int32, tf.int32))
cross_db = cross_db.batch(2)

# db1/db2 repeat forever, so take just a few batches for the demo.
for sample in cross_db.take(4):
    print((sample[0][0].numpy(), sample[1][0].numpy()), (sample[0][1].numpy(), sample[1][1].numpy()))
Kaushik Roy
  • using `from_generator` will not be efficient because it is bound by Python's GIL, right? – uj14 Oct 08 '19 at 12:21
  • Sorry, I haven't investigated the efficiency of `from_generator`. You may want to check out this question on [parallelizing tf.data.Dataset.from_generator](https://stackoverflow.com/questions/47086599). – Kaushik Roy Oct 08 '19 at 12:36