I set up my pipeline starting with a filename queue as in the following pseudocode:
filename_queue = tf.train.string_input_producer(["file0.pd", "file1.pd"])
The filenames point to TFRecord files, each containing multiple serialized tf.train.Example protos holding images.
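For context, the records in these files were written with something along the lines of the sketch below; the feature keys 'image_raw' and 'label' are illustrative placeholders, not necessarily the exact keys used.

import tensorflow as tf

def write_examples(filename, images, labels):
    # images: iterable of raw image byte strings, labels: iterable of ints
    with tf.python_io.TFRecordWriter(filename) as writer:
        for image_bytes, label in zip(images, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                'image_raw': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image_bytes])),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())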
Following the TensorFlow guide, I define a function that reads one example:
def read_my_file_format(filename_queue):
    reader = tf.SomeReader()
    key, record_string = reader.read(filename_queue)
    example, label = tf.some_decoder(record_string)
    processed_example = some_processing(example)
    return processed_example, label
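Concretely, for my TFRecord case this reader looks roughly like the following sketch; the feature keys 'image_raw' and 'label' are placeholders for whatever keys the examples were actually written with.

import tensorflow as tf

def read_my_file_format(filename_queue):
    reader = tf.TFRecordReader()
    key, record_string = reader.read(filename_queue)
    features = tf.parse_single_example(
        record_string,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    # shuffle_batch below needs a static shape, so in practice the image is
    # reshaped here, e.g. image = tf.reshape(image, [height, width, channels])
    processed_example = tf.cast(image, tf.float32) / 255.0  # simple scaling
    label = tf.cast(features['label'], tf.int32)
    return processed_example, label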
This reader function is then used to build a batch queue:
def input_pipeline(filenames, batch_size):
    filename_queue = tf.train.string_input_producer(filenames)
    example, label = read_my_file_format(filename_queue)
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=100,
        min_after_dequeue=10)
    return example_batch, label_batch
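For completeness, this is roughly how I consume the batches, using the standard TF 1.x queue-runner boilerplate (the step count is arbitrary):

example_batch, label_batch = input_pipeline(["file0.pd", "file1.pd"], batch_size=32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # string_input_producer and shuffle_batch add queue runners that must be started
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(1000):  # some number of training steps
        images, labels = sess.run([example_batch, label_batch])
        # ... run the training op on images/labels here ...
    coord.request_stop()
    coord.join(threads)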
I am looking for a way to split the data randomly into training and test sets. I don't want to write the training and test sets to separate files; instead, each image should be randomly assigned to the training or the test set, independent of the file it is read from. Ideally, I would like to split the input pipeline into a training queue and a test queue.
Here is what I normally do in NumPy when I have to split a huge dataset:
import numpy as np
from numpy.random import RandomState

queue = range(10)
weights = (.8, .2)  # create 2 partitions with these weights

def sampler(partition, seed=0):
    # Each partition gets its own RandomState with the same seed, so the k-th
    # element receives the same draw in every filter and the partitions are
    # complementary.
    rng = RandomState(seed)
    return lambda x: rng.choice(np.arange(len(weights)), p=weights) == partition

def split(queue, weights):
    # filter the queue once per partition
    return [filter(sampler(partition), queue) for partition in range(len(weights))]

(train, test) = split(queue, weights)
print(list(train))  # [0, 1, 2, 3, 4, 5, 6, 9]
print(list(test))   # [7, 8]
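The closest graph-level translation of that sampler I can come up with is the sketch below (tf_sampler is my own helper name; tf.multinomial draws the partition index from the same weights), but I don't see how to route each example into one of two shuffle_batch queues based on that draw, which is exactly what I am asking about.

import tensorflow as tf

weights = (0.8, 0.2)  # same partition weights as above

def tf_sampler(seed=0):
    # Draw a single partition index in {0, 1} with probabilities `weights`;
    # tf.multinomial expects unnormalized log-probabilities of shape
    # [batch_size, num_classes] and returns int64 indices.
    logits = tf.log(tf.constant([list(weights)]))
    return tf.to_int32(tf.multinomial(logits, num_samples=1, seed=seed)[0, 0])

# re-evaluated on every run of the graph, i.e. once per dequeued example
partition = tf_sampler()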