I have a large set of image patches, which I used to store inside a numpy array of shape [n_patches x W x H x 3] (with n_patches being super large).
I need to perform the two following operations a certain number of times:
- First, shuffle the indices of the patches
- Then, iterate through the set, producing BATCH_SIZE patches at a time, which I need only once and can then discard
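In other words, the access pattern I need looks roughly like this generator (load_patches here is a hypothetical placeholder for whatever actually fetches the patches for a list of indices):

```python
import numpy as np

def iterate_batches(n_patches, batch_size, load_patches):
    """Shuffle the patch indices, then yield one batch at a time.

    `load_patches` is a placeholder for whatever fetches the patches
    belonging to an index array; only one batch lives in memory at once.
    """
    perm = np.random.permutation(n_patches)
    for start in range(0, n_patches, batch_size):
        yield load_patches(perm[start:start + batch_size])
```

The point is that the full array never has to be materialized, only the slice returned by `load_patches` for the current batch.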
My approach was to use the one that can be found in the TensorFlow repository (in contrib.learn.python.learn.datasets.mnist), which creates a huge numpy array and iterates over it with a next_batch method:
def next_batch(self, batch_size, fake_data=False):
    """Return the next `batch_size` examples from this data set."""
    if fake_data:
        fake_image = [1] * 784
        if self.one_hot:
            fake_label = [1] + [0] * 9
        else:
            fake_label = 0
        return [fake_image for _ in xrange(batch_size)], [
            fake_label for _ in xrange(batch_size)
        ]
    start = self._index_in_epoch
    self._index_in_epoch += batch_size
    if self._index_in_epoch > self._num_examples:
        # Finished epoch
        self._epochs_completed += 1
        # Shuffle the data
        perm = numpy.arange(self._num_examples)
        numpy.random.shuffle(perm)
        self._images = self._images[perm]
        self._labels = self._labels[perm]
        # Start next epoch
        start = 0
        self._index_in_epoch = batch_size
        assert batch_size <= self._num_examples
    end = self._index_in_epoch
    return self._images[start:end], self._labels[start:end]
(This code is not mine - it was written by the TF team)
I know of TensorFlow's binary format .tfrecords, but I would like to have access each time to Python objects (numpy arrays if possible, or something that can be parsed without TF). I was wondering if there is a Pythonic way to use the same method but to load into memory only the current patches.
EDIT 1: What would be neat, if at all possible, is to iterate over a binary file and be able to convert a specific zone of it to a numpy array, something similar to tfrecords...
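For the "specific zone of a binary file" part, plain numpy can already do this by seeking to an offset and reading a fixed number of items with np.fromfile (file name, dtype and patch shape here are just illustrative):

```python
import numpy as np

W, H = 16, 16
patch_bytes = W * H * 3 * 4  # size of one float32 patch in bytes

# Write a few patches to a raw binary file (stand-in for the real data).
patches = np.random.rand(10, W, H, 3).astype(np.float32)
patches.tofile('patches.dat')

# Read only patch number 7 from disk, without touching the rest.
with open('patches.dat', 'rb') as f:
    f.seek(7 * patch_bytes)
    patch = np.fromfile(f, dtype=np.float32,
                        count=W * H * 3).reshape(W, H, 3)
```

This requires a fixed dtype and patch shape so that offsets can be computed, which is exactly my situation.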
EDIT 2: Following the remark of @user6758673, I think the idea I had is actually implemented in numpy.memmap. I found a related issue about using memmap for batch processing; interacting with memmap seems to be troublesome, but I'm going to give it a try. That thread also made me wonder: does something like this exist in hdf5 as well? It seems the scikit-learn repo is a good place to start digging...
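To check that memmap behaves the way I hope, here is a minimal sketch (file name and shapes are illustrative):

```python
import numpy as np

n_patches, W, H = 100, 16, 16

# Write some patches to a raw binary file (stand-in for my real data).
patches = np.random.rand(n_patches, W, H, 3).astype(np.float32)
patches.tofile('patches.dat')

# Memory-map the file: nothing is read until a slice is actually accessed.
mm = np.memmap('patches.dat', dtype=np.float32, mode='r',
               shape=(n_patches, W, H, 3))

# Fancy indexing with shuffled indices copies just those patches into RAM.
perm = np.random.permutation(n_patches)
batch = np.asarray(mm[perm[:32]])  # ordinary ndarray, shape (32, W, H, 3)
```

So the shuffle happens on the (cheap) index array, and only the current batch is ever pulled off disk.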
EDIT 3: It seems I am not the first to run into this issue; there is a great discussion on the Lasagne repo.
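Regarding the hdf5 question above, the equivalent sketch with h5py would look something like this (again, file and dataset names are just for illustration); one caveat I've read is that h5py fancy indexing requires index lists in increasing order, so shuffling would have to happen per chunk or after loading:

```python
import numpy as np
import h5py

n_patches, W, H = 100, 16, 16
patches = np.random.rand(n_patches, W, H, 3).astype(np.float32)

# Store the patches once in an HDF5 file.
with h5py.File('patches.h5', 'w') as f:
    f.create_dataset('patches', data=patches)

# Later, slicing the dataset reads only that slice from disk.
with h5py.File('patches.h5', 'r') as f:
    batch = f['patches'][32:64]  # plain ndarray, shape (32, W, H, 3)
```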