I have a large set of image patches, which I used to store inside a numpy array of shape [n_patches x W x H x 3] (with n_patches being super large).
I need to perform the two following operations a certain number of times:
- First, shuffle the indices of the patches
- Then, iterate through the set, producing BATCH_SIZE patches at a time, which I need only once and can then discard
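In other words, the access pattern I need looks roughly like this generator (load_patches here is a hypothetical placeholder for whatever actually fetches the patches for a list of indices):

```python
import numpy as np

def iterate_batches(n_patches, batch_size, load_patches):
    """Shuffle the patch indices, then yield one batch at a time.

    `load_patches` is a placeholder for whatever fetches the patches
    belonging to an index array; only one batch lives in memory at once.
    """
    perm = np.random.permutation(n_patches)
    for start in range(0, n_patches, batch_size):
        yield load_patches(perm[start:start + batch_size])
```

The point is that the full array never has to be materialized, only the slice returned by `load_patches` for the current batch.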
My approach was to use the one that can be found in the TensorFlow repository (in contrib.learn.python.learn.datasets.mnist), which creates a huge numpy array and iterates over it with a next_batch method:
def next_batch(self, batch_size, fake_data=False):
    """Return the next `batch_size` examples from this data set."""
    if fake_data:
        fake_image = [1] * 784
        if self.one_hot:
            fake_label = [1] + [0] * 9
        else:
            fake_label = 0
        return [fake_image for _ in xrange(batch_size)], [
            fake_label for _ in xrange(batch_size)
        ]
    start = self._index_in_epoch
    self._index_in_epoch += batch_size
    if self._index_in_epoch > self._num_examples:
        # Finished epoch
        self._epochs_completed += 1
        # Shuffle the data
        perm = numpy.arange(self._num_examples)
        numpy.random.shuffle(perm)
        self._images = self._images[perm]
        self._labels = self._labels[perm]
        # Start next epoch
        start = 0
        self._index_in_epoch = batch_size
        assert batch_size <= self._num_examples
    end = self._index_in_epoch
    return self._images[start:end], self._labels[start:end]
(This code is not mine - it was written by the TF team)
I know of TensorFlow's binary format .tfrecords, but I would like to have access each time to Python objects (numpy arrays if possible, or something that can be parsed without TF). I was wondering if there is a Pythonic way to use the same method but to load into memory only the current patches.
EDIT 1: What would be neat, if at all possible, is to iterate over a binary file and be able to convert a specific zone of it to a numpy array, something similar to tfrecords...
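For the "specific zone of a binary file" part, plain numpy can already do this by seeking to an offset and reading a fixed number of items with np.fromfile (file name, dtype and patch shape here are just illustrative):

```python
import numpy as np

W, H = 16, 16
patch_bytes = W * H * 3 * 4  # size of one float32 patch in bytes

# Write a few patches to a raw binary file (stand-in for the real data).
patches = np.random.rand(10, W, H, 3).astype(np.float32)
patches.tofile('patches.dat')

# Read only patch number 7 from disk, without touching the rest.
with open('patches.dat', 'rb') as f:
    f.seek(7 * patch_bytes)
    patch = np.fromfile(f, dtype=np.float32,
                        count=W * H * 3).reshape(W, H, 3)
```

This requires a fixed dtype and patch shape so that offsets can be computed, which is exactly my situation.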
EDIT 2: Following the remark of @user6758673, I think the idea I had is actually implemented in numpy.memmap. I found a related issue about using memmap for batch processing; interacting with memmap seems to be troublesome, but I'm going to give it a try. That thread also made me wonder: does something like this exist in hdf5 as well? It seems the scikit-learn repo is a good place to start digging...
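To check that memmap behaves the way I hope, here is a minimal sketch (file name and shapes are illustrative):

```python
import numpy as np

n_patches, W, H = 100, 16, 16

# Write some patches to a raw binary file (stand-in for my real data).
patches = np.random.rand(n_patches, W, H, 3).astype(np.float32)
patches.tofile('patches.dat')

# Memory-map the file: nothing is read until a slice is actually accessed.
mm = np.memmap('patches.dat', dtype=np.float32, mode='r',
               shape=(n_patches, W, H, 3))

# Fancy indexing with shuffled indices copies just those patches into RAM.
perm = np.random.permutation(n_patches)
batch = np.asarray(mm[perm[:32]])  # ordinary ndarray, shape (32, W, H, 3)
```

So the shuffle happens on the (cheap) index array, and only the current batch is ever pulled off disk.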
EDIT 3: It seems I am not the first to run into this issue; there is a great discussion on the Lasagne repo.
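Regarding the hdf5 question above, the equivalent sketch with h5py would look something like this (again, file and dataset names are just for illustration); one caveat I've read is that h5py fancy indexing requires index lists in increasing order, so shuffling would have to happen per chunk or after loading:

```python
import numpy as np
import h5py

n_patches, W, H = 100, 16, 16
patches = np.random.rand(n_patches, W, H, 3).astype(np.float32)

# Store the patches once in an HDF5 file.
with h5py.File('patches.h5', 'w') as f:
    f.create_dataset('patches', data=patches)

# Later, slicing the dataset reads only that slice from disk.
with h5py.File('patches.h5', 'r') as f:
    batch = f['patches'][32:64]  # plain ndarray, shape (32, W, H, 3)
```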