Among the things I've tried so far, a PyTables solution is currently the best, followed by a solution that uses numpy's support for memmapped arrays. The PyTables solution is not straightforward, though: if you use a shuffled array of integers to directly index a PyTables array, it's very slow. Much faster is the following two-step process:
- Select a random subset of the array using a boolean index array. This must be done in a chunkwise fashion, because passing the index array directly to the PyTables array is slow:
  - Preallocate a numpy array and create a list of slices that split the PyTables array into chunks.
  - Read each chunk entirely into memory, and then use the corresponding chunk of the index array to select the correct values for that chunk.
  - Store the selected values in the preallocated array.
- Then shuffle the preallocated array.
This process produces a result just as random as an ordinary shuffle would. If that doesn't seem obvious, consider this: selecting a random x-row subset and then shuffling it gives (n choose x) * x! = x! * n! / (x! * (n - x)!) = n! / (n - x)! equally likely outcomes, which is exactly the number of ways to draw x ordered rows from n without replacement. This method is fast enough to do a shuffle-on-load for every training cycle. It's also able to compress the data down to ~650M -- nearly a 90% deflation.
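The compression is just PyTables' built-in filter support doing its job. As a rough sketch, this is how a compressed, chunked array like the one above might be created -- the file name, atom type, shape, and Blosc settings below are placeholders, not the exact values I use:

import numpy
import tables

# Illustrative only: create an extendable, Blosc-compressed PyTables array
# and append float32 rows to it. Axis 0 (length 0 here) is the extendable axis.
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file('training_data.h5', mode='w') as f:
    data = f.create_earray(f.root, 'X',
                           atom=tables.Float32Atom(),
                           shape=(0, 100),
                           filters=filters,
                           expectedrows=1000000)
    for _ in range(10):
        data.append(numpy.random.rand(1000, 100).astype('float32'))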
Here's my current implementation; this is called once for every training chunk in the corpus. (The returned arrays are shuffled elsewhere.)
def _h5_fast_bool_ix(self, h5_array, ix, read_chunksize=100000):
    '''Iterate over an h5 array chunkwise to select a random subset
    of the array. `h5_array` should be the array itself; `ix` should
    be a boolean index array with as many values as `h5_array` has
    rows; and you can optionally set the number of rows to read per
    chunk with `read_chunksize` (default is 100000). For some reason
    this is much faster than using `ix` to index the array directly.'''
    # Use ceiling division so a final partial chunk is not dropped.
    n_chunks = -(-h5_array.shape[0] // read_chunksize)
    slices = [slice(i * read_chunksize, (i + 1) * read_chunksize)
              for i in range(n_chunks)]
    a = numpy.empty((ix.sum(), h5_array.shape[1]), dtype=float)
    a_start = 0
    for sl in slices:
        # Read the whole chunk into memory, then apply the matching
        # slice of the boolean index to it.
        chunk = h5_array[sl][ix[sl]]
        a_end = a_start + chunk.shape[0]
        a[a_start:a_end] = chunk
        a_start = a_end
    return a
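To make the call pattern concrete, here's a hedged usage sketch. The file and node names, the sample size, and the standalone `h5_fast_bool_ix` wrapper (the method above without `self`) are all assumptions for illustration, and the joint permutation at the end is just one way the returned arrays could be shuffled:

import numpy
import tables

# Hypothetical usage sketch. Assumes the training data lives in nodes /X and
# /Y of an HDF5 file, and that `h5_fast_bool_ix` is the method defined above
# made available as a plain function (without the `self` argument).
with tables.open_file('training_data.h5', mode='r') as f:
    h5_X, h5_Y = f.root.X, f.root.Y
    n_rows = h5_X.shape[0]
    sample_size = 500000

    # Boolean index marking a random subset of rows.
    ix = numpy.zeros(n_rows, dtype=bool)
    ix[numpy.random.choice(n_rows, size=sample_size, replace=False)] = True

    X = h5_fast_bool_ix(h5_X, ix)
    Y = h5_fast_bool_ix(h5_Y, ix)

# Shuffle X and Y with the same permutation so the input/output pairs
# stay aligned.
perm = numpy.random.permutation(X.shape[0])
X, Y = X[perm], Y[perm]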
It's somewhat crazy to me that an approach that reads the entire PyTables array chunk by chunk is faster here than one that reads only the randomly selected rows in a single pass. But hey, it works. With a bit more indirection, this could be adapted for loading arbitrary non-random permutations (rough sketch below), but that adds more complexity than it's worth here.
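In case it's useful, here's a rough sketch of that indirection, assuming the requested row indices are unique and that `h5_fast_bool_ix` is again the method above as a plain function. Load the needed rows in ascending order, then reorder them to match the requested permutation:

import numpy

def load_permuted(h5_array, order, read_chunksize=100000):
    '''Hypothetical helper: `order` is an array of unique integer row
    indices in arbitrary order. Returns those rows of `h5_array` in
    exactly that order, using the chunkwise boolean reader above.'''
    ix = numpy.zeros(h5_array.shape[0], dtype=bool)
    ix[order] = True
    # The chunkwise reader yields the selected rows in ascending row order...
    rows_ascending = h5_fast_bool_ix(h5_array, ix, read_chunksize)
    # ...so map each requested index to its rank among the sorted selection
    # and use that to restore the requested ordering.
    positions = numpy.searchsorted(numpy.sort(order), order)
    return rows_ascending[positions]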
The mmap solution is here for reference, for those who need a pure numpy solution for whatever reason. It shuffles all the data in about 25 minutes, while the above solution manages the same in less than half that time. It should also scale linearly, because mmap allows (relatively) efficient random access.
import numpy
import os
import random

X = []
Y = []
# Sort the filenames so the X and Y chunk files line up in the same order.
for filename in sorted(os.listdir('input')):
    X.append(numpy.load(os.path.join('input', filename), mmap_mode='r'))
for filename in sorted(os.listdir('output')):
    Y.append(numpy.load(os.path.join('output', filename), mmap_mode='r'))

# Build a flat list of (chunk, row) coordinates and shuffle it.
indices = [(chunk, row) for chunk, rows in enumerate(X)
           for row in range(rows.shape[0])]
random.shuffle(indices)

newchunks = 50
newchunksize = len(indices) // newchunks

for i in range(0, len(indices), newchunksize):
    print(i)
    rows = [X[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('X_shuffled_' + str(i), numpy.array(rows))
    rows = [Y[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('Y_shuffled_' + str(i), numpy.array(rows))
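For completeness, a minimal sketch of how the shuffled chunk files written above might be read back during training. The glob pattern matches the names produced by numpy.save (which appends '.npy'); the training step itself is left as a placeholder:

import glob
import numpy

# Iterate over the shuffled chunk files in X/Y pairs and hand each pair to
# the training loop. Chunk order doesn't matter, since the rows are already
# shuffled across chunks.
for x_path in glob.glob('X_shuffled_*.npy'):
    y_path = x_path.replace('X_shuffled_', 'Y_shuffled_')
    X_chunk = numpy.load(x_path)
    Y_chunk = numpy.load(y_path)
    # ... train on (X_chunk, Y_chunk) here ...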