
Based on the answers here, it doesn't seem like there's an easy way to fill a 2D numpy array with data from a generator.

However, if someone can think of a way to vectorize or otherwise speed up the following function, I would appreciate it.

The difference here is that I want to process the values from the generator in batches rather than create the whole array in memory. The only way I could think of doing that was with a for loop.

import numpy as np
from itertools import permutations

permutations_of_values = permutations(range(1,20), 7)

def array_from_generator(generator, arr):
    """Fills the numpy array provided with values from
    the generator provided. Number of columns in arr
    must match the number of values yielded by the 
    generator."""
    count = 0
    for row in arr:
        try:
            item = next(generator)
        except StopIteration:
            break  # generator exhausted before the array was filled
        row[:] = item  # copy the yielded tuple into this row
        count += 1
    return arr[:count, :]  # drop any unfilled rows

batch_size = 100000

empty_array = np.empty((batch_size, 7), dtype=int)
batch_of_values = array_from_generator(permutations_of_values, empty_array)

print(batch_of_values[0:5])

Output:

[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]

Speed test:

%timeit array_from_generator(permutations_of_values, empty_array)
10 loops, best of 3: 137 ms per loop

ADDITION:

As suggested by @COLDSPEED (thanks), here is a version that uses a list to gather the data from the generator. It's about twice as fast as the code above. Can anyone improve on this?

permutations_of_values = permutations(range(1,20), 7)

def array_from_generator2(generator, rows=batch_size):
    """Creates a numpy array from a specified number 
    of values from the generator provided."""
    data = []
    for _ in range(rows):
        try:
            data.append(next(generator))
        except StopIteration:
            break  # generator exhausted early
    return np.array(data)

batch_size = 100000

batch_of_values = array_from_generator2(permutations_of_values, rows=100000)

print(batch_of_values[0:5])

Output:

[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]

Speed test:

%timeit array_from_generator2(permutations_of_values, rows=100000)
10 loops, best of 3: 85.6 ms per loop
  • It should be simpler to fill a list and then call `np.array` on the resultant. – cs95 Sep 18 '17 at 01:57
  • `fromiter`, as discussed in a couple of the linked answers, is the only way of creating an array directly from the output of a generator. Otherwise you need to create a list and build or fill in the array from that. Generators can save memory during intermediate processing (cf. the list equivalent), but aren't any faster. – hpaulj Sep 18 '17 at 02:11
  • `fromiter` would be great, but it only works with series (1-dimensional arrays). – Bill Sep 18 '17 at 02:19
  • Can you know the dimensions ahead of time? Then you can still use `fromiter`. – juanpa.arrivillaga Sep 18 '17 at 02:32
  • If you read the documentation, it states that `fromiter` creates "a new 1-dimensional array from an iterable object". What I am trying to do here is 2-dimensional because each item from the generator is a tuple of 7 values. Maybe it's time to extend `fromiter` to handle multi-dimensional iterators... – Bill Sep 18 '17 at 02:48
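
For reference, here is what `fromiter` can do directly, in its 1-D form (a minimal sketch; the squaring generator is just an illustration):

import numpy as np

gen = (x * x for x in range(10))           # any 1-D stream of scalars
a = np.fromiter(gen, dtype=int, count=10)  # count is optional, but lets numpy pre-allocate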

1 Answer


You can calculate the size in advance in essentially constant time. Just do that, and use numpy.fromiter:

In [1]: import math; import numpy as np; from itertools import permutations, chain

In [2]: def n_chose_k(n, k, fac=math.factorial):
    ...:     return fac(n)//fac(n-k)  # number of k-permutations of n: n!/(n-k)!
    ...:

In [3]: def permutations_to_array(r, k):
    ...:     n = len(r)
    ...:     size = int(n_chose_k(n, k))
    ...:     it = permutations(r, k)
    ...:     arr = np.fromiter(chain.from_iterable(it),
    ...:                       count=size, dtype=int)
    ...:     arr.shape = size//k, k
    ...:     return arr
    ...:

In [4]: arr = permutations_to_array(range(1,20), 7)

In [5]: arr.shape
Out[5]: (36279360, 7)

In [6]: arr[0:5]
Out[6]:
array([[ 1,  2,  3,  4,  5,  6,  7],
       [ 1,  2,  3,  4,  5,  6,  8],
       [ 1,  2,  3,  4,  5,  6,  9],
       [ 1,  2,  3,  4,  5,  6, 10],
       [ 1,  2,  3,  4,  5,  6, 11]])

This will work as long as `r` is a sequence with a `len()`.
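
If `r` is itself a generator, one simple workaround (a sketch, at the cost of materializing `r` in memory) is to convert it to a tuple first:

r_gen = (x for x in range(1, 20))             # no len() available
arr = permutations_to_array(tuple(r_gen), 7)  # materialize so len() works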

Edited to add an implementation I cooked up: a generator that yields the permutations in chunks of `batchsize` rows (`batchsize*k` values), with a `trim` option!

import math
from itertools import permutations, repeat, chain

import numpy as np

def n_chose_k(n, k, fac=math.factorial):
    """Number of k-permutations of n items: n!/(n-k)!."""
    return fac(n) // fac(n - k)

def permutations_in_batches(r, k, batchsize=None, fill=0, dtype=int, trim=False):
    n = len(r)
    size = n_chose_k(n, k)             # total number of permutations (rows)
    if batchsize is None or batchsize > size:
        batchsize = size
    perms = chain.from_iterable(permutations(r, k))
    count = batchsize * k              # flat values consumed per full batch
    remaining = size * k               # flat values still to be drawn
    while remaining >= count:          # yield as many full batches as possible
        current = np.fromiter(perms, count=count, dtype=dtype)
        current.shape = batchsize, k
        yield current
        remaining -= count
    if remaining:                      # a final, partial batch is left over
        if trim:                       # yield only the rows that remain
            finalcount = remaining     # remaining is always divisible by k
            padding = ()
            finalshape = remaining // k, k
        else:                          # pad the final batch out with `fill`
            finalcount = count
            padding = repeat(fill, count - remaining)
            finalshape = batchsize, k
        current = np.fromiter(chain(perms, padding), count=finalcount, dtype=dtype)
        current.shape = finalshape
        yield current
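
A minimal usage sketch (`process` here is a hypothetical stand-in for whatever per-batch work you need):

for batch in permutations_in_batches(range(1, 20), 7, batchsize=100000):
    process(batch)  # hypothetical: e.g. accumulate statistics or write to disk

Every yielded batch has shape `(batchsize, k)`; the last one is padded with `fill` unless `trim=True`, in which case it may have fewer rows.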
  • Where does `n_chose_k` come from? – Bill Sep 18 '17 at 03:09
  • @Bill sorry, forgot to put that in the answer – juanpa.arrivillaga Sep 18 '17 at 03:17
  • Very good, thanks. I think the dimensions of the resulting array are not quite right, though. Shouldn't it be `count=size*k` and `arr.resize((size, k))`? – Bill Sep 18 '17 at 03:53
  • @Bill, in my experience the `fromiter` `count` doesn't make a big difference in run time. And `arr.reshape(-1,k)` doesn't require this size either. – hpaulj Sep 18 '17 at 04:05
  • Thanks @juanpa.arrivillaga. Once I implemented your suggestion with a modification to allow it to yield one array at a time of size `rows=100000`, I timed it with the same dimensions as my example code above and the results were `27 ms per loop`. Fantastic. I could post the full solution but I don't want to take away from your answer... – Bill Sep 18 '17 at 04:27
  • @Bill yes, it should be easy enough to write an iterator that yields batches. – juanpa.arrivillaga Sep 18 '17 at 04:31
  • @Bill ended up cooking one up for fun if you want to see it – juanpa.arrivillaga Sep 18 '17 at 06:01
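
Putting those comments together, a corrected full-array version might look like this (a sketch based on the `count=size*k` and reshape suggestions above; note it materializes all ~254 million rows, roughly 14 GB as 64-bit ints):

import math
from itertools import permutations, chain
import numpy as np

size = math.factorial(19) // math.factorial(12)  # number of 7-permutations of 19
it = permutations(range(1, 20), 7)
arr = np.fromiter(chain.from_iterable(it), count=size * 7, dtype=int)
arr.shape = size, 7  # every permutation is captured this time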