4

I have a large array

data = np.empty((n, k))

where both n and k are large. I also have a lot of generators g, each with k elements, and I want to load each generator into a row in data. I can do:

data[i] = list(g)

or something similar, but this makes a copy of the data in g. I can load with a for loop:

for j, x in enumerate(g):
    data[i, j] = x

but I'm wondering if numpy has a way to do this already without copying or looping in Python.

I know in advance that each g has length k, and I'm happy to do some __len__ subclass patching if necessary. np.fromiter will accept something like that when creating a new array, but I'd rather load into this already existing array if possible, due to the constraints of my context.

mostsquares
  • 834
  • 8
  • 27
  • Possible duplicate of [How do I build a numpy array from a generator?](https://stackoverflow.com/questions/367565/how-do-i-build-a-numpy-array-from-a-generator) – ForceBru Apr 30 '19 at 16:12
  • I don't think it's a dupe -- in my context I won't be able to `np.concatenate` a bunch of results from the strategies in that question. Looking for an in-place version of what's described there. If there is none, then I guess maybe it is a dupe. – mostsquares Apr 30 '19 at 16:17
  • OK, actually I think it's not a great question lol. I was hoping to get some speedup from using a numpy fn instead of a for loop, but I think a python for loop is necessary because of the python nature of a generator. It's not like there is some underlying buffer that numpy could read faster using its c extensions. – mostsquares Apr 30 '19 at 16:21
  • 1
    As you already point out, looping will be necessary in any case. `np.fromiter`, which uses the array constructor [`PyArray_FromIter`](https://github.com/numpy/numpy/blob/v1.16.3/numpy/core/src/multiarray/ctors.c#L3903-L4011) does essentially just that. Unfortunately, there is no optional `out` parameter in this function, but I'm not sure you would get such a huge gain. Even from C, the program would have to keep jumping back to the Python generator, so it's never going to be super fast native-like speed. – jdehesa Apr 30 '19 at 16:30
  • Yeah, that makes a lot of sense. I'd consider it the answer to this question if you're in the mood to write it below. – mostsquares Apr 30 '19 at 16:45

3 Answers

1

There's not much you can do, as stated in the comments.

You can, however, consider these two approaches:

using numpy.fromiter

Instead of creating data = np.empty((n, k)) yourself, use numpy.fromiter with the count argument, which exists specifically for this case where you know the number of items in advance. That way numpy doesn't have to "guess" the size and re-allocate until the guess is large enough. Using fromiter lets the for loop run in C instead of Python. This might be a tiny bit faster, but the real bottleneck will likely be in your generators anyway.

Note that fromiter only deals with flat (1-D) arrays, so you need to read everything flattened (e.g. using chain.from_iterable) and only then call reshape:

import numpy
from itertools import chain

n = 20
k = 4
generators = (
    (i * j for j in range(k))
    for i in range(n)
)

# chain the per-row generators into one flat iterator, then reshape
flat_gen = chain.from_iterable(generators)
data = numpy.fromiter(flat_gen, 'int64', count=n*k)
data = data.reshape((n, k))
"""
array([[ 0,  0,  0,  0],
       [ 0,  1,  2,  3],
       [ 0,  2,  4,  6],
       [ 0,  3,  6,  9],
       [ 0,  4,  8, 12],
       [ 0,  5, 10, 15],
       [ 0,  6, 12, 18],
       [ 0,  7, 14, 21],
       [ 0,  8, 16, 24],
       [ 0,  9, 18, 27],
       [ 0, 10, 20, 30],
       [ 0, 11, 22, 33],
       [ 0, 12, 24, 36],
       [ 0, 13, 26, 39],
       [ 0, 14, 28, 42],
       [ 0, 15, 30, 45],
       [ 0, 16, 32, 48],
       [ 0, 17, 34, 51],
       [ 0, 18, 36, 54],
       [ 0, 19, 38, 57]])
"""

using cython

If you want to re-use data and avoid re-allocating the memory, you can't use numpy's fromiter anymore. IMHO the only way to avoid the Python for loop is to implement it in Cython. Again, this is very likely overkill, since you still have to read the generators in Python.

For reference, the C implementation of fromiter looks like this: https://github.com/numpy/numpy/blob/v1.18.3/numpy/core/src/multiarray/ctors.c#L4001-L4118
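
If Cython is not worth the trouble, one possible middle ground (a sketch, not something endorsed in the comments above) is to keep the pre-allocated data and fill it one row at a time with fromiter: each row still goes through a small C-built temporary, but the per-element loop stays out of Python. Assuming each generator yields exactly k values:

import numpy as np

n, k = 20, 4
data = np.empty((n, k))
generators = ((i * j for j in range(k)) for i in range(n))

for i, g in enumerate(generators):
    # fromiter builds a small 1-D temporary in C; the assignment then
    # copies it into the existing row of data
    data[i] = np.fromiter(g, dtype=data.dtype, count=k)

This is not truly in-place (there is one temporary row per generator), but it avoids re-allocating the full n-by-k block.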

M1L0U
  • 1,175
  • 12
  • 20
0

There is no faster way than the ones you described. You have to fill each element of the numpy array one way or another, either by iterating the generator or by materialising the entire list first.
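
If you want to verify that on your own workload, you can time the variants directly; a minimal sketch, where make_gen is just a stand-in for however your real generators are produced:

import timeit
import numpy as np

n, k = 200, 1000
data = np.empty((n, k))

def make_gen(i):
    # stand-in generator; substitute whatever produces your real rows
    return (float(i * j) for j in range(k))

def fill_with_list():
    for i in range(n):
        data[i] = list(make_gen(i))

def fill_with_loop():
    for i in range(n):
        for j, x in enumerate(make_gen(i)):
            data[i, j] = x

def fill_with_fromiter():
    for i in range(n):
        data[i] = np.fromiter(make_gen(i), dtype=data.dtype, count=k)

for f in (fill_with_list, fill_with_loop, fill_with_fromiter):
    print(f.__name__, timeit.timeit(f, number=5))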

Alexis Pister
  • 449
  • 3
  • 13
0

A couple of things here:

1) You can just say

for x in g:
    do_stuff(x)

Since g is a generator, the for loop understands how to get the data out of the generator.

2) You won't necessarily have to "copy" out of the generator (by design it doesn't hold the entire sequence in memory), but you will need to loop through it to fill up your numpy data structure. You might be able to squeeze out some performance (since your structures are large) with tools in numpy or itertools.

So the answer is "no", since you're using generators. If you don't need all of the data available at once, you can keep using generators to keep the memory profile small, but I don't have any context for what you're doing with the data.
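
As an illustration of that last point (a sketch only; the column-sum is a placeholder for whatever you actually do with the rows), if each row can be consumed as soon as it is produced, the full n-by-k array never needs to exist:

import numpy as np

k = 4
generators = ((i * j for j in range(k)) for i in range(20))

row = np.empty(k)        # one reusable row buffer
col_sums = np.zeros(k)   # placeholder reduction; substitute your own
for g in generators:
    for j, x in enumerate(g):
        row[j] = x
    col_sums += row
print(col_sums)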

theWanderer4865
  • 861
  • 13
  • 20
  • It has `np.fromiter` for creating an array from scratch. I'm looking for an in-place version of that function – mostsquares Apr 30 '19 at 16:13
  • There is no way to do it "in place" with a generator; it doesn't load all of the source material into memory, so `fromiter` is what you'll need to use. – theWanderer4865 Apr 30 '19 at 16:15