209

How can I build a numpy array out of a generator object?

Let me illustrate the problem:

>>> import numpy
>>> def gimme():
...   for x in xrange(10):
...     yield x
...
>>> gimme()
<generator object at 0x28a1758>
>>> list(gimme())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numpy.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.array(gimme())
array(<generator object at 0x28a1758>, dtype=object)
>>> numpy.array(list(gimme()))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In this instance, gimme() is the generator whose output I'd like to turn into an array. However, the array constructor does not iterate over the generator; it simply stores the generator itself. The behaviour I want is that of numpy.array(list(gimme())), but I don't want to pay the memory overhead of having both the intermediate list and the final array in memory at the same time. Is there a more space-efficient way?

Neuron
saffsd
  • This is an interesting issue. I came across this via `from numpy import *; print any(False for i in range(1))` - which shadows the built-in [`any()`](http://docs.python.org/library/functions.html#any) and produces the opposite result (as I know now). – moooeeeep Jun 07 '12 at 11:09
  • @moooeeeep that's terrible. If `numpy` can't (or doesn't want to) treat generators as Python does, at least it should raise an exception when it receives a generator as an argument. – max Dec 10 '12 at 00:57
  • @max I stepped on the exact same mine. Apparently this was raised [on the NumPy list](http://thread.gmane.org/gmane.comp.python.numeric.general/47681/focus=47702) (and [earlier](http://thread.gmane.org/gmane.comp.python.numeric.general/13197)), concluding that this will not be changed to raise an exception and one should always use namespaces. – alexei Jan 06 '14 at 21:51

5 Answers

248

One Google search behind this Stack Overflow result, I found that there is a numpy.fromiter(data, dtype, count). The default count=-1 takes all elements from the iterable. It requires the dtype to be set explicitly. In my case, this worked:

numpy.fromiter(something.generate(from_this_input), float)
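Applied to the question's gimme() generator, this might look as follows (a sketch, assuming a float result is acceptable; range replaces xrange on Python 3):

```python
import numpy as np

def gimme():
    for x in range(10):  # range, not xrange, on Python 3
        yield x

# dtype is mandatory; count is optional, but passing it lets fromiter
# preallocate the output instead of growing it on demand
arr = np.fromiter(gimme(), dtype=float, count=10)
print(arr)  # [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
```

This consumes the generator element by element, so no intermediate list is built.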

user2357112
dhill
  • How would you apply this to the question? `numpy.fromiter(gimme(), float, count=-1)` does not work. What does `something` stand for? – Matthias 009 Mar 27 '12 at 18:54
  • `something.generate` is just the name of the generator – Andrew McGregor May 09 '12 at 02:03
  • @Matthias009 `numpy.fromiter(gimme(), float, count=-1)` works for me. – moooeeeep Jun 07 '12 at 10:50
  • A thread explaining why `fromiter` only works on 1D arrays: http://mail.scipy.org/pipermail/numpy-discussion/2007-August/028898.html – max Dec 10 '12 at 01:04
  • FWIW, `count=-1` does not need to be specified, as it is the default. – askewchan Mar 27 '13 at 03:04
  • @askewchan Interestingly, `numpy` couldn't/didn't set a default `dtype=None`. If it did, it would have the same interface as `pd.array`. They could infer the dtype within the function by retrieving the first element outside of the loop iteration, and then set the array `dtype` to that first element's type, like `dtype = dtype or type(iterable_first_element)` – hobs Feb 23 '15 at 23:42
  • If you know the length of the iterable beforehand, specify the `count` to improve performance. This way it allocates the memory before filling it with values rather than resizing on demand (see the documentation of `numpy.fromiter`). – Eddy Aug 14 '17 at 15:48
  • This doesn't work for object arrays. It raises a ValueError saying "cannot create object arrays from iterator". – Fırat Kıyak Mar 03 '22 at 00:37
150

Numpy arrays require their length to be set explicitly at creation time, unlike Python lists. This is necessary so that space for each item can be allocated consecutively in memory. Consecutive allocation is the key feature of numpy arrays: this, combined with the native-code implementation, lets operations on them execute much more quickly than on regular lists.

Keeping this in mind, it is technically impossible to take a generator object and turn it into an array unless you either:

  1. can predict how many elements it will yield when run:

    my_array = numpy.empty(predict_length())
    for i, el in enumerate(gimme()): my_array[i] = el
    
  2. are willing to store its elements in an intermediate list:

    my_array = numpy.array(list(gimme()))
    
  3. can make two identical generators, run through the first one to find the total length, initialize the array, and then run through the generator again to find each element:

    length = sum(1 for el in gimme())
    my_array = numpy.empty(length)
    for i, el in enumerate(gimme()): my_array[i] = el
    

1 is probably what you're looking for. 2 is space inefficient, and 3 is time inefficient (you have to go through the generator twice).
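Option 1 can be sketched concretely with the question's gimme() generator (assuming the length is known up front, standing in for predict_length(); range replaces xrange on Python 3):

```python
import numpy as np

def gimme(n=10):
    for x in range(n):
        yield x

n = 10  # the predicted length
my_array = np.empty(n)  # one contiguous allocation, no intermediate list
for i, el in enumerate(gimme(n)):
    my_array[i] = el
print(my_array)  # [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
```

Only one element is held in Python-land at a time; the array itself is filled in place.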

Cristian Ciupitu
shsmurfy
  • 13
    The builtin `array.array` is a contiguous non-linked list, and you can simply `array.array('f', generator)`. To say say it's impossible is misleading. It's just dynamic allocation. – Cuadue Feb 19 '13 at 20:50
  • 1
    Why numpy.array doesn't do the memory allocation the same way as the builtin array.array, as Cuadue says. What is the tradeof? I ask because there is contiguous allocated memory in both examples. Or not? – jgomo3 Jul 12 '13 at 22:47
  • 3
    numpy assumes its array sizes to not change. It relies heavily on different views of the same chunk of memory, so allowing arrays to be expanded and reallocated would require an additional layer of indirection to enable views, for example. – joeln Aug 04 '13 at 05:08
  • 2
    Using empty is a bit faster. Since you are going to initialize the values any way, no need to do this twice. – Kaushik Ghose Mar 24 '15 at 11:40
  • See also @dhill's answer below which is faster than 1. – Bill Jul 13 '19 at 19:16
24

While numpy.fromiter() only builds a 1D array from a generator, you can create an N-D array from a generator with numpy.stack:

>>> mygen = (numpy.ones((5, 3)) for _ in range(10))
>>> x = numpy.stack(mygen)
>>> x.shape
(10, 5, 3)

It also works for 1D arrays:

>>> numpy.stack(2*i for i in range(10))
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

Note that numpy.stack internally consumes the generator and creates an intermediate list with arrays = [asanyarray(arr) for arr in arrays]. The implementation can be found here.

[WARNING] As pointed out by @Joseph Sheedy, NumPy 1.16 raises a FutureWarning that deprecates passing generators to this function.
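One way to sidestep that warning (a sketch; it accepts the intermediate list, which numpy.stack builds internally anyway) is to materialise the generator before stacking:

```python
import numpy as np

mygen = (np.ones((5, 3)) for _ in range(10))
# list() consumes the generator before stacking, avoiding the
# NumPy >= 1.16 FutureWarning about non-sequence iterables
x = np.stack(list(mygen))
print(x.shape)  # (10, 5, 3)
```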

Jean-Francois T.
mdeff
  • 1
    This is a neat solution, thanks for pointing out. But it seems to be quite a bit slower (in my application) than using `np.array(tuple(mygen))`. Here are the test results: `%timeit np.stack(permutations(range(10), 7)) 1 loop, best of 3: 1.9 s per loop` compared to `%timeit np.array(tuple(permutations(range(10), 7))) 1 loop, best of 3: 427 ms per loop` – Bill Sep 17 '17 at 16:45
  • 20
    This seems great and works for me. But with Numpy 1.16.1 I get this warning: `FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.` – Joseph Sheedy Mar 04 '19 at 20:19
  • Still 7x faster than the accepted answer, even if you put a `list()` around the generator, according to my comparison with a 20,000 x 20,000 matrix. – EliadL Jun 23 '21 at 20:52
  • In numpy 1.21.2 on an old imac, `time np.array( list( np.ndindex( 1000, 1000 ))) # CPU times: user 1.43 s` `time np.stack( list( np.ndindex( 1000, 1000 ))) # CPU times: user 3.42 s` – denis Oct 18 '21 at 14:39
6

Somewhat tangential, but if your generator is a list comprehension, you can use numpy.where to get your result more efficiently (I discovered this in my own code after seeing this post).
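As a sketch of what is meant (the comprehension below is hypothetical, not from the original post), a conditional comprehension over array elements can often be replaced by a vectorised numpy.where call:

```python
import numpy as np

data = np.arange(10)

# comprehension version: builds an intermediate Python list
via_list = np.array([x * 2 if x % 2 == 0 else -x for x in data])

# vectorised equivalent: no Python-level loop, no intermediate list
via_where = np.where(data % 2 == 0, data * 2, -data)
print(via_where)  # [ 0 -1  4 -3  8 -5 12 -7 16 -9]
```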

brandonscript
1

The vstack, hstack, and dstack functions can take as input generators that yield multi-dimensional arrays.
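A minimal sketch with vstack (note that on NumPy >= 1.16 the generator-deprecation warning also applies to the stacking helpers, so the generator is wrapped in a list here):

```python
import numpy as np

rows = (np.arange(3) * i for i in range(4))  # yields 1-D arrays of length 3
stacked = np.vstack(list(rows))              # stack them as the rows of a 2-D array
print(stacked.shape)  # (4, 3)
```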

Mike R