How to construct an np.array with fromiter

Question

I'm trying to construct an np.array by sampling from a Python generator that yields one row of the array per call to next. Here is some sample code:

import numpy as np
data = np.eye(9)
labels = np.array([0,0,0,1,1,1,2,2,2])

def extract_one_class(X,labels,y):
    """ Take an array of data X, a column vector array of labels, and one particular label y.  Return an array of all instances in X that have label y """

    return X[np.nonzero(labels[:] == y)[0],:]

def generate_points(data, labels, size):
    """ Generate and return 'size' pairs of points drawn from different classes """

    label_alphabet = np.unique(labels)
    assert(label_alphabet.size > 1)

    for useless in xrange(size):
        shuffle(label_alphabet)
        first_class = extract_one_class(data,labels,label_alphabet[0])
        second_class = extract_one_class(data,labels,label_alphabet[1])
        pair = np.hstack((first_class[randint(0,first_class.shape[0]),:],second_class[randint(0,second_class.shape[0]),:]))
        yield pair

points = np.fromiter(generate_points(data,labels,5),dtype = np.dtype('f8',(2*data.shape[1],1)))

The extract_one_class function returns a subset of data: all data points belonging to one class label. I would like to have points be an np.array with shape = (size,data.shape[1]). Currently the code snippet above returns an error:

ValueError: setting an array element with a sequence.

The documentation of fromiter says it returns a one-dimensional array. Yet others have used fromiter to construct record arrays in numpy before (e.g. http://iam.al/post/21116450281/numpy-is-my-homeboy).

Am I off the mark in assuming I can generate an array in this fashion? Or is my numpy just not quite right?

Pierre GM · Answer 1 · 2012-09-18T09:43:52.853

As you've noticed, the documentation of np.fromiter explains that the function creates a 1D array. You won't be able to create a 2D array that way, and @unutbu's method of returning a 1D array that you reshape afterwards is a safe bet.

However, you can indeed create structured arrays using fromiter, as illustrated by:

>>> import itertools
>>> a = itertools.izip((1,2,3),(10,20,30))
>>> r = np.fromiter(a,dtype=[('',int),('',int)])
>>> r
array([(1, 10), (2, 20), (3, 30)],
      dtype=[('f0', '<i8'), ('f1', '<i8')])

Note, however, that r.shape is (3,): r is really just a 1D array of records, each record being composed of two integers. Because all the fields have the same dtype, we can take a view of r as a 2D array:

>>> r.view((int,2))
array([[ 1, 10],
       [ 2, 20],
       [ 3, 30]])

So, yes, you could try to use np.fromiter with a dtype like [('',int)]*data.shape[1]: you'll get a 1D array of length size, which you can then view as (int, data.shape[1]). You can use floats instead of ints; the important part is that all fields have the same dtype.
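To make that concrete with floats, here is a minimal Python 3 sketch. The generator rows() and the width n_cols are made up for illustration (n_cols stands in for data.shape[1]); the only thing that matters is that each record comes out as a tuple.

```python
import numpy as np

n_cols = 4  # hypothetical row width, standing in for data.shape[1]

def rows():
    # Hypothetical generator: each record must be a tuple, not a list or array
    for i in range(3):
        yield tuple(float(i * n_cols + j) for j in range(n_cols))

# One anonymous float field per column; NumPy names them f0, f1, ...
record_dtype = [('', np.float64)] * n_cols
r = np.fromiter(rows(), dtype=record_dtype)

# All fields share one dtype, so the records can be viewed as a plain 2D array
points = r.view((np.float64, n_cols))
print(r.shape, points.shape)
```

The view shares memory with r, so no copy is made.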

If you really want it, you can use some fairly complex dtype. Consider for example

r = np.fromiter(((_,) for _ in a),dtype=[('',(int,2))])

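Here, you get a 1D structured array with 1 field, the field consisting of an array of 2 integers. Note the use of (_,) to make sure that each record is passed as a tuple (else np.fromiter chokes). Note also that a has to be a fresh iterator here, since the izip object above was already consumed by the first call. But do you need that complexity?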
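For reference, a self-contained Python 3 version of the same trick might look like the following. It uses the built-in zip in place of itertools.izip, and the name pairs is only illustrative.

```python
import numpy as np

pairs = zip((1, 2, 3), (10, 20, 30))

# Each record is a 1-tuple whose only element fills the (int, 2) subarray field
r = np.fromiter(((p,) for p in pairs), dtype=[('', (int, 2))])

# Indexing the auto-named field 'f0' yields an ordinary (3, 2) integer array
arr = r['f0']
print(arr)
```

Pulling the field out with r['f0'] is what turns the 1D record array back into the 2D array you are after.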

Note also that since you know the length of the array beforehand (it's size), you should pass the optional count argument of np.fromiter for more efficiency.
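The keyword is spelled count in the np.fromiter signature. A small sketch, with a made-up generator values() standing in for yours:

```python
import numpy as np

size = 5

def values():
    # Hypothetical stand-in for a generator of scalars
    for i in range(size):
        yield i * 0.5

# count=size lets NumPy allocate the output once instead of growing it on demand
arr = np.fromiter(values(), dtype=float, count=size)
print(arr)
```

If the iterator yields fewer than count items, np.fromiter raises a ValueError, so this also acts as a cheap sanity check.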

He has a secondary problem of returning a 2D array, but `np.frombuffer` needs single elements (possible 1D arrays when it gets packed into a recarray). So he must modify his iterator with an inner loop. It might be nicer to just construct a list and use `concatenate`. — seberg, Sep 18 '12 at 09:03
@seberg You meant `fromiter`? But yes, in that case, it'd be easier to use `concatenate` on a list. As a side note, `recarray` is usually used only to refer to structured arrays that can access fields as attributes... — Pierre GM, Sep 18 '12 at 09:10

score 5 · Accepted Answer · answered Sep 17 '12 at 22:24

You could modify generate_points to yield single floats instead of np.arrays, use np.fromiter to form a 1D array, and then use .reshape(size, -1) to make it a 2D array.

size = 5
points = np.fromiter(
    generate_points(data,labels,size), dtype=float).reshape(size, -1)

answered Sep 17 '12 at 22:24

unutbu

I don't think there is a way to modify generate_points to yield single floats. The idea is to sample from a data set where the units are N dimensional vectors of floats. If I were to sample one dimension at a time, I'd have to still use a for loop in the calling code to see when a whole row had been sampled. The goal for me is to express this as one statement iterating over what's returned from the generator. – LeeZamparo Sep 18 '12 at 22:07
An N-dimensional array can be "converted" into a 1D array by using the `ravel()` method. You can then re-convert it into an N-dimensional array using the `reshape()` method. So in `generate_points` you could `for val in pair.ravel(): yield val`, pump that into `np.fromiter`, and use `reshape(size, -1)` on the other side to obtain your desired 2D array. I would post explicit code to show you what I mean, but the code you posted doesn't actually run (it raises an IndexError: invalid index) and I'm not seeing how to easily fix it. – unutbu Sep 18 '12 at 22:24
I've fixed the IndexError bug, and your ravel suggestion works well, thanks. – LeeZamparo Sep 19 '12 at 21:50
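For anyone landing here later, the ravel approach from the comments could look roughly like this in Python 3. This is a sketch, not the asker's final code: the name generate_values and the use of np.random.default_rng are my own choices, and rng.integers has an exclusive upper bound, which sidesteps the off-by-one index error mentioned above.

```python
import numpy as np

data = np.eye(9)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

def generate_values(data, labels, size, rng):
    """Yield the floats of `size` cross-class pairs, one scalar at a time."""
    label_alphabet = np.unique(labels)
    assert label_alphabet.size > 1
    for _ in range(size):
        # Pick two distinct labels, then one random row from each class
        first, second = rng.choice(label_alphabet, size=2, replace=False)
        class_a = data[labels == first]
        class_b = data[labels == second]
        row_a = class_a[rng.integers(class_a.shape[0])]  # upper bound excluded
        row_b = class_b[rng.integers(class_b.shape[0])]
        pair = np.hstack((row_a, row_b))
        # Flatten the pair so np.fromiter only ever sees scalars
        yield from pair.ravel()

size = 5
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
flat = np.fromiter(generate_values(data, labels, size, rng), dtype=float,
                   count=size * 2 * data.shape[1])
points = flat.reshape(size, -1)
print(points.shape)
```

Each row of points holds one pair, i.e. 2 * data.shape[1] values.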

summentier · Answer 3 · 2015-06-04T14:56:47.370

Following some suggestions here, I came up with a fairly general drop-in replacement for numpy.fromiter() that satisfies the requirements of the OP:

import numpy as np
def fromiter(iterator, dtype, *shape):
    """Generalises `numpy.fromiter()` to multi-dimensional arrays.

    Instead of the number of elements, the parameter `shape` has to be given,
    which contains the shape of the output array. The first dimension may be
    `-1`, in which case it is inferred from the iterator.
    """
    res_shape = shape[1:]
    if not res_shape:  # Fallback to the "normal" fromiter in the 1-D case           
        return np.fromiter(iterator, dtype, shape[0])

    # This wrapping of the iterator is necessary because when used with the
    # field trick, np.fromiter does not enforce consistency of the shapes
    # returned with the '_' field and silently cuts additional elements.
    def shape_checker(iterator, res_shape):
        for value in iterator:
            if value.shape != res_shape:
                raise ValueError("shape of returned object %s does not match"
                                 " given shape %s" % (value.shape, res_shape))
            yield value,

    return np.fromiter(shape_checker(iterator, res_shape),
                       [("_", dtype, res_shape)], shape[0])["_"]
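As a side note, newer NumPy releases (1.23 and later, going by the np.fromiter documentation) accept a subarray dtype directly, which may make the wrapper unnecessary there. A minimal sketch, assuming such a version is installed; the generator rows() is hypothetical:

```python
import numpy as np

def rows(n, width):
    # Hypothetical generator yielding one fixed-width row per step
    for i in range(n):
        yield np.full(width, float(i))

# A subarray dtype of shape (4,) makes every yielded item one row of the result
points = np.fromiter(rows(5, 4), dtype=np.dtype((float, 4)), count=5)
print(points.shape)
```

On older NumPy versions this call is expected to fail, so the wrapper above remains the portable option.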
