
I want an array X of shape (n_samples, n_cols, n_rows, n_channels), and an array y with shape (n_samples, n_cols, n_rows, n_channels).

I have tried

import glob
from skimage import io, color
import numpy as np

def loadfunc(files):
    for fl in files:
        img = color.rgb2lab(io.imread(fl))
        L = img[:,:,:1]
        ab = img[:,:,1:]
        yield L,ab

X,y = np.fromiter(loadfunc(glob.glob('path/to/images/*.png')),float)

and I get this error: ValueError: setting an array element with a sequence.

I figure this must be a fairly common operation (any time someone wants to load image data into a numpy array), so there must be something I'm missing?

BigBoy1337
  • sorry, added import statements – BigBoy1337 May 22 '16 at 21:23
  • [`np.fromiter`](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.fromiter.html) takes an iterable that produces the elements of an array, *one array*. So your code is actually trying to make an array of 2 element tuples.... I'm trying to find a way to do what you are trying now... – Tadhg McDonald-Jensen May 22 '16 at 21:23
  • Do you want `X` to contain the `L` arrays, and `y` the `ab` arrays? – unutbu May 22 '16 at 21:47
  • @unutbu yes thats the goal – BigBoy1337 May 22 '16 at 21:49
  • **for any future viewer** that is trying to make multidimensional arrays with `fromiter` please refer to [How to construct an np.array with fromiter](http://stackoverflow.com/questions/12467743/how-to-construct-an-np-array-with-fromiter) (assuming the answer here doesn't help you) – Tadhg McDonald-Jensen May 22 '16 at 23:38
  • The compound `dtype` described in one of those answers will handle the tuple of arrays produced by `loadfunc`. – hpaulj May 23 '16 at 03:23

3 Answers


numpy.fromiter does not support creating two arrays simultaneously and returning them as a tuple (to be unpacked into X, y). There may be a way to do this in numpy, but not to my knowledge; you may need to split the iterator with itertools.tee instead:

# the built-in map in Python 3 returns a lazy iterator;
# uncomment the imap import on the next line if you are using Python 2
from itertools import tee #, imap as map

from operator import itemgetter

iter_a, iter_b = tee(loadfunc(glob.glob('path/to/images/*.png')))

X = np.fromiter(map(itemgetter(0),iter_a), float) #array from the first elements
y = np.fromiter(map(itemgetter(1),iter_b), float) #array from the second elements
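
As the comments below note, np.fromiter with a plain float dtype expects scalar values, so it can still raise the same ValueError when each item is itself an array. A minimal sketch of a workaround (assuming every image has the same shape) collects the split iterators into ordinary lists and stacks them instead of using fromiter:

X = np.stack(list(map(itemgetter(0), iter_a)))  # (n_samples, h, w, 1)
y = np.stack(list(map(itemgetter(1), iter_b)))  # (n_samples, h, w, 2)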
Tadhg McDonald-Jensen
  • hmm so would loadfunc still yield X and y in each iteration like the function I have in my question? – BigBoy1337 May 22 '16 at 21:36
  • yes, I was making an assumption that `loadfunc` might actually be a lot bigger / complicated so I suggested an edit that doesn't require rewriting it at all, you could also rework `loadfunc` to only produce one or the other depending on an additional argument. And then call it twice. – Tadhg McDonald-Jensen May 22 '16 at 21:39
  • hmm can you run this successfully? I am still getting this when I try and run this: ValueError: setting an array element with a sequence. I believe it is on the last 2 lines for setting X and y – BigBoy1337 May 22 '16 at 22:45
  • I wasn't because `color.rgb2lab(io.imread(fl))` kept failing with the images I was testing but I found one that worked and yes I got the same error, actually I got the same error without any special stuff, just `yield L` and `X = np.fromiter(iter_a, float)` so I tried searching "numpy fromiter 2d" and found [this very relevant question](http://stackoverflow.com/questions/12467743/how-to-construct-an-np-array-with-fromiter) – Tadhg McDonald-Jensen May 22 '16 at 23:02

np.fromiter requires that you state the dtype. If you use dtype=float, then each value from the iterable must be a float. If you yield single NumPy arrays from loadfunc, you could use their flat attribute to obtain iterators over the flattened array values, which could be chained together with itertools.chain.from_iterable and then passed to np.fromiter:

def loadfunc(files):
    for fl in files:
        img = skcolor.rgb2lab(skio.imread(fl)[..., :3])
        yield img

arrs = loadfunc(files)
Z = np.fromiter(IT.chain.from_iterable([arr.flat for arr in arrs]), dtype=float)

Since np.fromiter returns a 1D array, you would then need to reshape it:

Z = Z.reshape(len(files), h, w, n)

Note that this relies on each image having the same shape. Finally, to load the L values into X and the ab values into y:

X = Z[..., :1]
y = Z[..., 1:]

import glob
import itertools as IT
import numpy as np
import skimage.io as skio
import skimage.color as skcolor

def loadfunc(files):
    for fl in files:
        img = skcolor.rgb2lab(skio.imread(fl)[..., :3])
        yield img

files = glob.glob('path/to/images/*.png')
arrs = loadfunc(files)
first = next(arrs)
h, w, n = first.shape

Z = np.fromiter(IT.chain.from_iterable(
    [first.flat] + [arr.flat for arr in arrs]), dtype=float)
Z = Z.reshape(len(files), h, w, n)
X = Z[..., :1]
y = Z[..., 1:]

Regarding the question in the comments:

If I wanted to do extra processing to L and ab, where would I do that?

I believe in separating the loading from the processing of the data. By keeping the two functions distinct, you leave open the possibility of passing different data from different sources to the same processing function. If you put both the loading and the processing of the data (such as a KNN classification of the ab values) into loadfunc then there is no way to reuse the KNN classification code without loading the data from files.
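
For example (a hypothetical sketch; prepare_ab_features is just a placeholder name, not anything from the code above), the processing could live in its own function that only sees the already-loaded arrays:

def prepare_ab_features(ab):
    # reshape the ab values into an (n_pixels, 2) table, the kind of
    # input a KNN classifier (as discussed in the comments) would take
    return ab.reshape(-1, 2)

features = prepare_ab_features(y)  # y as loaded above

Because it never touches the files, the same function can be reused with data from any other source.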


If you allow us to change the order of the axes from (n_samples, n_cols, n_rows, n_channels) to (n_cols, n_rows, n_channels, n_samples), then the code could be simplified using np.stack:

import glob
import numpy as np
import skimage.io as skio
import skimage.color as skcolor

def loadfunc(files):
    for fl in files:
        img = skcolor.rgb2lab(skio.imread(fl)[..., :3])
        yield img

files = glob.glob('path/to/images/*.png')
Z = np.stack(loadfunc(files), axis=-1)
X = Z[..., :1, :]
Y = Z[..., 1:, :]

This code is simpler and therefore preferable to the code (using np.fromiter) above.
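
If keeping the question's (n_samples, n_cols, n_rows, n_channels) order matters, a hedged variant of the same idea stacks along a new first axis instead (materializing the generator in a list first):

Z = np.stack(list(loadfunc(files)), axis=0)  # (n_samples, h, w, 3)
X = Z[..., :1]
y = Z[..., 1:]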

unutbu
  • If I wanted to do extra processing to L and ab, where would I do that? For instance, I would like to perform a KNN classification on the ab values. Computationally, does it make more sense to do what you suggested and then pull Y apart and form an additional array from the classifications? Or does it make more sense to do that before Y is created? If I wanted to do this, where would I - something like @Tadhg McDonald-Jensen suggests below? – BigBoy1337 May 22 '16 at 22:37
  • I assume the above comment was referring to [this comment](http://stackoverflow.com/questions/37379696/how-do-i-load-a-list-of-images-into-an-array-for-each-channel-in-numpy/37379875?noredirect=1#comment62270895_37379875) which is actually _above_ here in my browser. – Tadhg McDonald-Jensen May 22 '16 at 22:52
  • @BigBoy1337: I think it is preferable to separate the loading from the processing of the data. I add a few words explaining why, above. – unutbu May 22 '16 at 23:24

Usually when we create an array with iteration, we either collect the values in a list and create the array from that, or we allocate an empty array and assign values to its slots.
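
The list-collection route, applied to the question's loadfunc, might look roughly like this (a sketch, assuming all the images share one shape):

pairs = list(loadfunc(glob.glob('path/to/images/*.png')))
X = np.array([L for L, ab in pairs])   # (n_samples, h, w, 1)
y = np.array([ab for L, ab in pairs])  # (n_samples, h, w, 2)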

Here's a way of doing the assignment, where the generator returns a tuple of arrays:

def mk_array(N):
    for i in range(N):
        img=np.ones((2,3,3),int)
        L=img[:,:,:1]*i
        ab=img[:,:,1:].astype(float)*i/10
        yield L,ab

I made one an array of ints, the other an array of floats. That reduces the temptation to concatenate them into one.

In [157]: g=mk_array(4)

In [158]: for i,v in enumerate(g):
    print(v[0].shape,v[1].shape)
   .....:     
(2, 3, 1) (2, 3, 2)
(2, 3, 1) (2, 3, 2)
(2, 3, 1) (2, 3, 2)
(2, 3, 1) (2, 3, 2)

Let's allocate target arrays of the right shape; here I put the iteration axis 3rd, but it could be anywhere:

In [159]: L, ab = np.empty((2,3,4,1),int), np.empty((2,3,4,2),float)

In [160]: for i,v in enumerate(g):
    L[...,i,:], ab[...,i,:] = v

My guess is that this is as fast as any fromiter or stack alternative. And when the components are generated by reading from files, that step is bound to be the most expensive, more so than the iteration mechanism or array copies.

================

If the iterator returns a tuple of scalars, we can use fromiter:

def mk_array1(N):
    for i in range(N):
        img=np.ones((2,3,3),int)
        L=img[:,:,:1]*i
        ab=img[:,:,1:].astype(float)*i/10
        for a,b in zip(L.ravel(),ab.ravel()):
            yield a,b

In [184]: g=mk_array1(2)

In [185]: V=np.fromiter(g,dtype=('i,f'))

producing a 1d structured array:

In [186]: V
Out[186]: 
array([(0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0),
       (1, 0.10000000149011612), (1, 0.10000000149011612),
       (1, 0.10000000149011612), (1, 0.10000000149011612),
       (1, 0.10000000149011612), (1, 0.10000000149011612)], 
      dtype=[('f0', '<i4'), ('f1', '<f4')])

which can be reshaped, and arrays separated by field name:

In [187]: V['f0']
Out[187]: array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=int32)

In [188]: V.reshape(2,2,3)['f0']
Out[188]: 
array([[[0, 0, 0],
        [0, 0, 0]],

       [[1, 1, 1],
        [1, 1, 1]]], dtype=int32)

In [189]: V.reshape(2,2,3)['f1']
Out[189]: 
array([[[ 0. ,  0. ,  0. ],
        [ 0. ,  0. ,  0. ]],

       [[ 0.1,  0.1,  0.1],
        [ 0.1,  0.1,  0.1]]], dtype=float32)

================

What if I define a more complex dtype, one where each field has an array:

In [200]: dt=np.dtype([('f0',int,(2,3,1)),('f1',float,(2,3,2))])

In [201]: g=mk_array(2)   # the original generator

In [202]: V=np.fromiter(g,dtype=dt)

In [203]: V['f0']
Out[203]: 
array([[[[0],
         [0],
         [0]],
        ....

        [[1],
         [1],
         [1]]]])

In [204]: _.shape
Out[204]: (2, 2, 3, 1)

This use of a compound dtype with fromiter is also described in https://stackoverflow.com/a/12473478/901925

This is, in effect, a variation on the usual way of building a structured array: from a list of tuples. More than once I've used the expression:

np.array([tuple(x)  for x in something], dtype=dt)
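
As a sketch, the same compound dtype works in that list-of-tuples form with the mk_array generator from above:

dt = np.dtype([('f0',int,(2,3,1)),('f1',float,(2,3,2))])
V = np.array([tuple(v) for v in mk_array(2)], dtype=dt)
# V['f0'].shape is (2, 2, 3, 1); V['f1'].shape is (2, 2, 3, 2)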

In sum, we can time two methods of creating the 2 arrays:

def foo1(N):
    g = mk_array(N)                                       
    L, ab = np.empty((N,2,3,1),int), np.empty((N,2,3,2),float)
    for i,v in enumerate(g):
        L[i,...], ab[i,...] = v
    return L, ab

def foo2(N):
    dt=np.dtype([('f0',int,(2,3,1)),('f1',float,(2,3,2))])
    g = mk_array(N)
    V=np.fromiter(g, dtype=dt)
    return V['f0'], V['f1']

For a wide range of N these 2 functions take nearly the same time. I have to push run times to 1 s before I start seeing an advantage for foo1.
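
A rough sketch of how such a comparison might be run (the N values and repeat counts here are only illustrative):

import timeit
for N in (10, 100, 1000):
    t1 = timeit.timeit(lambda: foo1(N), number=10)
    t2 = timeit.timeit(lambda: foo2(N), number=10)
    print(N, t1, t2)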

hpaulj