
I have a list of 2d arrays of different shapes: lst (see the example below).

I would like to concatenate them into a 3d array of shape (len(lst), maxx, maxy), where maxx is the maximum .shape[0] of all arrays, and maxy is the maximum .shape[1].

If an array's shape is smaller than (maxx, maxy), then this array should start at the top left corner, and all the missing values should be filled with some value of choice (e.g. 0 or np.nan).

Example:

lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

# maxx == 3
# maxy == 3

result = np.array([[[1, 2, 0],
                    [3, 4, 0],
                    [0, 0, 0]],

                    [[1, 2, 3],
                     [4, 5, 6],
                     [0, 0, 0]],

                    [[1, 2, 0],
                     [3, 4, 0],
                     [5, 6, 0]]])

Notes:

np.concatenate requires the shapes of all arrays to match (except along the concatenation axis).

This question is similar, but it only covers 1d arrays.
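For reference, the usual trick for the 1d case builds a boolean mask of the valid positions and scatters the concatenated values into a zero-filled result. A minimal sketch (lst_1d is hypothetical example data, fill value 0):

import numpy as np

lst_1d = [np.array([1, 2]), np.array([1, 2, 3]), np.array([1])]

lens = np.array([len(a) for a in lst_1d])
mask = np.arange(lens.max()) < lens[:, None]  # True where a value exists
res = np.zeros(mask.shape, dtype=int)
res[mask] = np.concatenate(lst_1d)            # fill the valid slots row by row
# res:
# [[1 2 0]
#  [1 2 3]
#  [1 0 0]]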


A sub-problem:

As a special case, you may assume that .shape[1] == maxy for all arrays, as in the example below (a sketch for this case follows it):

lst = [np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2, 3]]),
       np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])]
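Since every array already has maxy columns, only axis 0 needs filling, so the 1d mask trick extends row-wise. A sketch (assuming fill value 0 and the lst just above):

rows = np.array([a.shape[0] for a in lst])
maxy = lst[0].shape[1]                        # identical for all arrays by assumption
mask = np.arange(rows.max()) < rows[:, None]  # True where a source row exists
res = np.zeros((len(lst), rows.max(), maxy), dtype=int)
res[mask] = np.concatenate(lst)               # scatter the stacked rows in order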

Bonus (a hard question):

Can this be applied to more dimensions? E.g., when concatenating 3d arrays into a 4d array, all 3d arrays (rectangular parallelepipeds) would start at the same corner, and if an array's shape is too small, the missing values (up to the edges) would be filled with 0 or np.nan.

 


How can this be done at all? And how can it be done efficiently (potentially for thousands of arrays, each with thousands of elements)?

  • Maybe creating an array of the final shape and filling it somehow in a vectorized way?

  • Or converting all arrays into dataframes and concatenating them with pd.concat?

  • Maybe SciPy has some helpful functions for this?

Vladimir Fokow

3 Answers


A solution for arbitrary dimensions: non-vectorized, but it avoids the slow np.pad calls (~20x faster, benchmarked with the example lst * 10000).

import numpy as np

def fill_axis(lst):
    # Allocate a zero-filled result of the maximal extent along every axis
    shapes = np.array([arr.shape for arr in lst])
    res = np.zeros((len(shapes), *shapes.max(0)), int)
    # Copy each array into the top-left corner of its slot
    for x, arr in enumerate(lst):
        slices = [x]
        slices += (slice(None, shape) for shape in arr.shape)
        res[tuple(slices)] = arr
    return res

lst = lst * 10000

%timeit fill_axis(lst)
# 77.3 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# solution by @TimRoberts https://stackoverflow.com/a/73536898/14277722
def pad_fill(lst):
    maxx = max(x.shape[0] for x in lst)
    maxy = max(x.shape[1] for x in lst)

    res = [np.pad( k, [(0,maxx-k.shape[0]),(0,maxy-k.shape[1])] ) for k in lst]
    return np.array(res)

%timeit pad_fill(lst)
# 1.82 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.testing.assert_equal(pad_fill(lst), fill_axis(lst))

An example with stacked 4d arrays:

lst_4D = [np.arange(1*2*3*3).reshape(1,2,3,3),
          np.arange(2*3*2*2).reshape(2,3,2,2)]

fill_axis(lst_4D)

Output:

array([[[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8]],

         [[ 9, 10, 11],
          [12, 13, 14],
          [15, 16, 17]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]]],


        [[[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]]]],



       [[[[ 0,  1,  0],
          [ 2,  3,  0],
          [ 0,  0,  0]],

         [[ 4,  5,  0],
          [ 6,  7,  0],
          [ 0,  0,  0]],

         [[ 8,  9,  0],
          [10, 11,  0],
          [ 0,  0,  0]]],


        [[[12, 13,  0],
          [14, 15,  0],
          [ 0,  0,  0]],

         [[16, 17,  0],
          [18, 19,  0],
          [ 0,  0,  0]],

         [[20, 21,  0],
          [22, 23,  0],
          [ 0,  0,  0]]]]])

An adaptation of @divakar's solution for 2d arrays: in my benchmarks ~2x faster than the more general solution for larger lists, but harder to generalize to more dimensions.

def einsum_fill(lst):
    shapes = np.array([arr.shape for arr in lst])
    a = np.arange(shapes[:, 0].max()) < shapes[:, [0]]  # valid rows per array
    b = np.arange(shapes[:, 1].max()) < shapes[:, [1]]  # valid columns per array
    # Outer product of the axis masks marks every valid cell
    mask = np.einsum('ij,ik->ijk', a, b)
    res = np.zeros_like(mask, int)
    res[mask] = np.concatenate([arr.ravel() for arr in lst])
    return res
                 
%timeit einsum_fill(lst)
# 46.7 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

np.testing.assert_equal(einsum_fill(lst), fill_axis(lst))
Michael Szczesny

You can use numpy.pad to do this.

import numpy as np

lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

maxx = max(x.shape[0] for x in lst)
maxy = max(x.shape[1] for x in lst)

lst = [np.pad( k, [(0,maxx-k.shape[0]),(0,maxy-k.shape[1])] ) for k in lst]
print(lst)

Output:

[array([[1, 2, 0],
       [3, 4, 0],
       [0, 0, 0]]),
 array([[1, 2, 3],
       [4, 5, 6],
       [0, 0, 0]]),
 array([[1, 2, 0],
       [3, 4, 0],
       [5, 6, 0]])]

This process works with any number of dimensions; you'd just have to compute the per-axis maxima in a loop instead of the separate maxx/maxy expressions.
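A possible n-dimensional generalization along those lines (a sketch; pad_fill_nd is a hypothetical name, not part of the answer):

def pad_fill_nd(lst, fill=0):
    # Per-axis maxima, computed in a loop over the axes
    maxes = [max(a.shape[d] for a in lst) for d in range(lst[0].ndim)]
    # Pad each array at the end of every axis up to the maxima
    return np.array([np.pad(a, [(0, m - s) for m, s in zip(maxes, a.shape)],
                            constant_values=fill)
                     for a in lst])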

Tim Roberts
  • Thanks. `numpy.pad` is a very useful function here. And I can convert this output to a numpy array easily. But I am still interested to see the comparison of different methods in terms of speed on large arrays – Vladimir Fokow Aug 30 '22 at 03:53

The following code runs in ~44 ms, for example with lst * 10000:

def new_(lst):
    maxx = max(x.shape[0] for x in lst)
    maxy = max(x.shape[1] for x in lst)

    # Allocate the final shape once, then copy each array into its top-left corner
    arr = np.zeros((len(lst), maxx, maxy))
    for i in range(len(lst)):
        arr[i, :lst[i].shape[0], :lst[i].shape[1]] = lst[i]
    return arr

This can be accelerated with Numba:

import numba as nb

lst_nb = nb.typed.List(lst)

@nb.njit(nb.float64[:, :, :](nb.types.ListType(nb.int_[:, ::1])))
def numba_(lst):
    maxx = 0
    maxy = 0
    for x in lst:
        maxx = max(x.shape[0], maxx)
        maxy = max(x.shape[1], maxy)

    arr = np.zeros((len(lst), maxx, maxy))
    for i in range(len(lst)):
        arr[i, :lst[i].shape[0], :lst[i].shape[1]] = lst[i]
    return arr
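Minimal usage, assuming the typed list built above (converting with nb.typed.List(lst) once up front matters for the timings, as the comments below note):

res = numba_(lst_nb)  # call with the typed list, not the plain Python list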
Ali_Sh
  • Thanks! This works for 2d arrays, and if you calculate `maxx` and `maxy` the way TimRoberts does (`maxx = max(x.shape[0] for x in lst)`), your function is as fast as `einsum_fill` – Vladimir Fokow Aug 30 '22 at 13:42
  • @VladimirFokow, I didn't check that and don't know the reason. However, its performance is near `einsum_fill`'s. I think this function could be made faster, but I am busy for a day; I will come back and check. – Ali_Sh Aug 30 '22 at 17:02
  • I think the reason is that a list comprehension is faster than a for-loop – Vladimir Fokow Aug 30 '22 at 17:04
  • @VladimirFokow, you are right, using it makes it even faster than `einsum_fill`, e.g. with `lst = lst * 50000`. However, I will try to improve it further if I can. – Ali_Sh Aug 30 '22 at 17:14
  • Okay! Is a fully vectorized solution even possible (without **any** for loops or comprehensions)? – Vladimir Fokow Aug 30 '22 at 17:16
  • I don't know and I don't think so. – Ali_Sh Aug 30 '22 at 17:18
  • @VladimirFokow I have added a numba version with a little investigation; you can test it. – Ali_Sh Aug 31 '22 at 08:45
  • @VladimirFokow, it seems numba is much faster (>10 times) than the previous code. I didn't include `lst_nb = nb.typed.List(lst)` in the timings before, and it was at least 20 to 30 times faster as I remember. Please send me feedback if you test its performance. – Ali_Sh Aug 31 '22 at 18:51