
I have a list of 2d arrays of different shapes: lst (see the example below).

I would like to concatenate them into a 3d array of shape (len(lst), maxx, maxy), where maxx is the maximum .shape[0] of all arrays, and maxy is the maximum .shape[1].

If an array's shape is smaller than (maxx, maxy), then this array should start at the top left corner, and all the missing values should be filled with some value of choice (e.g. 0 or np.nan).

Example:

lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

# maxx == 3
# maxy == 3

result = np.array([[[1, 2, 0],
                    [3, 4, 0],
                    [0, 0, 0]],

                    [[1, 2, 3],
                     [4, 5, 6],
                     [0, 0, 0]],

                    [[1, 2, 0],
                     [3, 4, 0],
                     [5, 6, 0]]])

Notes:

np.concatenate requires the shapes of all arrays to match (except along the concatenation axis).

This question is similar, but it only covers 1d arrays.
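For reference, the usual trick for the 1d case builds a boolean mask of the valid positions and scatters the concatenated values into a zero-filled result. A minimal sketch (lst_1d is hypothetical example data, fill value 0):

import numpy as np

lst_1d = [np.array([1, 2]), np.array([1, 2, 3]), np.array([1])]

lens = np.array([len(a) for a in lst_1d])
mask = np.arange(lens.max()) < lens[:, None]  # True where a value exists
res = np.zeros(mask.shape, dtype=int)
res[mask] = np.concatenate(lst_1d)            # fill the valid slots row by row
# res:
# [[1 2 0]
#  [1 2 3]
#  [1 0 0]]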


A sub-problem:

As a special case, you may assume that .shape[1] == maxy for all arrays, as in the example below (a sketch for this case follows it):

lst = [np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2, 3]]),
       np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])]
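Since every array already has maxy columns, only axis 0 needs filling, so the 1d mask trick extends row-wise. A sketch (assuming fill value 0 and the lst just above):

rows = np.array([a.shape[0] for a in lst])
maxy = lst[0].shape[1]                        # identical for all arrays by assumption
mask = np.arange(rows.max()) < rows[:, None]  # True where a source row exists
res = np.zeros((len(lst), rows.max(), maxy), dtype=int)
res[mask] = np.concatenate(lst)               # scatter the stacked rows in order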

Bonus (a hard question):

Can this be applied to more dimensions? E.g., when concatenating 3d arrays into a 4d array, all 3d arrays (rectangular parallelepipeds) would start at the same corner, and if an array's shape is too small, the missing values (up to the edges) would be filled with 0 or np.nan.

 


How can this be done at all? And how can it be done efficiently (potentially for thousands of arrays, each with thousands of elements)?

  • Maybe creating an array of the final shape and filling it somehow in a vectorized way?

  • Or converting all arrays into dataframes and concatenating them with pd.concat?

  • Maybe SciPy has some helpful functions for this?

Vladimir Fokow

3 Answers


A solution for arbitrary dimensions: non-vectorized, but it avoids the slow np.pad calls (~20x faster, benchmarked with the example lst * 10000).

import numpy as np

def fill_axis(lst):
    # Allocate a zero-filled result of the maximal extent along every axis
    shapes = np.array([arr.shape for arr in lst])
    res = np.zeros((len(shapes), *shapes.max(0)), int)
    # Copy each array into the top-left corner of its slot
    for x, arr in enumerate(lst):
        slices = [x]
        slices += (slice(None, shape) for shape in arr.shape)
        res[tuple(slices)] = arr
    return res

lst = lst * 10000

%timeit fill_axis(lst)
# 77.3 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# solution by @TimRoberts https://stackoverflow.com/a/73536898/14277722
def pad_fill(lst):
    maxx = max(x.shape[0] for x in lst)
    maxy = max(x.shape[1] for x in lst)

    res = [np.pad( k, [(0,maxx-k.shape[0]),(0,maxy-k.shape[1])] ) for k in lst]
    return np.array(res)

%timeit pad_fill(lst)
# 1.82 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.testing.assert_equal(pad_fill(lst), fill_axis(lst))

An example with stacked 4d arrays:

lst_4D = [np.arange(1*2*3*3).reshape(1,2,3,3),
          np.arange(2*3*2*2).reshape(2,3,2,2)]

fill_axis(lst_4D)

Output:

array([[[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8]],

         [[ 9, 10, 11],
          [12, 13, 14],
          [15, 16, 17]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]]],


        [[[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]]]],



       [[[[ 0,  1,  0],
          [ 2,  3,  0],
          [ 0,  0,  0]],

         [[ 4,  5,  0],
          [ 6,  7,  0],
          [ 0,  0,  0]],

         [[ 8,  9,  0],
          [10, 11,  0],
          [ 0,  0,  0]]],


        [[[12, 13,  0],
          [14, 15,  0],
          [ 0,  0,  0]],

         [[16, 17,  0],
          [18, 19,  0],
          [ 0,  0,  0]],

         [[20, 21,  0],
          [22, 23,  0],
          [ 0,  0,  0]]]]])

An adaptation of @divakar's solution for 2d arrays: in my benchmarks ~2x faster than the more general solution for larger lists, but harder to generalize to more dimensions.

def einsum_fill(lst):
    shapes = np.array([arr.shape for arr in lst])
    a = np.arange(shapes[:, 0].max()) < shapes[:, [0]]  # valid rows per array
    b = np.arange(shapes[:, 1].max()) < shapes[:, [1]]  # valid columns per array
    # Outer product of the axis masks marks every valid cell
    mask = np.einsum('ij,ik->ijk', a, b)
    res = np.zeros_like(mask, int)
    res[mask] = np.concatenate([arr.ravel() for arr in lst])
    return res
                 
%timeit einsum_fill(lst)
# 46.7 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

np.testing.assert_equal(einsum_fill(lst), fill_axis(lst))
Michael Szczesny

You can use numpy.pad to do this.

import numpy as np

lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

maxx = max(x.shape[0] for x in lst)
maxy = max(x.shape[1] for x in lst)

lst = [np.pad( k, [(0,maxx-k.shape[0]),(0,maxy-k.shape[1])] ) for k in lst]
print(lst)

Output:

[array([[1, 2, 0],
       [3, 4, 0],
       [0, 0, 0]]),
 array([[1, 2, 3],
       [4, 5, 6],
       [0, 0, 0]]),
 array([[1, 2, 0],
       [3, 4, 0],
       [5, 6, 0]])]

This process works with any number of dimensions; you'd just have to compute the per-axis maxima in a loop instead of the separate maxx/maxy expressions.
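A possible n-dimensional generalization along those lines (a sketch; pad_fill_nd is a hypothetical name, not part of the answer):

def pad_fill_nd(lst, fill=0):
    # Per-axis maxima, computed in a loop over the axes
    maxes = [max(a.shape[d] for a in lst) for d in range(lst[0].ndim)]
    # Pad each array at the end of every axis up to the maxima
    return np.array([np.pad(a, [(0, m - s) for m, s in zip(maxes, a.shape)],
                            constant_values=fill)
                     for a in lst])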

Tim Roberts
  • Thanks. `numpy.pad` is a very useful function here. And I can convert this output to a numpy array easily. But I am still interested to see the comparison of different methods in terms of speed on large arrays – Vladimir Fokow Aug 30 '22 at 03:53

The following code runs in ~44 ms, for example with lst * 10000:

def new_(lst):
    maxx = max(x.shape[0] for x in lst)
    maxy = max(x.shape[1] for x in lst)

    # Allocate the final shape once, then copy each array into its top-left corner
    arr = np.zeros((len(lst), maxx, maxy))
    for i in range(len(lst)):
        arr[i, :lst[i].shape[0], :lst[i].shape[1]] = lst[i]
    return arr

This can be accelerated with Numba:

import numba as nb

lst_nb = nb.typed.List(lst)

@nb.njit(nb.float64[:, :, :](nb.types.ListType(nb.int_[:, ::1])))
def numba_(lst):
    maxx = 0
    maxy = 0
    for x in lst:
        maxx = max(x.shape[0], maxx)
        maxy = max(x.shape[1], maxy)

    arr = np.zeros((len(lst), maxx, maxy))
    for i in range(len(lst)):
        arr[i, :lst[i].shape[0], :lst[i].shape[1]] = lst[i]
    return arr
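Minimal usage, assuming the typed list built above (converting with nb.typed.List(lst) once up front matters for the timings, as the comments below note):

res = numba_(lst_nb)  # call with the typed list, not the plain Python list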
Ali_Sh
  • Thanks! This works for 2d arrays, and if you calculate `maxx` and `maxy` the way TimRoberts does (`maxx = max(x.shape[0] for x in lst)`), your function is as fast as `einsum_fill` – Vladimir Fokow Aug 30 '22 at 13:42
  • @VladimirFokow, I didn't check that and don't know the reason. However, its performance is near `einsum_fill`'s. I think this function could be made faster, but I am busy for a day; I will come back and check. – Ali_Sh Aug 30 '22 at 17:02
  • I think the reason is that a list comprehension is faster than a for-loop – Vladimir Fokow Aug 30 '22 at 17:04
  • @VladimirFokow, you are right, using it makes it even faster than `einsum_fill`, e.g. with `lst = lst * 50000`. However, I will try to improve it further if I can. – Ali_Sh Aug 30 '22 at 17:14
  • Okay! Is a fully vectorized solution even possible (without **any** for loops or comprehensions)? – Vladimir Fokow Aug 30 '22 at 17:16
  • I don't know and I don't think so. – Ali_Sh Aug 30 '22 at 17:18
  • @VladimirFokow I have added a numba version with a little investigation; you can test it. – Ali_Sh Aug 31 '22 at 08:45
  • @VladimirFokow, it seems numba is much faster (>10 times) than the previous code. I didn't include `lst_nb = nb.typed.List(lst)` in the timings before, and it was at least 20 to 30 times faster as I remember. Please send me feedback if you test its performance. – Ali_Sh Aug 31 '22 at 18:51