
The functionality I am looking for looks something like this:

import numpy as np

data = np.array([[1, 2, 3, 4],
                 [2, 3, 1],
                 [5, 5, 5, 5],
                 [1, 1]])

result = fix(data)
print(result)

[[ 1.  2.  3.  4.]
 [ 2.  3.  1.  0.]
 [ 5.  5.  5.  5.]
 [ 1.  1.  0.  0.]]

The data arrays I'm working with are really large, so I would really appreciate the most efficient solution.

Edit: The data is read in from disk as a Python list of lists.

user2909415
  • Simply add the data type to the array function call, `np.array(..., dtype=np.float64)`, or use `loadtxt`, `savetxt` from numpy. – nickpapior Aug 16 '15 at 17:33
  • 1
  • @zeroth I have tried that and got `ValueError: setting an array element with a sequence`. Could you explain more? – user2909415 Aug 16 '15 at 17:36
  • 1
  • Is it likely to be a sparse matrix with most entries zero? Can it fit in memory as a dense matrix? – musically_ut Aug 16 '15 at 17:54
  • @musically_ut No it isn't sparse. Often there are only 1-3 elements missing at the ends. – user2909415 Aug 16 '15 at 18:10
  • @user2909415 You should add that information to the question. And while you are at it, do you know the size (both height and width) of the matrix before you read in the file (for preallocation)? If you know at least the width, then perhaps tweaking the file to contain the correct number of entries and using `np.loadtxt` will be the fastest option. – musically_ut Aug 16 '15 at 18:13
  • @musically_ut By width, do you mean the maximum length of a row in the data? Such as in the example it would be 4. – user2909415 Aug 16 '15 at 18:16
  • Yes, I meant the number of columns in the matrix. – musically_ut Aug 16 '15 at 18:21
  • 1
  • This is relevant: http://stackoverflow.com/questions/27890052/convert-and-pad-a-list-to-numpy-array – Alex Riley Aug 16 '15 at 19:05
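
A minimal sketch of the pad-on-read idea discussed in the comments above, assuming the number of columns is known up front and the file is whitespace-delimited; the file name `data.txt` and the helper `load_padded` are hypothetical:

import numpy as np

def load_padded(path, ncols, dtype=float):
    # Read the file line by line, padding short rows with '0'
    # so that every row ends up with exactly `ncols` entries.
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            rows.append(fields + ['0'] * (ncols - len(fields)))
    return np.asarray(rows, dtype=dtype)

# e.g. padded = load_padded('data.txt', ncols=4)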

5 Answers


This could be one approach -

def numpy_fillna(data):
    # Get lengths of each row of data
    lens = np.array([len(i) for i in data])

    # Mask of valid places in each row
    mask = np.arange(lens.max()) < lens[:,None]

    # Setup output array and put elements from data into masked positions
    out = np.zeros(mask.shape, dtype=data.dtype)
    out[mask] = np.concatenate(data)
    return out

Sample input, output -

In [222]: # Input object dtype array
     ...: data = np.array([[1, 2, 3, 4],
     ...:                  [2, 3, 1],
     ...:                  [5, 5, 5, 5, 8 ,9 ,5],
     ...:                  [1, 1]])

In [223]: numpy_fillna(data)
Out[223]: 
array([[1, 2, 3, 4, 0, 0, 0],
       [2, 3, 1, 0, 0, 0, 0],
       [5, 5, 5, 5, 8, 9, 5],
       [1, 1, 0, 0, 0, 0, 0]], dtype=object)
Divakar
  • I think `lens.size` should be `lens.max()` - in your answer these are equal to make a square matrix. But try with a ragged row longer than the number of rows and you will get an error. – Neil Slater Nov 14 '16 at 12:58
  • The accepted answer is almost correct. I assume it was an oversight, but `mask = np.arange(lens.size) < lens[:,None]` should actually be `mask = np.arange(max(lens)) < lens[:,None]`. The accepted answer happens to work for the tested input because `lens.size == max(lens)`; if that's not the case, it no longer works... – Jerry Londergaard Oct 09 '16 at 23:32
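
A follow-up note on the answer above: since the question says the data arrives as a plain Python list of lists, keep in mind that `numpy_fillna` as written reads `data.dtype` and therefore expects a NumPy object array. A minimal sketch of the same masking idea adapted to take a plain list of lists and an explicit output dtype (the helper name `fillna_from_lists` is made up here):

import numpy as np

def fillna_from_lists(rows, dtype=float):
    # Row lengths; the longest row sets the padded width
    lens = np.array([len(r) for r in rows])
    # Boolean mask marking the valid (non-padded) positions in each row
    mask = np.arange(lens.max()) < lens[:, None]
    # Zero-filled output with an explicit numeric dtype
    out = np.zeros(mask.shape, dtype=dtype)
    # Copy all values into the valid positions, row by row
    out[mask] = np.concatenate(rows)
    return out

# e.g. fillna_from_lists([[1, 2, 3, 4], [2, 3, 1], [5, 5, 5, 5], [1, 1]])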

You could use pandas instead of numpy:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3, 4],
   ...:                    [2, 3, 1],
   ...:                    [5, 5, 5, 5],
   ...:                    [1, 1]], dtype=float)


In [3]: df.fillna(0.0).values
Out[3]: 
array([[ 1.,  2.,  3.,  4.],
       [ 2.,  3.,  1.,  0.],
       [ 5.,  5.,  5.,  5.],
       [ 1.,  1.,  0.,  0.]])
Eastsun

Use np.pad():

In [62]: arr
Out[62]: 
[array([0]),
 array([83, 74]),
 array([87, 61, 23]),
 array([71,  3, 81, 77]),
 array([20, 44, 20, 53, 60]),
 array([54, 36, 74, 35, 49, 54]),
 array([11, 36,  0, 98, 29, 87, 21]),
 array([ 1, 22, 62, 51, 45, 40, 36, 86]),
 array([ 7, 22, 83, 58, 43, 59, 45, 81, 92]),
 array([68, 78, 70, 67, 77, 64, 58, 88, 13, 56])]

In [63]: max_len = np.max([len(a) for a in arr])

In [64]: np.asarray([np.pad(a, (0, max_len - len(a)), 'constant', constant_values=0) for a in arr])
Out[64]: 
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [83, 74,  0,  0,  0,  0,  0,  0,  0,  0],
       [87, 61, 23,  0,  0,  0,  0,  0,  0,  0],
       [71,  3, 81, 77,  0,  0,  0,  0,  0,  0],
       [20, 44, 20, 53, 60,  0,  0,  0,  0,  0],
       [54, 36, 74, 35, 49, 54,  0,  0,  0,  0],
       [11, 36,  0, 98, 29, 87, 21,  0,  0,  0],
       [ 1, 22, 62, 51, 45, 40, 36, 86,  0,  0],
       [ 7, 22, 83, 58, 43, 59, 45, 81, 92,  0],
       [68, 78, 70, 67, 77, 64, 58, 88, 13, 56]])
陈家胜

It would be nice to do this in some vectorized way, but I'm still a NOOB, so this is all I could think of for now!

import numpy as np
import numba as nb

a = np.array([[1, 2, 3, 4],
              [2, 3, 1],
              [5, 5, 5, 5, 5],
              [1, 1]])

@nb.jit()
def f(a):
    # Width of the output: length of the longest row
    l = len(max(a, key=len))
    # Preallocate the output array, one row per input row
    a0 = np.empty(a.shape + (l,))
    # Pad each row on the right with zeros and copy it in
    for n, i in enumerate(a.flat):
        a0[n] = np.pad(i, (0, l - len(i)), mode='constant')
    return a0

print(f(a))
yourstruly
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [2, 3, 1],
                 [5, 5, 5, 5],
                 [1, 1]])

# The longest row determines the padded width
max_len = max([len(i) for i in data])

# Pad every row on the right with zeros up to max_len
np.array([np.pad(data[i],
                 (0, max_len - len(data[i])),
                 'constant',
                 constant_values=0) for i in range(len(data))])

The lengths of the individual rows are computed, and the maximum of these lengths is stored in a variable. Each row of the matrix is then padded with 0s on the right to match that maximum length.

General Grievance