
The functionality I am looking for looks something like this:

import numpy as np

data = np.array([[1, 2, 3, 4],
                 [2, 3, 1],
                 [5, 5, 5, 5],
                 [1, 1]])

result = fix(data)
print(result)

[[ 1.  2.  3.  4.]
 [ 2.  3.  1.  0.]
 [ 5.  5.  5.  5.]
 [ 1.  1.  0.  0.]]

The data arrays I'm working with are really large, so I would really appreciate the most efficient solution.

Edit: The data is read in from disk as a Python list of lists.

user2909415
  • Simply add the data type to the array function call, `np.array(..., dtype=np.float64)`, or use `loadtxt`, `savetxt` from numpy. – nickpapior Aug 16 '15 at 17:33
  • 1
  • @zeroth I have tried that and got `ValueError: setting an array element with a sequence`. Could you explain more? – user2909415 Aug 16 '15 at 17:36
  • 1
  • Is it likely to be a sparse matrix with most entries zero? Can it fit in memory as a dense matrix? – musically_ut Aug 16 '15 at 17:54
  • @musically_ut No it isn't sparse. Often there are only 1-3 elements missing at the ends. – user2909415 Aug 16 '15 at 18:10
  • @user2909415 You should add that information to the question. And while you are at it, do you know the size (both height and width) of the matrix before you read in the file (for preallocation)? If you know at least the width, then perhaps tweaking the file to contain the correct number of entries and using `np.loadtxt` will be the fastest option. – musically_ut Aug 16 '15 at 18:13
  • @musically_ut By width, do you mean the maximum length of a row in the data? Such as in the example it would be 4. – user2909415 Aug 16 '15 at 18:16
  • Yes, I meant the number of columns in the matrix. – musically_ut Aug 16 '15 at 18:21
  • 1
  • This is relevant: http://stackoverflow.com/questions/27890052/convert-and-pad-a-list-to-numpy-array – Alex Riley Aug 16 '15 at 19:05
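
A minimal sketch of the pad-on-read idea discussed in the comments above, assuming the number of columns is known up front and the file is whitespace-delimited; the file name `data.txt` and the helper `load_padded` are hypothetical:

import numpy as np

def load_padded(path, ncols, dtype=float):
    # Read the file line by line, padding short rows with '0'
    # so that every row ends up with exactly `ncols` entries.
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            rows.append(fields + ['0'] * (ncols - len(fields)))
    return np.asarray(rows, dtype=dtype)

# e.g. padded = load_padded('data.txt', ncols=4)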

5 Answers


This could be one approach -

def numpy_fillna(data):
    # Get lengths of each row of data
    lens = np.array([len(i) for i in data])

    # Mask of valid places in each row
    mask = np.arange(lens.max()) < lens[:,None]

    # Setup output array and put elements from data into masked positions
    out = np.zeros(mask.shape, dtype=data.dtype)
    out[mask] = np.concatenate(data)
    return out

Sample input, output -

In [222]: # Input object dtype array
     ...: data = np.array([[1, 2, 3, 4],
     ...:                  [2, 3, 1],
     ...:                  [5, 5, 5, 5, 8 ,9 ,5],
     ...:                  [1, 1]])

In [223]: numpy_fillna(data)
Out[223]: 
array([[1, 2, 3, 4, 0, 0, 0],
       [2, 3, 1, 0, 0, 0, 0],
       [5, 5, 5, 5, 8, 9, 5],
       [1, 1, 0, 0, 0, 0, 0]], dtype=object)
Divakar
  • I think `lens.size` should be `lens.max()` - in your answer these are equal to make a square matrix. But try with a ragged row longer than the number of rows and you will get an error. – Neil Slater Nov 14 '16 at 12:58
  • The accepted answer is almost correct. I assume it was an oversight, but `mask = np.arange(lens.size) < lens[:,None]` should actually be `mask = np.arange(max(lens)) < lens[:,None]`. The accepted answer happens to work for the tested input because `lens.size == max(lens)`; if that's not the case, it no longer works... – Jerry Londergaard Oct 09 '16 at 23:32
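
A follow-up note on the answer above: since the question says the data arrives as a plain Python list of lists, keep in mind that `numpy_fillna` as written reads `data.dtype` and therefore expects a NumPy object array. A minimal sketch of the same masking idea adapted to take a plain list of lists and an explicit output dtype (the helper name `fillna_from_lists` is made up here):

import numpy as np

def fillna_from_lists(rows, dtype=float):
    # Row lengths; the longest row sets the padded width
    lens = np.array([len(r) for r in rows])
    # Boolean mask marking the valid (non-padded) positions in each row
    mask = np.arange(lens.max()) < lens[:, None]
    # Zero-filled output with an explicit numeric dtype
    out = np.zeros(mask.shape, dtype=dtype)
    # Copy all values into the valid positions, row by row
    out[mask] = np.concatenate(rows)
    return out

# e.g. fillna_from_lists([[1, 2, 3, 4], [2, 3, 1], [5, 5, 5, 5], [1, 1]])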

You could use pandas instead of numpy:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3, 4],
   ...:                    [2, 3, 1],
   ...:                    [5, 5, 5, 5],
   ...:                    [1, 1]], dtype=float)


In [3]: df.fillna(0.0).values
Out[3]: 
array([[ 1.,  2.,  3.,  4.],
       [ 2.,  3.,  1.,  0.],
       [ 5.,  5.,  5.,  5.],
       [ 1.,  1.,  0.,  0.]])
Eastsun

Use np.pad():

In [62]: arr
Out[62]: 
[array([0]),
 array([83, 74]),
 array([87, 61, 23]),
 array([71,  3, 81, 77]),
 array([20, 44, 20, 53, 60]),
 array([54, 36, 74, 35, 49, 54]),
 array([11, 36,  0, 98, 29, 87, 21]),
 array([ 1, 22, 62, 51, 45, 40, 36, 86]),
 array([ 7, 22, 83, 58, 43, 59, 45, 81, 92]),
 array([68, 78, 70, 67, 77, 64, 58, 88, 13, 56])]

In [63]: max_len = np.max([len(a) for a in arr])

In [64]: np.asarray([np.pad(a, (0, max_len - len(a)), 'constant', constant_values=0) for a in arr])
Out[64]: 
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [83, 74,  0,  0,  0,  0,  0,  0,  0,  0],
       [87, 61, 23,  0,  0,  0,  0,  0,  0,  0],
       [71,  3, 81, 77,  0,  0,  0,  0,  0,  0],
       [20, 44, 20, 53, 60,  0,  0,  0,  0,  0],
       [54, 36, 74, 35, 49, 54,  0,  0,  0,  0],
       [11, 36,  0, 98, 29, 87, 21,  0,  0,  0],
       [ 1, 22, 62, 51, 45, 40, 36, 86,  0,  0],
       [ 7, 22, 83, 58, 43, 59, 45, 81, 92,  0],
       [68, 78, 70, 67, 77, 64, 58, 88, 13, 56]])
陈家胜

It would be nice to do this in some vectorized way, but I'm still a NOOB, so this is all I could think of for now!

import numpy as np
import numba as nb

a = np.array([[1, 2, 3, 4],
              [2, 3, 1],
              [5, 5, 5, 5, 5],
              [1, 1]])

@nb.jit()
def f(a):
    # Width of the output: length of the longest row
    l = len(max(a, key=len))
    # Preallocate the output array, one row per input row
    a0 = np.empty(a.shape + (l,))
    # Pad each row on the right with zeros and copy it in
    for n, i in enumerate(a.flat):
        a0[n] = np.pad(i, (0, l - len(i)), mode='constant')
    return a0

print(f(a))
yourstruly
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [2, 3, 1],
                 [5, 5, 5, 5],
                 [1, 1]])

# The longest row determines the padded width
max_len = max([len(i) for i in data])

# Pad every row on the right with zeros up to max_len
np.array([np.pad(data[i],
                 (0, max_len - len(data[i])),
                 'constant',
                 constant_values=0) for i in range(len(data))])

The lengths of the individual rows are computed, and the maximum of these lengths is stored in a variable. Each row of the matrix is then padded with 0s on the right to match that maximum length.

General Grievance