I have an array that was created from lists of varying lengths. I do not know the lengths of the lists beforehand, which is why I was using lists instead of arrays.

Here's a reproducible code for the purpose of this question:

import numpy as np

a = []

for i in np.arange(5):
    a += [np.random.rand(np.random.randint(1,6))]

a = np.array(a)

Is there a more efficient way than the following to convert this array into a rectangular array, with all rows the same length and the shorter rows padded with NaNs?

max_len_of_array = 0
for aa in a:
    len_of_array = aa.shape[0]
    if len_of_array > max_len_of_array:
        max_len_of_array = len_of_array
max_len_of_array

n = a.shape[0]

A = np.zeros((n, max_len_of_array)) * np.nan
for i, aa in enumerate(zip(a)):
    A[i][:aa[0].shape[0]] = aa[0]

A
user10853
  • Can you keep track of `max_len_of_array` when you are filling the original list? Otherwise your approach seems reasonable. – nalyd88 Sep 17 '17 at 23:48
  • @nalyd88 yes it is possible but I am creating around 10 such arrays. I guess I could use an array for the `max_len_of_array`. – user10853 Sep 17 '17 at 23:54
  • @DYZ I don't see how this relates to my question. Please clarify if you do. – user10853 Sep 18 '17 at 00:03
  • [Here](https://stackoverflow.com/questions/38619143/convert-python-sequence-to-numpy-array-filling-missing-values)'s a related question. – ayhan Sep 18 '17 at 00:32
  • Just as a warning, around here we often use `structured array` for an array with a compound dtype. What you want is a `nan`-padded rectangular or regular numeric (float) array. Without the padding, `np.array(yourlist)` would produce a 1d object dtype array (an `irregular` or ragged array). – hpaulj Sep 18 '17 at 03:21
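The distinction hpaulj draws can be sketched quickly (note that recent NumPy versions require `dtype=object` to be spelled out for ragged input, rather than inferring it):

```python
import numpy as np

# Ragged input: rows of different lengths.
rows = [np.arange(3.0), np.arange(2.0)]

# Building an array from ragged rows yields a 1-d object array,
# not a 2-d numeric array.
obj = np.array(rows, dtype=object)
print(obj.shape, obj.dtype)  # (2,) object
```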

2 Answers


Here is a slightly faster version of your code:

def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

The for-loop is unavoidable. Given that `a` holds arrays of varying length, there is no getting around iterating through its items. Sometimes the loop can be hidden (behind calls to `max` and `map`, for instance), but speed-wise such hidden loops are essentially equivalent to explicit Python loops.
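For concreteness, here is `alt` applied to a small ragged input (the example rows are made up for illustration):

```python
import numpy as np

def alt(a):
    # NaN-fill a rectangular array, then copy each row into place.
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

rows = [np.array([1.0, 2.0, 3.0]), np.array([4.0])]
A = alt(rows)
print(A.shape)  # (2, 3)
print(A)        # second row is [4., nan, nan]
```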


Here is a benchmark using an `a` with resultant shape (100, 100):

In [197]: %timeit orig(a)
10000 loops, best of 3: 125 µs per loop

In [198]: %timeit alt(a)
10000 loops, best of 3: 84.1 µs per loop

In [199]: %timeit using_pandas(a)
100 loops, best of 3: 4.8 ms per loop

This was the setup used for the benchmark:

import numpy as np
import pandas as pd

def make_array(h, w):
    a = []
    for i in np.arange(h):
        a += [np.random.rand(np.random.randint(1,w+1))]
    a = np.array(a)
    return a

def orig(a):
    max_len_of_array = 0

    for aa in a:
        len_of_array = aa.shape[0]
        if len_of_array > max_len_of_array:
            max_len_of_array = len_of_array

    n = a.shape[0]

    A = np.zeros((n, max_len_of_array)) * np.nan
    for i, aa in enumerate(zip(a)):
        A[i][:aa[0].shape[0]] = aa[0]

    return A

def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

def using_pandas(a):
    return pd.DataFrame.from_records(a).values

a = make_array(100,100)
unutbu

I suppose you can use pandas as a one-time solution, but it's going to be very inefficient, like everything pandas:

pd.DataFrame(a)[0].apply(pd.Series).values
#array([[ 0.28669545,  0.22080038,  0.32727194],
#       [ 0.17892276,         nan,         nan],
#       [ 0.26853548,         nan,         nan],
#       [ 0.86460043,  0.78827094,  0.96660502],
#       [ 0.41045599,         nan,         nan]])
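For comparison, the question ayhan linked suggests a padding approach built on `itertools.zip_longest`; a sketch (not benchmarked here):

```python
import numpy as np
from itertools import zip_longest

rows = [np.random.rand(np.random.randint(1, 6)) for _ in range(5)]

# zip_longest transposes the rows, padding short ones with NaN;
# transpose back to get one row per original list.
A = np.array(list(zip_longest(*rows, fillvalue=np.nan))).T
print(A.shape)  # (5, length of the longest row)
```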
DYZ
  • That seems to be another possible solution but as you indicate it is not efficient, at least no more efficient than the loop. 870 microseconds for pandas versus 7.1 microseconds for the loop. – user10853 Sep 18 '17 at 00:04