I have a list of numpy arrays - for example:

Let's call this LIST_A:

[array([  0.        , -11.35190205,  11.35190205,   0.        ]),
 array([  0.        ,  36.58012599, -36.58012599,   0.        ]),
 array([  0.        , -41.94408202,  41.94408202,   0.        ])]

I also have a list of lists holding the index labels for each of the numpy arrays above:

Let's call this LIST_B:

[['A_A', 'A_B', 'B_A', 'B_B'],
 ['A_A', 'A_D', 'D_A', 'D_D'],
 ['B_B', 'B_C', 'C_B', 'C_C']]

I want to create a pandas DataFrame from these objects, but I'm not sure how to do this without first creating a Series for each numpy array in LIST_A with its associated index from LIST_B (i.e. LIST_A[0]'s index is LIST_B[0], and so on) and then calling pd.concat([s1, s2, s3, ...], axis=1) to get the desired DataFrame.

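In the above case I can construct the desired DataFrame as follows (with LIST_A and LIST_B stored in the variables list_a and list_b):

import pandas as pd

# One Series per array, indexed by the matching list of labels
s1 = pd.Series(list_a[0], index=list_b[0])
s2 = pd.Series(list_a[1], index=list_b[1])
s3 = pd.Series(list_a[2], index=list_b[2])
# Align the Series on their labels, one column per Series
df = pd.concat([s1, s2, s3], axis=1)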

            0          1          2
A_A   0.000000   0.000000        NaN
A_B -11.351902        NaN        NaN
A_D        NaN  36.580126        NaN
B_A  11.351902        NaN        NaN
B_B   0.000000        NaN   0.000000
B_C        NaN        NaN -41.944082
C_B        NaN        NaN  41.944082
C_C        NaN        NaN   0.000000
D_A        NaN -36.580126        NaN
D_D        NaN   0.000000        NaN

In my actual application the lists contain hundreds of entries, so I don't want to create hundreds of Series objects by hand and then concatenate them all (unless this is the only way to do it?).

I've read through various posts on SO, such as "Adding list with different length as a new column to a dataframe" and "convert pandas series AND dataframe objects to a numpy array", but I haven't been able to find an elegant solution for a case where hundreds of Series objects would otherwise need to be created to produce the desired DataFrame.

codingknob

1 Answer

This is not very different from your approach, but it should be considerably faster:

df = pd.DataFrame([dict(zip(labels, values)) for labels, values in zip(list_b, list_a)]).T

Output:

             0          1          2
A_A   0.000000   0.000000        NaN
A_B -11.351902        NaN        NaN
A_D        NaN  36.580126        NaN
B_A  11.351902        NaN        NaN
B_B   0.000000        NaN   0.000000
B_C        NaN        NaN -41.944082
C_B        NaN        NaN  41.944082
C_C        NaN        NaN   0.000000
D_A        NaN -36.580126        NaN
D_D        NaN   0.000000        NaN
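
For anyone who wants to run this end to end, here is a minimal self-contained sketch using the sample data from the question. The names `rows`, `df_dict`, `series` and `df_concat` are only illustrative, and the `sort_index()` calls are an addition (not part of the one-liner above) so that the row order does not depend on the pandas version. It also includes the Series-based approach from the question written as a comprehension, which avoids naming hundreds of Series by hand.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
list_a = [
    np.array([0.0, -11.35190205, 11.35190205, 0.0]),
    np.array([0.0, 36.58012599, -36.58012599, 0.0]),
    np.array([0.0, -41.94408202, 41.94408202, 0.0]),
]
list_b = [
    ["A_A", "A_B", "B_A", "B_B"],
    ["A_A", "A_D", "D_A", "D_D"],
    ["B_B", "B_C", "C_B", "C_C"],
]

# Dict-based approach: one {label: value} dict per array.
# Each dict becomes a row; missing labels are filled with NaN.
rows = [dict(zip(labels, values)) for labels, values in zip(list_b, list_a)]

# Transpose so the labels form the index and each array is a column.
df_dict = pd.DataFrame(rows).T.sort_index()

# Series-based approach from the question, generalised with a comprehension.
series = [pd.Series(values, index=labels) for values, labels in zip(list_a, list_b)]
df_concat = pd.concat(series, axis=1).sort_index()

print(df_dict)
```

Both frames should hold the same values for this data. One difference to keep in mind: `dict(zip(...))` keeps only the last value if a label is repeated within a single list, so the dict-based version assumes the labels within each list are unique.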
fsl