How to extract values from 3-d numpy into pandas dataframe with 3 columns

Question

I have data in the following format:

import numpy as np
import pandas as pd
arr = np.random.randint(0, 100, (10, 3, 5))

I want to turn this into a pandas dataframe with 3 columns, and have 10*5 (50) rows.

I tried doing:

df= pd.DataFrame(arr.tolist(), columns=['A','B','C'])

but I also want the values in each array extracted into separate rows.

How can I achieve this?

Edit: I guess I could apply the approach from "Extracting an element of a list in a pandas column" to each index and then concatenate the results, but I am looking for a more efficient solution, because I want to do this for all indexes in the arrays.

Edit 2: this is what I want to do:

df.apply(lambda col: col.str[0])

, but for each of the 5 indexes in the array, and more efficiently, because my actual data is much larger, e.g. (10, 3, 50).

Show your not-so-efficient attempt. A dataframe is a 2d structure, so it can take a (50,3) array as input. But getting there from a (10,3,5) array takes more than a reshape. It first has to be transposed to (10,5,3) or (5,10,3) (assuming you want to "preserve" the size 3 dimension). — hpaulj, Mar 08 '23 at 17:18
*actual data is much larger (10,3, 50)* - so your array is of shape `(10, 3, 50)` ? — RomanPerekhrest, Mar 08 '23 at 17:25
@RomanPerekhrest yes, I just wanted to give a smaller example of what the data looks like so that it is easier to work with; I didn't think the size would make a difference. I found a solution that handles each index individually, but I'm looking for a faster way to do this. — prof32, Mar 08 '23 at 17:34

hpaulj · Accepted Answer · 2023-03-08T18:27:17.980

It's best if you show results, and explain what's right or wrong. Don't expect us to "run the code" (in our heads or on a computer). Anyways, the first attempt:

In [18]: arr=np.random.randint(0, 5*3*4, (5, 3, 4))  
In [19]: df= pd.DataFrame(arr.tolist(), columns=['A','B','C'])
In [20]: df
Out[20]: 
                  A                 B                 C
0    [7, 5, 26, 14]  [47, 46, 28, 45]  [59, 19, 26, 46]
1   [8, 40, 52, 12]  [37, 15, 52, 38]  [38, 42, 19, 39]
2   [8, 51, 39, 53]  [30, 53, 46, 34]  [51, 30, 24, 16]
3  [21, 20, 37, 38]   [39, 4, 37, 38]  [51, 39, 39, 16]
4  [15, 11, 46, 46]   [42, 56, 16, 5]    [7, 9, 52, 26]

That's a (5,3) frame, with a 4-element list in each cell. tolist made a 3-level nested list, and pandas used only the outer two levels as rows and columns.
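
To make the nested-list behaviour easy to check, here is a small sketch. It swaps the random array for a deterministic `np.arange` array (an assumption made only so the values are predictable) and inspects the frame shape and one cell:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random (5, 3, 4) array (illustrative assumption).
arr = np.arange(5 * 3 * 4).reshape(5, 3, 4)

# tolist() gives 5 lists, each holding 3 lists of 4 ints.
nested = arr.tolist()
df = pd.DataFrame(nested, columns=['A', 'B', 'C'])

frame_shape = df.shape        # rows and columns come from the outer two levels
first_cell = df.loc[0, 'A']   # the innermost level stays as a Python list
print(frame_shape, type(first_cell), first_cell)
```

Each cell holds a whole list, so the size-4 dimension never becomes rows.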

Changing the array into (20,3), with the '3' as the last dimension:

In [21]: arr1 = arr.transpose(0,2,1).reshape(20,3); arr1
Out[21]: 
array([[ 7, 47, 59],
       [ 5, 46, 19],
       [26, 28, 26],
       [14, 45, 46],
   ...
       [46, 16, 52],
       [46,  5, 26]])

In [22]: df= pd.DataFrame(arr1, columns=['A','B','C'])

In [23]: df
Out[23]: 
     A   B   C
0    7  47  59
1    5  46  19
2   26  28  26
3   14  45  46
4    8  37  38
5   40  15  42
6   52  52  19
7   12  38  39
8    8  30  51
...
18  46  16  52
19  46   5  26
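
The same transpose-then-reshape step should carry over to the (10, 3, 5) array from the question. The sketch below wraps it in a helper; the name `stack_to_frame` is hypothetical, and the `np.arange` input is an assumption chosen so the result is predictable:

```python
import numpy as np
import pandas as pd

def stack_to_frame(arr, columns):
    """Turn an (n, k, m) array into an (n*m, k) frame, one column per middle index."""
    n, k, m = arr.shape
    # Move the size-k axis last, then merge the first two axes into rows.
    flat = arr.transpose(0, 2, 1).reshape(n * m, k)
    return pd.DataFrame(flat, columns=columns)

# Deterministic stand-in for the question's random array (illustrative assumption).
arr = np.arange(10 * 3 * 5).reshape(10, 3, 5)
df = stack_to_frame(arr, ['A', 'B', 'C'])

print(df.shape)
print(df.head())
```

Row `i*m + j` of the frame holds `arr[i, :, j]`, so the rows are grouped by the first axis. Using `transpose(2, 0, 1)` instead would group them by the last axis.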

I'm not as good at pandas as numpy, but here's a way of assigning columns to a "blank" frame:

In [24]: df = pd.DataFrame(columns=['A','B','C'],dtype=int)

In [25]: df
Out[25]: 
Empty DataFrame
Columns: [A, B, C]
Index: []

In [26]: df['A']=arr[:,0,:].ravel()   # assign a (20,) array
In [27]: df['B']=arr[:,1,:].ravel()   
In [28]: df['C']=arr[:,2,:].ravel()

In [29]: df
Out[29]: 
     A   B   C
0    7  47  59
1    5  46  19
2   26  28  26
3   14  45  46
....

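df starts with 0 rows, but it takes on the full length as soon as the first column is assigned; columns that have not been assigned yet hold NaN. For example, after assigning only 'A':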

In [32]: df
Out[32]: 
     A   B   C
0    7 NaN NaN
1    5 NaN NaN
2   26 NaN NaN
3   14 NaN NaN
4    8 NaN NaN
....

So as long as the number of rows is substantially larger than the number of columns, the one-by-one column assignment should be reasonably fast.
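
As a rough cross-check, the per-column approach can also be written as a dict of raveled slices and compared against the transpose/reshape result. This is only a sketch: the `np.arange` input and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random (5, 3, 4) array (illustrative assumption).
arr = np.arange(5 * 3 * 4).reshape(5, 3, 4)
names = ['A', 'B', 'C']

# One flattened (20,) slice per column, keyed by column name.
df_cols = pd.DataFrame({name: arr[:, i, :].ravel() for i, name in enumerate(names)})

# Transpose/reshape version for comparison.
df_reshaped = pd.DataFrame(arr.transpose(0, 2, 1).reshape(-1, 3), columns=names)

same = np.array_equal(df_cols.to_numpy(), df_reshaped.to_numpy())
print(same)
```

Both routes should give the same row order, since `arr[:, i, :].ravel()` walks the first axis and then the last, which matches the `transpose(0, 2, 1)` layout.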
