I have data in the following format:

import numpy as np
import pandas as pd
arr = np.random.randint(0, 100, (10, 3, 5))

I want to turn this into a pandas dataframe with 3 columns, and have 10*5 (50) rows.

I tried doing:

df= pd.DataFrame(arr.tolist(), columns=['A','B','C'])

but that leaves a list in every cell. I also want to spread the values of each inner array across separate rows.

How can I achieve this?

Edit: I guess I could apply the approach from "Extracting an element of a list in a pandas column" for each index and then concatenate the results, but I am looking for a more efficient solution, because I want to do this for all indexes in the arrays.

Edit 2: this is what I want to do:

df.apply(lambda col: col.str[0])

but for each of the 5 indexes in the inner arrays, and more efficiently, because my actual data is much larger, e.g. (10, 3, 50).

prof32
  • Try to reshape your array to (50,3) – Mouad Slimane Mar 08 '23 at 17:01
  • Show your not-so-efficient attempt. A dataframe is a 2d structure, so it can take a (50,3) array as input. But getting there from a (10,3,5) takes more than a reshape. It first has to be transposed to a (10,5,3) or (5,10,3) (assuming that you want to "preserve" the size 3 dimension). – hpaulj Mar 08 '23 at 17:18
  • *actual data is much larger (10,3, 50)* - so your array is of shape `(10, 3, 50)` ? – RomanPerekhrest Mar 08 '23 at 17:25
  • @RomanPerekhrest yes, I just wanted to give an example of what the data looks like, so that it is easier to work with; I didn't think it would make a difference. I found a solution that handles each index individually, but I'm looking for a faster way to do this. – prof32 Mar 08 '23 at 17:34

1 Answer

It's best if you show results, and explain what's right or wrong. Don't expect us to "run the code" (in our heads or computer). Anyways, the first attempt:

In [18]: arr=np.random.randint(0, 5*3*4, (5, 3, 4))  
In [19]: df= pd.DataFrame(arr.tolist(), columns=['A','B','C'])
In [20]: df
Out[20]: 
                  A                 B                 C
0    [7, 5, 26, 14]  [47, 46, 28, 45]  [59, 19, 26, 46]
1   [8, 40, 52, 12]  [37, 15, 52, 38]  [38, 42, 19, 39]
2   [8, 51, 39, 53]  [30, 53, 46, 34]  [51, 30, 24, 16]
3  [21, 20, 37, 38]   [39, 4, 37, 38]  [51, 39, 39, 16]
4  [15, 11, 46, 46]   [42, 56, 16, 5]    [7, 9, 52, 26]

That's a (5,3) frame, with a 4-element list in each cell. tolist made a 3-level nested list, and pandas only unpacked the outer two levels.

Changing the array into (20,3), with the '3' as the last dimension:

In [21]: arr1 = arr.transpose(0,2,1).reshape(20,3); arr1
Out[21]: 
array([[ 7, 47, 59],
       [ 5, 46, 19],
       [26, 28, 26],
       [14, 45, 46],
   ...
       [46, 16, 52],
       [46,  5, 26]])

In [22]: df= pd.DataFrame(arr1, columns=['A','B','C'])

In [23]: df
Out[23]: 
     A   B   C
0    7  47  59
1    5  46  19
2   26  28  26
3   14  45  46
4    8  37  38
5   40  15  42
6   52  52  19
7   12  38  39
8    8  30  51
...
18  46  16  52
19  46   5  26
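The reshape above hard-codes the row count for the (5,3,4) test array. Here is a minimal sketch of the same idea for the (10, 3, 5) shape from the question, assuming the size-3 axis should become the columns. The names `n_cols` and `flat` are introduced here only for illustration:

```python
import numpy as np
import pandas as pd

# Example data with the question's shape: (10, 3, 5).
arr = np.random.randint(0, 100, (10, 3, 5))

# Move the size-3 axis to the end, giving shape (10, 5, 3),
# then merge the first two axes. Passing -1 lets numpy work out
# the row count (10 * 5 = 50) instead of hard-coding it.
n_cols = arr.shape[1]
flat = arr.transpose(0, 2, 1).reshape(-1, n_cols)

df = pd.DataFrame(flat, columns=['A', 'B', 'C'])
print(df.shape)  # (50, 3)
```

With this layout, row `i*5 + k` of the frame holds `arr[i, :, k]`, so the same lines should work unchanged for a (10, 3, 50) array.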

I'm not as good at pandas as numpy, but here's a way of assigning columns to a "blank" frame:

In [24]: df = pd.DataFrame(columns=['A','B','C'],dtype=int)

In [25]: df
Out[25]: 
Empty DataFrame
Columns: [A, B, C]
Index: []

In [26]: df['A']=arr[:,0,:].ravel()   # assign a (20,) array
In [27]: df['B']=arr[:,1,:].ravel()   
In [28]: df['C']=arr[:,2,:].ravel()

In [29]: df
Out[29]: 
     A   B   C
0    7  47  59
1    5  46  19
2   26  28  26
3   14  45  46
....

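While df starts with 0 rows, it takes on the full length as soon as the first column is assigned. This is what it looks like after assigning only 'A' (the other columns are filled with NaN until they are assigned):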

In [32]: df
Out[32]: 
     A   B   C
0    7 NaN NaN
1    5 NaN NaN
2   26 NaN NaN
3   14 NaN NaN
4    8 NaN NaN
....

So as long as the number of rows is substantially larger than the number of columns, the one-by-one column assignment should be reasonably fast.
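As a quick sanity check, the two approaches can be compared directly. This is only a sketch on random data with the question's shape; `df_reshape` and `df_columns` are illustrative names, and the column-wise frame is built from a dict rather than by assigning into an empty frame:

```python
import numpy as np
import pandas as pd

arr = np.random.randint(0, 100, (10, 3, 5))
names = ['A', 'B', 'C']

# Approach 1: transpose the size-3 axis to the end, then reshape to 2d.
df_reshape = pd.DataFrame(arr.transpose(0, 2, 1).reshape(-1, 3), columns=names)

# Approach 2: build each column from a flattened slice of the middle axis.
df_columns = pd.DataFrame({name: arr[:, j, :].ravel() for j, name in enumerate(names)})

# Both frames should hold the same values in the same order.
same = np.array_equal(df_reshape.to_numpy(), df_columns.to_numpy())
print(same)
```

No timings are given here; which approach is faster on the real data is something to measure, for example with `%timeit` in IPython.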

hpaulj