
I have a pandas dataframe of shape (75,9).

Only one of those columns is of numpy arrays, each of which is of shape (100, 4, 3)

I have a strange phenomenon:

data = self.df[self.column_name].values[0]

is of shape (100,4,3), but

data = self.df[self.column_name].values

is of shape (75,), and its min and max are 'not a numeric object'.

I expected data = self.df[self.column_name].values to be of shape (75, 100, 4, 3), with some min and max.

How can I make a column of numpy arrays behave like a numpy array of a higher dimension (with length=number of rows in the dataframe)?


Reproducing:

    import numpy as np
    import pandas as pd

    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6)]
    print(some_df['A'].values.shape)
    print(some_df['A'].values[0].shape)

prints (10,) and (4, 6) instead of the desired (10, 4, 6) and (4, 6).

Gulzar

2 Answers


What you're asking for is not quite possible. Pandas DataFrames are 2D. Yes, you can store NumPy arrays as objects (references) inside DataFrame cells, but this is not really well supported, and expecting to get a shape which has one dimension from the DataFrame and two from the arrays inside is not possible at all.

You should consider storing your data either entirely in NumPy arrays of the appropriate shape, or in a single, properly 2D DataFrame with a MultiIndex. For example, you can "pivot" a column of 1D arrays into a column of scalars by moving the extra dimension to a new level of a MultiIndex on the rows:

  A
x [2, 3]
y [5, 6]

becomes:

    A
x 0 2
  1 3
y 0 5
  1 6

or pivot to the columns:

  A
  0 1
x 2 3
y 5 6
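
A minimal sketch of one way to produce both layouts, assuming a column A that holds equal-length 1-D arrays (the toy frame below just mirrors the tables above):

    import numpy as np
    import pandas as pd

    # Toy frame matching the tables above: one object column of 1-D arrays.
    df = pd.DataFrame({'A': [np.array([2, 3]), np.array([5, 6])]}, index=['x', 'y'])

    # Pivot the array dimension onto the columns: one scalar per cell.
    wide_form = pd.DataFrame(np.stack(df['A'].to_numpy()), index=df.index)
    #    0  1
    # x  2  3
    # y  5  6

    # Pivot the array dimension onto the rows: MultiIndex of (row, position).
    long_form = wide_form.stack()
    # x  0    2
    #    1    3
    # y  0    5
    #    1    6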
John Zwinck
  • Now I have time to make this right. What is the code that pivots in each direction? – Gulzar Jun 23 '19 at 12:33
  • `DataFrame.stack()`, after you break the lists into separate columns (see https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns for that). – John Zwinck Jun 23 '19 at 14:37
In [42]: some_df = pd.DataFrame(columns=['A']) 
    ...: for i in range(4): 
    ...:         some_df.loc[i] = [np.random.randint(0,10,(1,3))] 
    ...:                                                                                  
In [43]: some_df                                                                          
Out[43]: 
             A
0  [[7, 0, 9]]
1  [[3, 6, 8]]
2  [[9, 7, 6]]
3  [[1, 6, 3]]

The NumPy values of the column form an object-dtype array whose elements are themselves arrays:

In [44]: some_df['A'].to_numpy()                                                          
Out[44]: 
array([array([[7, 0, 9]]), array([[3, 6, 8]]), array([[9, 7, 6]]),
       array([[1, 6, 3]])], dtype=object)

If those arrays all have the same shape, np.stack does a nice job of joining them along a new leading dimension:

In [45]: np.stack(some_df['A'].to_numpy())                                                
Out[45]: 
array([[[7, 0, 9]],

       [[3, 6, 8]],

       [[9, 7, 6]],

       [[1, 6, 3]]])
In [46]: _.shape                                                                          
Out[46]: (4, 1, 3)

This only works one column at a time. stack, like all the functions in the concatenate family, treats its input argument as an iterable, effectively a list of arrays:

In [48]: some_df['A'].to_list()                                                           
Out[48]: 
[array([[7, 0, 9]]),
 array([[3, 6, 8]]),
 array([[9, 7, 6]]),
 array([[1, 6, 3]])]
In [50]: np.stack(some_df['A'].to_list()).shape                                           
Out[50]: (4, 1, 3)
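
Applied to the reproduction code from the question, the same idea yields the desired leading dimension (a sketch, assuming every cell holds an array of the same (4, 6) shape):

    import numpy as np
    import pandas as pd

    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6)]

    # Stack the per-row arrays along a new leading axis.
    stacked = np.stack(some_df['A'].to_numpy())
    print(stacked.shape)                 # (10, 4, 6)
    print(stacked.min(), stacked.max())  # ordinary numeric min and max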
hpaulj
  • After over a year, we meet again. I remember this method giving me many headaches, and I wonder if this is the wrong way to go. Is there a standard way of handling tabular data that consists of long lists of multi-dimensional arrays [each with its own title, and the same shape]? – Gulzar Oct 26 '20 at 16:01