
I have a pandas dataframe of shape (75,9).

Only one of those columns is of numpy arrays, each of which is of shape (100, 4, 3)

I have a strange phenomenon:

data = self.df[self.column_name].values[0]

is of shape (100,4,3), but

data = self.df[self.column_name].values

is of shape (75,), and its min and max are 'not a numeric object'.

I expected data = self.df[self.column_name].values to be of shape (75, 100, 4, 3), with some min and max.

How can I make a column of numpy arrays behave like a numpy array of a higher dimension (with length=number of rows in the dataframe)?


Reproducing:

    import numpy as np
    import pandas as pd

    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6)]
    print(some_df['A'].values.shape)
    print(some_df['A'].values[0].shape)

prints (10,) and (4, 6) instead of the desired (10, 4, 6) and (4, 6).

Gulzar

2 Answers


What you're asking for is not quite possible. Pandas DataFrames are 2D. Yes, you can store NumPy arrays as objects (references) inside DataFrame cells, but this is not really well supported, and expecting to get a shape which has one dimension from the DataFrame and two from the arrays inside is not possible at all.

You should consider storing your data either entirely in NumPy arrays of the appropriate shape, or in a single, properly 2D DataFrame with a MultiIndex. For example, you can "pivot" a column of 1D arrays into a column of scalars by moving the extra dimension to a new level of a MultiIndex on the rows:

  A
x [2, 3]
y [5, 6]

becomes:

    A
x 0 2
  1 3
y 0 5
  1 6

or pivot to the columns:

  A
  0 1
x 2 3
y 5 6
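
A minimal sketch of one way to produce both layouts, assuming a column A that holds equal-length 1-D arrays (the toy frame below just mirrors the tables above):

    import numpy as np
    import pandas as pd

    # Toy frame matching the tables above: one object column of 1-D arrays.
    df = pd.DataFrame({'A': [np.array([2, 3]), np.array([5, 6])]}, index=['x', 'y'])

    # Pivot the array dimension onto the columns: one scalar per cell.
    wide_form = pd.DataFrame(np.stack(df['A'].to_numpy()), index=df.index)
    #    0  1
    # x  2  3
    # y  5  6

    # Pivot the array dimension onto the rows: MultiIndex of (row, position).
    long_form = wide_form.stack()
    # x  0    2
    #    1    3
    # y  0    5
    #    1    6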
John Zwinck
  • Now I have time to make this right. What is the code that pivots in each direction? – Gulzar Jun 23 '19 at 12:33
  • `DataFrame.stack()`, after you break the lists into separate columns (see https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns for that). – John Zwinck Jun 23 '19 at 14:37
In [42]: some_df = pd.DataFrame(columns=['A']) 
    ...: for i in range(4): 
    ...:         some_df.loc[i] = [np.random.randint(0,10,(1,3))] 
    ...:                                                                                  
In [43]: some_df                                                                          
Out[43]: 
             A
0  [[7, 0, 9]]
1  [[3, 6, 8]]
2  [[9, 7, 6]]
3  [[1, 6, 3]]

The NumPy values of the column form an object-dtype array whose elements are themselves arrays:

In [44]: some_df['A'].to_numpy()                                                          
Out[44]: 
array([array([[7, 0, 9]]), array([[3, 6, 8]]), array([[9, 7, 6]]),
       array([[1, 6, 3]])], dtype=object)

If those arrays all have the same shape, np.stack does a nice job of joining them along a new leading dimension:

In [45]: np.stack(some_df['A'].to_numpy())                                                
Out[45]: 
array([[[7, 0, 9]],

       [[3, 6, 8]],

       [[9, 7, 6]],

       [[1, 6, 3]]])
In [46]: _.shape                                                                          
Out[46]: (4, 1, 3)

This only works one column at a time. stack, like all the functions in the concatenate family, treats its input argument as an iterable, effectively a list of arrays:

In [48]: some_df['A'].to_list()                                                           
Out[48]: 
[array([[7, 0, 9]]),
 array([[3, 6, 8]]),
 array([[9, 7, 6]]),
 array([[1, 6, 3]])]
In [50]: np.stack(some_df['A'].to_list()).shape                                           
Out[50]: (4, 1, 3)
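
Applied to the reproduction code from the question, the same idea yields the desired leading dimension (a sketch, assuming every cell holds an array of the same (4, 6) shape):

    import numpy as np
    import pandas as pd

    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6)]

    # Stack the per-row arrays along a new leading axis.
    stacked = np.stack(some_df['A'].to_numpy())
    print(stacked.shape)                 # (10, 4, 6)
    print(stacked.min(), stacked.max())  # ordinary numeric min and max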
hpaulj
  • After over a year, we meet again. I remember this method giving me many headaches, and I wonder if this is the wrong way to go. Is there a standard way of handling tabular data that consists of long lists of multi-dimensional arrays [each with its own title, and the same shape]? – Gulzar Oct 26 '20 at 16:01